A few things to note:
- Google QPS is closer to 100k than 320k [1]
- Not every query has to run on an LLM; probably only 10% would benefit from it
- This means roughly 10,000 LLM queries per second. Assuming ~1 second of GPU time per query and 5 A100s per in-flight query, 50,000 A100s are sufficient. At ~$10k per A100 that is $500MM; quadruple it to $2B to cover CPU/RAM/storage/network (back-of-envelope sketched below). That is peanuts for Google.
- Let's say AI unlocks new, never-before-seen queries ('do more people live in Madrid or Tel Aviv?') and those are 10x in volume. So capex is now $20B, still peanuts.
- Latency, not cost, is the bigger issue. That should be addressed soon by the H100 and newer chips.
- The main issue for the consumer remains the business model: who pays for LLMs (the user or the advertiser), and will Google and Microsoft stuff ads into LLM responses?
[1] https://www.internetlivestats.com/one-second/#google-band
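A quick back-of-envelope for the numbers above, in Python. The ~1 second of GPU time per query, 5 A100s per in-flight query, ~$10k per A100, and 4x overhead multiplier are assumptions taken from this comment, not measured figures:

```python
# Back-of-envelope capex estimate for serving LLM-assisted search queries.
# All inputs are assumptions, not measurements.

google_qps = 100_000         # total Google queries per second [1]
llm_fraction = 0.10          # share of queries that benefit from an LLM
seconds_per_query = 1.0      # assumed GPU-seconds of work per LLM query
gpus_per_query = 5           # assumed A100s needed per in-flight query
a100_price = 10_000          # assumed $ per A100
overhead_multiplier = 4      # CPU/RAM/storage/network on top of GPU cost

llm_qps = google_qps * llm_fraction                          # 10,000 QPS
gpus_needed = llm_qps * seconds_per_query * gpus_per_query   # 50,000 A100s
gpu_capex = gpus_needed * a100_price                         # $500MM
total_capex = gpu_capex * overhead_multiplier                # $2B

print(f"A100s needed: {gpus_needed:,.0f}")
print(f"GPU capex:    ${gpu_capex / 1e9:.1f}B")
print(f"Total capex:  ${total_capex / 1e9:.1f}B")
print(f"With 10x new query volume: ${total_capex * 10 / 1e9:.0f}B")
```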
I would expect dedicated hardware acceleration for LLMs if they become very relevant. That would improve both latency and cost.
I use ChatGPT throughout the day as a widely knowledgeable coworker that periodically eats a bag of mushrooms, and have paid the $20 / month. I’d note this description matches all the most useful coworkers I’ve had, except that they cost a ton more per month.
Before spending $$$ on hardware accelerators, there is still a lot to be gained from software optimizations for inference and training workloads. We optimized GPT-J, and others, for real: https://centml.ai/benchmarks/
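As one illustration of the kind of software-side optimization meant here (a generic, off-the-shelf technique, not the specific optimizations behind the linked benchmarks), dynamic int8 quantization in stock PyTorch shrinks the linear layers of a GPT-J-style model for CPU inference without any new hardware:

```python
# Minimal sketch: dynamic int8 quantization of a causal LM with stock PyTorch.
# Illustrative only; GPT-J itself ("EleutherAI/gpt-j-6B") uses the same
# nn.Linear layers but needs ~24 GB of RAM in fp32, so a small sibling model
# is used here for a quick local test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Replace fp32 nn.Linear weights with int8 versions; activations are
# quantized on the fly at inference time. CPU-only, no retraining required.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Serving LLM queries cheaply requires", return_tensors="pt")
with torch.no_grad():
    out = quantized.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```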
The NVIDIA A40 GPUs mentioned on the page you've linked cost $4,000 (Amazon) to $10,000 (Dell) and deliver similar performance to an Intel Arc A770, which costs $350. That's at least an order of magnitude difference in cost efficiency.
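Back-of-envelope on that ratio, taking the "similar performance" claim at face value (i.e. assuming roughly equal throughput per card, which is not independently measured here):

```python
# Price-per-performance comparison, assuming ~equal throughput per card.
a40_price_low, a40_price_high = 4_000, 10_000   # NVIDIA A40: Amazon vs Dell
a770_price = 350                                # Intel Arc A770

print(f"A40 costs {a40_price_low / a770_price:.0f}x to "
      f"{a40_price_high / a770_price:.0f}x more per unit of throughput")
# -> roughly 11x to 29x, i.e. an order of magnitude or more
```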