A few things to note:

- Google QPS is closer to 100k than 320k [1]

- Not every query has to run on an LLM; probably only ~10% would benefit from it

- This means ~10,000 LLM queries per second. If each query ties up 5 A100s for about a second, 50,000 A100s are sufficient. At roughly $10k per card that is ~$500M; quadruple it to ~$2B with CPU/RAM/storage/network. That is peanuts for Google.

- Let's say AI unlocks new, never-before-seen queries ('do more people live in Madrid or Tel Aviv') and those are 10x in volume. Capex is now ~$20B, still peanuts (rough arithmetic sketched below the list).

- Latency, not cost, is the bigger issue. This should be addressed soon by the H100 and newer chips.

- The main issue for the consumer remains the business model. Who will pay for LLMs (the user or the advertiser), and will Google and Microsoft stuff ads into LLM responses?

[1] https://www.internetlivestats.com/one-second/#google-band
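A minimal back-of-envelope sketch of the arithmetic above. The GPU time per query and the ~$10k A100 price are my assumptions, not figures from the thread:

    # Rough capex estimate; all inputs are assumptions.
    llm_qps = 10_000            # ~10% of ~100k Google QPS routed to an LLM
    gpus_per_query = 5          # A100s tied up by one query
    seconds_per_query = 1.0     # assumed GPU time per query
    a100_price = 10_000         # assumed price per A100, USD

    gpus_needed = llm_qps * gpus_per_query * seconds_per_query   # 50,000
    gpu_capex = gpus_needed * a100_price                         # ~$500M
    total_capex = gpu_capex * 4          # x4 for CPU/RAM/storage/network -> ~$2B
    ten_x_scenario = total_capex * 10    # 10x volume of brand-new queries -> ~$20B

    print(f"GPUs: {gpus_needed:,.0f}")
    print(f"GPU capex: ${gpu_capex / 1e9:.1f}B, total: ${total_capex / 1e9:.1f}B, "
          f"10x scenario: ${ten_x_scenario / 1e9:.0f}B")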

I would expect dedicated hardware acceleration for LLMs if they become important enough. That would improve both latency and cost.

I use ChatGPT throughout the day as a widely knowledgeable coworker that periodically eats a bag of mushrooms, and have paid the $20 / month. I’d note this description matches all the most useful coworkers I’ve had, except that they cost a ton more per month.

Before spending $$$ on hardware accelerators, there is still a lot to gain from software optimizations for inference and training workloads. We optimized GPT-J, and others, for real: https://centml.ai/benchmarks/

You could also port that code from NVIDIA CUDA to something else. If you wanna run on servers, Vulkan Compute will probably do the trick. I have ported ML inference to Windows desktops with DirectCompute: https://github.com/Const-me/Whisper

The NVIDIA A40 GPUs mentioned on the page you've linked cost $4,000 (Amazon) to $10,000 (Dell), yet deliver similar performance to an Intel Arc A770, which costs $350. That's about one order of magnitude difference in cost efficiency.
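A rough ratio from those prices, assuming (as above) that inference throughput is comparable between the two cards:

    a40_price_low, a40_price_high = 4_000, 10_000   # NVIDIA A40: Amazon vs Dell
    a770_price = 350                                 # Intel Arc A770

    # With similar throughput assumed, cost efficiency differs by roughly 11x to 29x,
    # i.e. about one order of magnitude.
    print(a40_price_low / a770_price, a40_price_high / a770_price)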