How realistic is CPU-only inference in the near future?
You can see for yourself (assuming you have the model weights): https://github.com/abetlen/llama-cpp-python
I get around 140 ms per token running a 13B parameter model on a ThinkPad laptop with a 14-core Intel i7-9750 processor. Because it's CPU inference, the initial prompt processing takes longer than it would on a GPU, so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bearable for things like chat.
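If you want to reproduce a rough per-token number on your own hardware, here's a minimal sketch using llama-cpp-python's `Llama` class. The model path and prompt are placeholders; you'd point it at whatever quantized 13B weights you have, and `n_threads` should match your CPU.

```python
import time

from llama_cpp import Llama

# Load a quantized 13B model (path is a placeholder -- use your own weights).
llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin", n_threads=8)

prompt = "Q: Name the planets in the solar system. A:"

start = time.perf_counter()
output = llm(prompt, max_tokens=64, stop=["Q:"], echo=False)
elapsed = time.perf_counter() - start

# Completion is returned in an OpenAI-style dict with token usage counts.
n_tokens = output["usage"]["completion_tokens"]
print(f"~{elapsed / n_tokens * 1000:.0f} ms per generated token")
print(output["choices"][0]["text"])
```

Note that this timing includes the initial prompt processing mentioned above, so the number it prints will be a bit worse than the steady-state per-token latency, especially for long prompts.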