It is worth mentioning that running inference on modern CPUs with AVX2 is not that bad. Sure, it is slower than on a GPU, but you get the benefit of a single large contiguous region of RAM.

But there is one huge problem keeping this from being popular on x86_64: having to run in fp32. As far as I know, the most common ML libraries (PyTorch, TensorFlow, ONNX, etc.) do not offer an option to quantize to 4 bits, and they don't offer inference at anything other than fp32 on x86_64 CPUs.

It is a huge shame. There is OpenVINO, which supports int8, but if you can't easily quantize large models without a GPU, what use is it? (For small models, I suppose.)

So if anyone has figured out a way to quantize a transformer model to 4/8 bit and run it on the x86_64 CPU platform, I'm very interested in hearing about it.

Sorry, but the topic of this post, llama.cpp, runs quantized 4/8 bit models just fine on x86_64 with AVX2, or am I missing some requirement you have?
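For example, here is a minimal sketch using the llama-cpp-python bindings (my assumption that Python is convenient for you; the plain llama.cpp binaries work just as well, and the model path below is a placeholder for a file you've already quantized with llama.cpp's own tooling, e.g. to q4_0):

    # Minimal sketch: CPU-only inference with an already-quantized 4-bit model
    # via llama-cpp-python. The model path is a placeholder; AVX2 kernels are
    # picked up automatically when the library is built on a CPU that has them.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-7b-q4_0.bin",  # placeholder: your quantized model file
        n_ctx=2048,    # context window
        n_threads=8,   # number of CPU threads to use
    )

    out = llm("Q: Does llama.cpp run on x86_64 with AVX2? A:", max_tokens=64)
    print(out["choices"][0]["text"])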

Wait, I wasn't aware llama.cpp even runs on x86_64. I thought it was ARM hardware only. If what you say is correct, that is indeed very interesting, especially if I can extend it to other models like Falcon.

It doesn't support Falcon right now, but there's a fork that does (https://github.com/cmp-nct/ggllm.cpp/).