I hope the popularity of large models like this one drives more work on CPU inference of quantized models. It is extremely disappointing that one can't run 4-bit or even 8-bit quantized models on a CPU. The fp32 inference I did on a last-gen AVX2 CPU shows me it is definitely usable if you're willing to wait a bit longer for each token (I got about 1 token per 2 s on a Ryzen 3700X with 32 GB of RAM, running falcon-7B-instruct, and that was with about 1 GB of RAM in swap).

I don't quite understand why people aren't working on CPU quantization. Allegedly OpenVINO supports _some_ CPU quantization, but certainly not 4-bit. Bitsandbytes is GPU-only.

Why? Are there any technical reasons? I recently checked, and for the price of a 24 GB RTX 3090 I can get a really nice CPU (Ryzen 9 5950X) and max it out with 128 GB of RAM. I'd love to be able to use that for int8 or 4-bit inference...
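
For a sense of scale, here's the back-of-envelope arithmetic on weight sizes (a rough sketch; the 40B figure is just illustrative, and real runtimes need extra headroom for activations, KV cache and quantization metadata):

```python
# Rough memory footprint of a model's weights at different precisions.
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8

for name, params in [("7B", 7e9), ("40B", 40e9)]:
    for bits in (32, 16, 8, 4):
        gb = weight_bytes(params, bits) / 1e9
        print(f"{name:>3} @ {bits:>2}-bit: ~{gb:6.1f} GB")
```

So 128 GB of RAM would comfortably fit even fairly large models at int8, let alone int4.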

https://github.com/ggerganov/ggml
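
For anyone curious what 4-bit weights look like in practice, here's a minimal sketch of blockwise quantization, roughly in the spirit of ggml's Q4_0 type (I'm assuming 32-element blocks with one scale each; the actual packed on-disk layout differs):

```python
import numpy as np

BLOCK = 32  # assumed block size; one scale per block

def quantize_q4(w: np.ndarray):
    # Symmetric blockwise quantization: map each block to integers in [-8, 7].
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
print("max abs error:", np.abs(w - w_hat).max())
# Two 4-bit values pack into one byte on disk, so weights shrink ~8x vs fp32,
# plus a small overhead for the per-block scales.
```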

TinyGrad is also targeting CPU inference, and IIRC CPU inference works OK in Apache TVM as well.

One note is that prompt ingestion is extremely slow on CPU compared to GPU. So short prompts are fine (and tokens can be streamed once the prompt is ingested), but long prompts feel extremely sluggish.
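
A rough way to see why: the prompt still costs on the order of 2 FLOPs per parameter per token even though it can be batched, and a CPU simply has far less compute to throw at it than a GPU. Purely illustrative numbers (the peak-throughput figures below are assumptions, not measurements):

```python
# Back-of-envelope: why long prompts feel slow on CPU.
n_params = 7e9
flops_per_token = 2 * n_params     # rough rule of thumb for one forward pass

cpu_flops = 1e12                   # ~1 TFLOP/s, optimistic AVX2 desktop CPU (assumed)
gpu_flops = 30e12                  # ~30 TFLOP/s, consumer GPU (assumed)

for prompt_len in (32, 512, 2048):
    cpu_s = prompt_len * flops_per_token / cpu_flops
    gpu_s = prompt_len * flops_per_token / gpu_flops
    print(f"{prompt_len:>5} prompt tokens: CPU ~{cpu_s:6.1f} s, GPU ~{gpu_s:5.2f} s")
```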

Another is that CPUs with more than a 128-bit DDR5 memory bus are very expensive, and CPU token generation is basically RAM-bandwidth bound.
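
That bandwidth bound is easy to sanity-check: each generated token has to stream roughly the whole weight set from RAM once, so tokens/s is capped at bandwidth divided by weight bytes. Using nominal dual-channel numbers and ignoring KV-cache traffic:

```python
# Upper bound on token generation rate if RAM bandwidth is the bottleneck.
configs = {
    "DDR4-3200 dual channel": 51.2e9,   # bytes/s, nominal
    "DDR5-5600 dual channel": 89.6e9,
}
weights = {
    "7B fp32": 28e9,   # bytes
    "7B int8": 7e9,
    "7B int4": 3.5e9,
}

for mem, bw in configs.items():
    for name, nbytes in weights.items():
        print(f"{mem}, {name}: <= {bw / nbytes:5.1f} tokens/s")
```

Which is also why quantization helps so much on CPU: int4 weights are ~8x fewer bytes to stream per token than fp32.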