I hope the popularity of large models like this one drives more work on CPU inference of quantized models. It is extremely disappointing that one can't run 4-bit or even 8-bit quantized models on a CPU. The fp32 inference I did on a last-gen AVX2 CPU shows me it is definitely usable if you're willing to wait a bit longer for each token (I got about 1 token per 2 s on a Ryzen 3700X with 32 GB of RAM, running falcon-7B-instruct, and that was with about 1 GB of RAM in swap).

I don't quite understand why people aren't working on CPU quantization. Allegedly OpenVINO supports _some_ CPU quantization, but certainly not 4-bit. Bitsandbytes is GPU-only.

Why? Are there any technical reasons? I recently checked, and for the price of a 24 GB RTX 3090 I can get a really nice CPU (Ryzen 9 5950X) and max it out with 128 GB of RAM. I'd love to be able to use that for int8 or 4-bit inference...
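
For a sense of scale, here's the back-of-envelope arithmetic on weight sizes (a rough sketch; the 40B figure is just illustrative, and real runtimes need extra headroom for activations, KV cache and quantization metadata):

```python
# Rough memory footprint of a model's weights at different precisions.
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8

for name, params in [("7B", 7e9), ("40B", 40e9)]:
    for bits in (32, 16, 8, 4):
        gb = weight_bytes(params, bits) / 1e9
        print(f"{name:>3} @ {bits:>2}-bit: ~{gb:6.1f} GB")
```

So 128 GB of RAM would comfortably fit even fairly large models at int8, let alone int4.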

https://github.com/ggerganov/ggml
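
For anyone curious what 4-bit weights look like in practice, here's a minimal sketch of blockwise quantization, roughly in the spirit of ggml's Q4_0 type (I'm assuming 32-element blocks with one scale each; the actual packed on-disk layout differs):

```python
import numpy as np

BLOCK = 32  # assumed block size; one scale per block

def quantize_q4(w: np.ndarray):
    # Symmetric blockwise quantization: map each block to integers in [-8, 7].
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
print("max abs error:", np.abs(w - w_hat).max())
# Two 4-bit values pack into one byte on disk, so weights shrink ~8x vs fp32,
# plus a small overhead for the per-block scales.
```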

TinyGrad is also targeting CPU inference, and IIRC CPU inference works OK in Apache TVM as well.

One note is that prompt ingestion is extremely slow on CPU compared to GPU. So short prompts are fine (and tokens can be streamed once the prompt is ingested), but long prompts feel extremely sluggish.
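
A rough way to see why: the prompt still costs on the order of 2 FLOPs per parameter per token even though it can be batched, and a CPU simply has far less compute to throw at it than a GPU. Purely illustrative numbers (the peak-throughput figures below are assumptions, not measurements):

```python
# Back-of-envelope: why long prompts feel slow on CPU.
n_params = 7e9
flops_per_token = 2 * n_params     # rough rule of thumb for one forward pass

cpu_flops = 1e12                   # ~1 TFLOP/s, optimistic AVX2 desktop CPU (assumed)
gpu_flops = 30e12                  # ~30 TFLOP/s, consumer GPU (assumed)

for prompt_len in (32, 512, 2048):
    cpu_s = prompt_len * flops_per_token / cpu_flops
    gpu_s = prompt_len * flops_per_token / gpu_flops
    print(f"{prompt_len:>5} prompt tokens: CPU ~{cpu_s:6.1f} s, GPU ~{gpu_s:5.2f} s")
```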

Another is that CPUs with more than a 128-bit DDR5 memory bus are very expensive, and CPU token generation is basically RAM-bandwidth bound.
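
That bandwidth bound is easy to sanity-check: each generated token has to stream roughly the whole weight set from RAM once, so tokens/s is capped at bandwidth divided by weight bytes. Using nominal dual-channel numbers and ignoring KV-cache traffic:

```python
# Upper bound on token generation rate if RAM bandwidth is the bottleneck.
configs = {
    "DDR4-3200 dual channel": 51.2e9,   # bytes/s, nominal
    "DDR5-5600 dual channel": 89.6e9,
}
weights = {
    "7B fp32": 28e9,   # bytes
    "7B int8": 7e9,
    "7B int4": 3.5e9,
}

for mem, bw in configs.items():
    for name, nbytes in weights.items():
        print(f"{mem}, {name}: <= {bw / nbytes:5.1f} tokens/s")
```

Which is also why quantization helps so much on CPU: int4 weights are ~8x fewer bytes to stream per token than fp32.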