TinyGrad is also targeting CPU inference, and IIRC it works ok in Apache TVM.
One note is that prompt ingestion is extremely slow on CPU compared to GPU, since the whole prompt has to be pushed through the model before the first output token and that step is compute-bound rather than bandwidth-bound. So short prompts are fine (and tokens can be streamed once the prompt is ingested), but long prompts feel extremely sluggish.
Another is that CPUs with DDR5 memory buses wider than 128 bits are very expensive, and CPU token generation is basically memory-bandwidth bound.
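As a rough illustration of that bandwidth-bound argument, here is a back-of-the-envelope sketch in Python. The bandwidth figures are theoretical peaks and the model footprints are approximations I'm assuming, not measurements:

```python
# Back-of-the-envelope upper bound on CPU token generation, assuming generation
# is memory-bandwidth bound: each generated token has to stream the full set of
# model weights from RAM at least once. All numbers are rough assumptions.

GB = 1e9

def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound: one full pass over the weights per generated token."""
    return bandwidth_bytes_per_s / model_bytes

# Assumed theoretical peak bandwidths (sustained real-world numbers are lower).
bandwidths = {
    "DDR4-3200 dual channel (128-bit)": 51.2 * GB,
    "DDR5-4800 dual channel (128-bit)": 76.8 * GB,
}

# Assumed model footprints in RAM.
models = {
    "GPT-J 6B fp16 (~12 GB)": 12 * GB,
    "7B q4_0 (~4 GB)": 4 * GB,
}

for bw_name, bw in bandwidths.items():
    for m_name, size in models.items():
        print(f"{m_name} on {bw_name}: <= {max_tokens_per_second(size, bw):.1f} tok/s")
```

On those assumptions, a 128-bit DDR5 desktop tops out at very roughly 6 tokens/s for an fp16 GPT-J-sized model, which is why wider (and much pricier) memory buses matter for CPU generation speed.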
llama.cpp and alpaca.cpp (and other derivatives) all require model weights to be converted to the ggml format to run.
An example of GPT-J running on the CPU is shown in Fig. [4](#Fig4):

```
gptj_model_load: ggml ctx size = 13334.86 MB
gptj_model_load: memory_size = 1792.00 MB, n_mem = 57344
gptj_model_load: model size = 11542.79 MB / num tensors = 285
main: number of tokens in prompt = 12
main: mem per token = 16179460 bytes
main: load time = 7463.20 ms
main: sample time = 3.24 ms
main: predict time = 4887.26 ms / 232.73 ms per token
main: total time = 13203.91 ms
```
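As a sanity check, the per-token time in that log implies both the generation speed and the effective memory bandwidth the run achieved. This sketch just re-derives numbers already printed above, under the assumption of one full pass over the weights per token:

```python
# Re-derive throughput from the timings printed above.
model_size_mb = 11542.79   # "model size" from gptj_model_load
ms_per_token = 232.73      # "predict time ... per token" from main

tokens_per_second = 1000.0 / ms_per_token                    # ~4.3 tokens/s

# If each generated token streams the full weights once, this is the
# effective memory bandwidth the run achieved.
implied_bandwidth_gb_s = (model_size_mb / 1000.0) * tokens_per_second

print(f"{tokens_per_second:.2f} tokens/s")
print(f"~{implied_bandwidth_gb_s:.0f} GB/s effective bandwidth")  # ~50 GB/s
```

Roughly 50 GB/s is in the range a dual-channel desktop memory system can realistically sustain, which fits the bandwidth-bound reading of these timings.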
I tested it on my Linux machine and my MacBook Air (M1), and it generates tokens at a reasonable speed using the CPU only. I noticed it doesn't quite saturate all my available CPU cores, so it may be leaving some performance on the table, though I'm not sure.
GPT-J 6B is nowhere near as large as the OPT-175B discussed in the post, but I got the sense that CPU-only inference may not be totally hopeless even for large models if we had some high-quality software to do it.
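Extrapolating the same bandwidth-bound estimate to a model of OPT-175B's size is purely illustrative; the size and bandwidth below are assumptions, and such a machine would also need well over 175 GB of RAM just to hold the weights:

```python
# Illustrative extrapolation of the bandwidth-bound estimate to a 175B model.
opt_175b_int8_gb = 175.0          # ~1 byte per parameter at 8-bit (assumed)
effective_bandwidth_gb_s = 50.0   # roughly what the GPT-J run above achieved

tokens_per_second = effective_bandwidth_gb_s / opt_175b_int8_gb
print(f"~{tokens_per_second:.2f} tokens/s")   # ~0.3 tokens/s: slow, but not zero
```

A third of a token per second is painful but not nothing, which is roughly the sense in which CPU-only inference for large models seems not totally hopeless.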