What does Hacker News think of ggml?

Tensor library for machine learning

Language: C

https://github.com/ggerganov/ggml

TinyGrad is also targeting CPU inference, and IIRC it works OK on Apache TVM.

One note is that prompt ingestion is extremely slow on CPU compared to GPU. So short prompts are fine (and tokens can be streamed once the prompt is ingested), but long prompts feel extremely sluggish.

Another is that CPU platforms with DDR5 memory buses wider than 128 bits are very expensive, and CPU token generation is basically RAM-bandwidth-bound.
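A rough way to see why: each generated token has to read essentially all of the model weights from RAM once, so tokens per second is bounded above by memory bandwidth divided by model size. A back-of-envelope sketch (the bandwidth and model size below are illustrative assumptions, not measurements):

    #include <stdio.h>

    int main(void) {
        // Assumption: dual-channel (128-bit) DDR5-4800, ~76.8 GB/s peak
        const double bandwidth_gb_s = 76.8;
        // Assumption: a ~12 GB model file, streamed once per generated token
        const double model_size_gb = 12.0;

        // Best case if weight traffic is the only bottleneck
        printf("~%.1f tokens/s upper bound\n", bandwidth_gb_s / model_size_gb);
        return 0;
    }

That comes out to roughly 6 tokens/s at best on a mainstream desktop, which lines up with the "couple of words per second" reports below; widening the bus past 128 bits generally means stepping up to workstation or server platforms.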

It's an ML library written by Georgi Gerganov. It prioritizes inference on Apple hardware and low-resource machines. https://github.com/ggerganov/ggml
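To give a feel for the library, here is a minimal sketch of the C API roughly as it looked at the time, doing a toy matrix multiply. Function names, struct fields, and signatures are from memory and may have changed since, so treat this as an outline rather than a reference:

    #include "ggml/ggml.h"
    #include <stdio.h>

    int main(void) {
        // Reserve a fixed arena up front; ggml allocates all tensors out of it
        struct ggml_init_params params = {
            .mem_size   = 16*1024*1024,
            .mem_buffer = NULL,
        };
        struct ggml_context * ctx = ggml_init(params);

        // Two small f32 matrices; ne[0] is the inner (shared) dimension
        struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
        struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
        ggml_set_f32(a, 1.0f);
        ggml_set_f32(b, 2.0f);

        // Ops are recorded lazily; the graph is then evaluated on the CPU
        struct ggml_tensor * c  = ggml_mul_mat(ctx, a, b);
        struct ggml_cgraph   gf = ggml_build_forward(c);
        ggml_graph_compute(ctx, &gf);

        printf("c[0] = %f\n", ggml_get_f32_1d(c, 0));
        ggml_free(ctx);
        return 0;
    }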

llama.cpp and alpaca.cpp (and other derivatives) all require model weights to be converted to the ggml format to run.
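The ggml format itself is a plain binary container: the original, unversioned files begin with a 4-byte magic that reads as 0x67676d6c as a little-endian 32-bit integer (the hex bytes spell "ggml"), followed by the hyperparameters and tensor data. A minimal sanity check, assuming that original format:

    #include <stdio.h>
    #include <stdint.h>

    int main(int argc, char ** argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s model.bin\n", argv[0]);
            return 1;
        }
        FILE * f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        uint32_t magic = 0;
        if (fread(&magic, sizeof(magic), 1, f) != 1) {
            fclose(f);
            fprintf(stderr, "failed to read header\n");
            return 1;
        }
        fclose(f);

        // 0x67676d6c is "ggml"; later versioned formats use other magics
        puts(magic == 0x67676d6c ? "ggml magic found" : "no ggml magic");
        return 0;
    }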

ggml (https://github.com/ggerganov/ggml) has a GPT-J example; the 6B parameter model runs happily on a CPU with 16 GB of RAM and 8 cores, at a couple of words per second, no GPU necessary.

    gptj_model_load: ggml ctx size = 13334.86 MB
    gptj_model_load: memory_size =  1792.00 MB, n_mem = 57344
    gptj_model_load: model size = 11542.79 MB / num tensors = 285
    main: number of tokens in prompt = 12

    An example of GPT-J running on the CPU is shown in Fig. [4](#Fig4

    main: mem per token = 16179460 bytes
    main:     load time =  7463.20 ms
    main:   sample time =     3.24 ms
    main:  predict time =  4887.26 ms / 232.73 ms per token
    main:    total time = 13203.91 ms

I don't know about these large models, but earlier I saw an HN comment in a different thread where someone showed a GPT-J model running on CPU only: https://github.com/ggerganov/ggml

I tested it on my Linux machine and my MacBook Air M1, and it generates tokens at a reasonable speed using only the CPU. I noticed it doesn't quite saturate all my available CPU cores, so it may be leaving some performance on the table; not sure though.

GPT-J 6B is nowhere near as large as the OPT-175B in the post. But it gave me the sense that CPU-only inference may not be totally hopeless even for large models, if only we get some high-quality software to do it.