TinyGrad is also targeting CPU inference, and IIRC it works ok in Apache TVM.
One note is that prompt ingestion is extremely slow on CPU compared to GPU, since the whole prompt has to be pushed through the model before the first output token and that step is compute-bound rather than bandwidth-bound. So short prompts are fine (and tokens can be streamed once the prompt is ingested), but long prompts feel extremely sluggish.
Another is that CPUs with DDR5 memory buses wider than 128 bits are very expensive, and CPU token generation is basically memory-bandwidth bound.
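As a rough illustration of that bandwidth-bound argument, here is a back-of-the-envelope sketch in Python. The bandwidth figures are theoretical peaks and the model footprints are approximations I'm assuming, not measurements:

```python
# Back-of-the-envelope upper bound on CPU token generation, assuming generation
# is memory-bandwidth bound: each generated token has to stream the full set of
# model weights from RAM at least once. All numbers are rough assumptions.

GB = 1e9

def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound: one full pass over the weights per generated token."""
    return bandwidth_bytes_per_s / model_bytes

# Assumed theoretical peak bandwidths (sustained real-world numbers are lower).
bandwidths = {
    "DDR4-3200 dual channel (128-bit)": 51.2 * GB,
    "DDR5-4800 dual channel (128-bit)": 76.8 * GB,
}

# Assumed model footprints in RAM.
models = {
    "GPT-J 6B fp16 (~12 GB)": 12 * GB,
    "7B q4_0 (~4 GB)": 4 * GB,
}

for bw_name, bw in bandwidths.items():
    for m_name, size in models.items():
        print(f"{m_name} on {bw_name}: <= {max_tokens_per_second(size, bw):.1f} tok/s")
```

On those assumptions, a 128-bit DDR5 desktop tops out at very roughly 6 tokens/s for an fp16 GPT-J-sized model, which is why wider (and much pricier) memory buses matter for CPU generation speed.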
llama.cpp and alpaca.cpp (and other derivatives) all require model weights to be converted to the ggml format to run.
An example of GPT-J running on the CPU is shown in Fig. [4](#Fig4):

```
gptj_model_load: ggml ctx size = 13334.86 MB
gptj_model_load: memory_size = 1792.00 MB, n_mem = 57344
gptj_model_load: model size = 11542.79 MB / num tensors = 285
main: number of tokens in prompt = 12
main: mem per token = 16179460 bytes
main: load time = 7463.20 ms
main: sample time = 3.24 ms
main: predict time = 4887.26 ms / 232.73 ms per token
main: total time = 13203.91 ms
```
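As a sanity check, the per-token time in that log implies both the generation speed and the effective memory bandwidth the run achieved. This sketch just re-derives numbers already printed above, under the assumption of one full pass over the weights per token:

```python
# Re-derive throughput from the timings printed above.
model_size_mb = 11542.79   # "model size" from gptj_model_load
ms_per_token = 232.73      # "predict time ... per token" from main

tokens_per_second = 1000.0 / ms_per_token                    # ~4.3 tokens/s

# If each generated token streams the full weights once, this is the
# effective memory bandwidth the run achieved.
implied_bandwidth_gb_s = (model_size_mb / 1000.0) * tokens_per_second

print(f"{tokens_per_second:.2f} tokens/s")
print(f"~{implied_bandwidth_gb_s:.0f} GB/s effective bandwidth")  # ~50 GB/s
```

Roughly 50 GB/s is in the range a dual-channel desktop memory system can realistically sustain, which fits the bandwidth-bound reading of these timings.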
I tested it on my Linux machine and my MacBook Air (M1), and it generates tokens at a reasonable speed using the CPU only. I noticed it doesn't quite saturate all my available CPU cores, so it may be leaving some performance on the table, though I'm not sure.
GPT-J 6B is nowhere near as large as the OPT-175B discussed in the post, but I got the sense that CPU-only inference may not be totally hopeless even for large models if we had some high-quality software to do it.
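Extrapolating the same bandwidth-bound estimate to a model of OPT-175B's size is purely illustrative; the size and bandwidth below are assumptions, and such a machine would also need well over 175 GB of RAM just to hold the weights:

```python
# Illustrative extrapolation of the bandwidth-bound estimate to a 175B model.
opt_175b_int8_gb = 175.0          # ~1 byte per parameter at 8-bit (assumed)
effective_bandwidth_gb_s = 50.0   # roughly what the GPT-J run above achieved

tokens_per_second = effective_bandwidth_gb_s / opt_175b_int8_gb
print(f"~{tokens_per_second:.2f} tokens/s")   # ~0.3 tokens/s: slow, but not zero
```

A third of a token per second is painful but not nothing, which is roughly the sense in which CPU-only inference for large models seems not totally hopeless.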