This leaves a ton of stuff out.

- Token generation is serial and bandwidth-bound, but prompt ingestion is not and runs in batches of 512+ tokens. Short tests are fast on pure-CPU llama.cpp, but long prompts (such as an ongoing conversation) are extremely slow compared to other backends.

- llama.cpp now has very good ~4-bit quantization that barely affects perplexity. Q6_K has almost the same perplexity as FP16 but is still massively smaller.

- Batching is a big thing to ignore if you're doing anything beyond personal deployments.

- The real magic of llama.cpp is model splitting. A small discrete GPU can completely take over prompt ingestion and offload part of the model inference. And it doesn't have to be an Nvidia GPU! No other backend in the generative AI space does that so efficiently (rough example after this list).

- Hence the GPU backends (OpenCL, Metal, CUDA, soon ROCm and Vulkan) are the de facto way to run llama.cpp these days. Without them, I couldn't even run 70B on my desktop, or 33B on my (16GB RAM) laptop.
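
To make the splitting point concrete, here's a rough sketch of partial offload (the model file name and layer count are made up; `-ngl` is the flag that controls how many layers go to the GPU):

```
# Illustrative only: offload 20 layers of a big quantized model to the GPU
# and keep the rest in system RAM; raise -ngl until you run out of VRAM.
./main -m some-70b-model.ggmlv3.q4_K_M.bin -ngl 20 -c 2048 \
  -p "### Instruction: Write a story about llamas\n### Response:"
```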

ROCm works now! I just set it up tonight on a 6900 XT with 16 GB of VRAM, running Wayland at the same time. The trick was using the opencl-amd package (somehow the rocm packages don't depend on OpenCL, but llama.cpp does, idk).

I'm astonished at the results I can get from the q6_K models.

Can you please share more info on this? I have a 6900 XT "gathering dust" in a Proxmox server and would like to try passing it through to a VM and using it. Thank you in advance!

Sure thing. There are a bunch of ways to do it, but here are some quick notes on what I did.

* Arch Linux has tons of `rocm` packages. I installed pretty much all of them: https://archlinux.org/packages/?sort=&q=rocm&maintainer=&fla...

* you also need this one package from AUR: https://aur.archlinux.org/packages/opencl-amd

* llama.cpp now has GPU support, including CLBlast, which is what we need for this, so compile with `LLAMA_CLBLAST=ON` (rough build commands after this list)

* now you can run any model llama.cpp supports, so grab some ggml models that fit on the card from https://huggingface.co/TheBloke.

* Test it out with: `./main -t 30 -ngl 128 -m huginnv1.2.ggmlv3.q6_K.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"`

* You should see `BLAS = 1` in the llama.cpp output, and you should get maybe 5 tokens per second on a 13B 6-bit quantized ggml model.

* You can compile llama-cpp-python with the same flags and get text-generation-webui to work too, but there's a bit of dependency fighting to do it (rough pip invocation after this list)

* koboldcpp might be better; I just haven't tried it yet
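
For reference, here's roughly what the CMake build for the CLBlast route looks like (a sketch, not gospel: it assumes the CLBlast library and headers are already installed, which on Arch I believe is the `clblast` package):

```
# rough sketch; assumes the CLBlast library/headers are already installed
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_CLBLAST=ON
cmake --build . --config Release
# the main binary should end up under build/bin/
```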
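
For the llama-cpp-python bit, the usual trick (going from that project's README, so double-check the exact variable names for your version) is to pass the same flag through pip's build:

```
# pass the CLBlast flag through to llama-cpp-python's bundled llama.cpp build
# (env var names per that project's README; may vary by version)
CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
```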

Hope that helps!

Edit: just tried https://github.com/LostRuins/koboldcpp and it also works great. I probably should have just started there.

Compile with `make LLAMA_CLBLAST=1`, then run with `python koboldcpp.py --useclblast 0 0 --gpulayers 128 huginnv1.2.ggmlv3.q6_K.bin`
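
If it picks the wrong device, the two numbers after `--useclblast` should be the OpenCL platform and device indices; `clinfo` can list what's available (assuming you have it installed):

```
# list OpenCL platforms and devices to pick the indices for --useclblast
clinfo -l
```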