What does HackerNews think of llama.cpp?

Port of Facebook's LLaMA model in C/C++

Language: C

Try to get something like tinygrad[1] running locally; that way you can tweak things a bit, run it again, and see how it performs. While doing this you'll pick up most of the concepts and get a feel for how things work. Also, take a look at projects like llama.cpp[2]; you don't have to fully understand what's going on there, though.

You may need some intermediate knowledge of linear algebra and this thing called "data science" nowadays, which is pretty much knowing how to mangle data and visualize it.

Try creating a small model on your own; it doesn't have to be super fancy, just make sure it does something you want it to do. And then ... you can probably carry on by yourself from there.

1: https://github.com/tinygrad/tinygrad

2: https://github.com/ggerganov/llama.cpp

> run their own LLMs on Linux and the unfortunate answer was always that the existing options were slightly complicate

What about https://github.com/ggerganov/llama.cpp ?

It compiles and runs easily on Linux.

Better privacy might be running Llama2 locally, offline.

If somebody hasn't tried running LLMs yet, here are some lines that do the job in Google Colab or locally. The !s are for Colab; remove them for a local terminal. The script downloads the ca. 8GB model, but llama.cpp can run offline afterwards.

  ! git clone https://github.com/ggerganov/llama.cpp.git

  ! wget "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/resolve/main/codellama-7b.Q8_0.gguf" -P llama.cpp/models

  ! cd llama.cpp && make

  ! ./llama.cpp/main -m ./llama.cpp/models/codellama-7b.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8
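
Since the download above is several gigabytes, it can be worth checking that the file really is a GGUF model before pointing `main` at it. Here is a minimal sketch, assuming the usual little-endian GGUF layout where the file starts with the four bytes `GGUF` followed by a 32-bit version number. The function name `read_gguf_version` is made up for this example and is not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file does not look like GGUF.

    Assumes a little-endian file: 4 magic bytes, then a uint32 version.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For the model from the commands above you would call `read_gguf_version("llama.cpp/models/codellama-7b.Q8_0.gguf")`. A `None` result usually means a truncated download or an HTML error page saved under the model's name.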

Saving you some time: if you have a MacBook Pro M1/M2 with 32GB of RAM (I presume a lot of HN folks would), you can comfortably run the `34B` models on CPU or GPU.

And... if you'd like a more hands-on approach, here is a manual way to get llama running locally:

    - https://github.com/ggerganov/llama.cpp 
    - follow instructions to build it (note the `METAL` flag)
    - https://huggingface.co/models?sort=trending&search=gguf
    - pick any `gguf` model that tickles your fancy, download instructions will be there

and a little script like this will get it running swimmingly

   ./main -m ./models/.gguf --color --keep -1 -n -1 -ngl 32 --repeat_penalty 1.1 -i -ins

Enjoy the next hours of digging through flags and the wonderful pit of time ahead of you.

NOTE: I'm new at this stuff, feedback welcome.
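
If you would rather drive this from Python than retype the flags, here is a rough sketch that builds the same argument list. The flags are copied verbatim from the command above and may differ in other llama.cpp versions. The helper name `build_main_command` and the model filename `example.gguf` are placeholders invented for this example.

```python
import shlex


def build_main_command(model_path, binary="./main", gpu_layers=32, repeat_penalty=1.1):
    """Build the argv list for llama.cpp's main binary, using the flags quoted above."""
    return [
        binary,
        "-m", str(model_path),
        "--color",
        "--keep", "-1",
        "-n", "-1",
        "-ngl", str(gpu_layers),  # number of layers to offload to the GPU
        "--repeat_penalty", str(repeat_penalty),
        "-i", "-ins",
    ]


# "example.gguf" is a hypothetical filename; substitute the model you downloaded.
cmd = build_main_command("./models/example.gguf")
print(shlex.join(cmd))  # shell-quoted form, handy for copy and paste
```

Passing the list to `subprocess.run(cmd)` starts the interactive session. Keeping the arguments as a list avoids quoting problems when a model path contains spaces.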

If you have an Apple Silicon machine, combine [0] with [1] for state-of-the-art local code completion and general Q/A.

[0] https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-...

[1] https://github.com/ggerganov/llama.cpp

I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.

  - I wouldn't use anything higher than a 7B model if you want decent speed.
  - Quantize to 4-bit to save RAM and run inference faster.
Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.
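
To put a rough number on the RAM saving from quantization, here is a back-of-the-envelope sketch. It counts only the weights, as parameters times bits per weight. Real quantized files are somewhat larger because the formats also store per-block scale factors, and the context (KV cache) needs memory on top. The function name `estimate_weight_gib` is made up for this example.

```python
def estimate_weight_gib(n_params, bits_per_weight):
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Compare 16-bit, 8-bit and 4-bit storage for a 7B-parameter model.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{estimate_weight_gib(7e9, bits):.1f} GiB")
```

By this estimate a 7B model drops from about 13 GiB at 16-bit to a little over 3 GiB at 4-bit, which is why the 4-bit variant fits on machines with 8GB of RAM.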

Probably quantizing or using base weights and this project https://github.com/ggerganov/llama.cpp on a CPU machine with AVX512 instructions.

The gold standard of local-only model inference for LLaMA, Alpaca, and friends is llama.cpp: https://github.com/ggerganov/llama.cpp. No dependencies, no GPU needed; just point it to a model snapshot that you download separately on BitTorrent. Simple CLI tools that are usable (somewhat) from shell scripts.

Hoping they add support for llama 2 soon!

It's not a community per se but there's a lot of research and discussion going on directly in the llama.cpp repo (https://github.com/ggerganov/llama.cpp) if you're interested in the more technical side of things.

> In comparison I could just type git clone https://github.com/ggerganov/llama.cpp and make . And it worked.

You're comparing a single, well-managed project that has put effort into user onboarding against all projects in a different language, and proclaiming that an entire language/ecosystem is crap.

The only real takeaway is that many projects, independent of language, put way too little effort into onboarding users.

First link: https://github.com/ggerganov/llama.cpp

Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971

Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?

To use with llama.cpp on CPU and 8GB RAM

  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
  python3 -m pip install -r requirements.txt

  cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
  python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
  ./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
  ./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock

I can certainly imagine it after seeing https://github.com/ggerganov/llama.cpp

Still a couple years out but moving way faster than I would have expected.