What does HackerNews think of llama.cpp?

Port of Facebook's LLaMA model in C/C++

Language: C

Try to get something like tinygrad[1] running locally; that way you can tweak things a bit, run it again, and see how it performs. While doing this you'll pick up most of the concepts and get a feeling for how things work. Also, take a look at projects like llama.cpp[2]; you don't have to fully understand what's going on there, though.

You may need some intermediate knowledge of linear algebra, plus this thing called "data science" nowadays, which is pretty much knowing how to wrangle data and visualize it.

Try creating a small model on your own; it doesn't have to be super fancy, just make sure it does something you want it to do. And then ... you can probably go on from there on your own.

1: https://github.com/tinygrad/tinygrad

2: https://github.com/ggerganov/llama.cpp

> run their own LLMs on Linux and the unfortunate answer was always that the existing options were slightly complicated

What about https://github.com/ggerganov/llama.cpp ?

It compiles and runs easily on Linux.
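
For example (assuming git, make, and a C/C++ compiler are already installed):

  # clone and build with the default Makefile
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp && make

  # sanity check: list the available flags
  ./main --help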

Better privacy might be running Llama2 locally, offline.

If somebody hasn't tried running LLMs yet, here are some lines that do the job in Google Colab or locally. The !s are for Colab; remove them for a local terminal. The script downloads the ~8 GB model, but llama.cpp can run offline afterwards.

  ! git clone https://github.com/ggerganov/llama.cpp.git

  ! wget "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/resolve/main/codellama-7b.Q8_0.gguf" -P llama.cpp/models

  ! cd llama.cpp && make

  ! ./llama.cpp/main -m ./llama.cpp/models/codellama-7b.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8

Saving you some time: if you have a MacBook Pro M1/M2 with 32 GB of RAM (I presume a lot of HN folks do), you can comfortably run the `34B` models on CPU or GPU.
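
For instance, something like this should work once you have a quantized 34B GGUF on disk (the filename below is just a placeholder; `-ngl` offloads layers to the GPU via Metal, drop it to stay on CPU):

  # hypothetical model path - substitute whichever 34B .gguf you downloaded
  ./main -m ./models/your-34b-model.Q4_K_M.gguf -ngl 1 -c 2048 -ins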

And... if you'd like a more hands-on approach, here are the manual steps to get llama running locally:

    - https://github.com/ggerganov/llama.cpp 
    - follow instructions to build it (note the `METAL` flag; a build sketch follows this list)
    - https://huggingface.co/models?sort=trending&search=gguf
    - pick any `gguf` model that tickles your fancy; download instructions will be there
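
For the build step, the Metal-enabled build was something like this at the time of writing (the exact flag has moved around between llama.cpp versions, so check the current README):

    # Metal-enabled build on Apple Silicon (flag name may differ in newer versions)
    cd llama.cpp && LLAMA_METAL=1 make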

And a little script like this will get it running swimmingly:

   ./main -m ./models/<your-model>.gguf --color --keep -1 -n -1 -ngl 32 --repeat_penalty 1.1 -i -ins

Enjoy the next hours of digging through flags and the wonderful pit of time ahead of you.

NOTE: I'm new at this stuff, feedback welcome.

If you have an Apple Silicon machine, combine [0] with [1] for state-of-the-art local code completion and general Q/A.

[0] https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-...

[1] https://github.com/ggerganov/llama.cpp
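
If you go this route, the rough shape of it is a single run command (assuming you grab a pre-quantized GGUF conversion of [0] rather than the fp16 weights; the filename below is illustrative):

  # hypothetical filename - use whichever quantized GGUF of the model you downloaded
  ./main -m ./models/wizardcoder-python-34b-v1.0.Q4_K_M.gguf -ngl 1 -ins --temp 0.1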

I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.

  - I wouldn't use anything higher than a 7B model if you want decent speed.
  - Quantize to 4-bit to save RAM and run inference faster.
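
The 4-bit step looks roughly like this with llama.cpp's bundled `quantize` tool (paths are placeholders; check the project README for the exact invocation on your version):

  # convert an f16 model to 4-bit q4_0 (placeholder paths)
  ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
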
Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.

Probably quantizing or using base weights and this project https://github.com/ggerganov/llama.cpp on a CPU machine with AVX512 instructions.
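
On Linux you can check for AVX512 support with something like:

  # lists any avx512 feature flags your CPU advertises (empty output = no AVX512)
  grep -o 'avx512[^ ]*' /proc/cpuinfo | sort -u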

For those getting started, the easiest one-click installer I've used is Nomic.ai's gpt4all: https://gpt4all.io/

This runs with a simple GUI on Windows/Mac/Linux, leverages a fork of llama.cpp on the backend, supports GPU acceleration, and handles LLaMA, Falcon, MPT, and GPT-J models. It also has API/CLI bindings.

I just saw a slick new tool, https://ollama.ai/, that will let you install a llama2-7b with a single `ollama run llama2` command. It has a very simple 1-click installer for Apple Silicon Macs (but you need to build from source for anything else atm). It looks like it only supports llamas OOTB, but it also seems to use llama.cpp (via a Go adapter) on the backend - it seemed to be CPU-only on my MBA, but I didn't poke too much and it's brand new, so we'll see.

Anyone on HN should probably be looking at https://github.com/ggerganov/llama.cpp and https://github.com/ggerganov/ggml directly. If you have a high-end Nvidia consumer card (3090/4090), I'd highly recommend looking into https://github.com/turboderp/exllama

For those generally confused, the r/LocalLLaMA wiki is a good place to start: https://www.reddit.com/r/LocalLLaMA/wiki/guide/

I've also been porting my own notes into a single location that tracks models, evals, and has guides focused on local models: https://llm-tracker.info/

The gold standard of local-only model inference for LLaMA, Alpaca, and friends is llama.cpp (https://github.com/ggerganov/llama.cpp). No dependencies, no GPU needed; just point it to a model snapshot that you download separately over BitTorrent. Simple CLI tools that are usable (somewhat) from shell scripts.
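
For example, a one-shot, non-interactive call that you could drop into a script (the model path is a placeholder for whatever snapshot you converted):

  # generate up to 128 tokens for a fixed prompt and print them to stdout (load logs go to stderr)
  ./main -m ./models/7B/ggml-model-q4_0.bin -p "Q: What is the capital of France? A:" -n 128 2>/dev/null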

Hoping they add support for llama 2 soon!

It's not a community per se but there's a lot of research and discussion going on directly in the llama.cpp repo (https://github.com/ggerganov/llama.cpp) if you're interested in the more technical side of things.

> In comparison I could just type git clone https://github.com/ggerganov/llama.cpp and make . And it worked.

You're comparing a single, well-managed project that has put effort into user onboarding against all projects of a different language, and proclaiming that an entire language/ecosystem is crap.

The only real take away is that many projects, independent of language, put way too little effort towards onboarding users.

First link: https://github.com/ggerganov/llama.cpp

Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971

Is it really quicker to ask here than to just browse the content for a bit, skim some text, or even use Google for one minute?

To use it with llama.cpp on a CPU with 8 GB of RAM:

  # build llama.cpp and install the Python deps for the conversion script
  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
  python3 -m pip install -r requirements.txt

  # fetch the OpenLLaMA preview weights, then convert to ggml f16 and quantize to q5_0
  cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
  python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
  ./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0

  # run inference on the quantized model
  ./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock

For a general guide, I recommend: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

There's a subreddit r/LocalLLaMA that seems like the most active community focused on self-hosting LLMs. Here's a recent discussion on hardware: https://www.reddit.com/r/LocalLLaMA/comments/12lynw8/is_anyo...

If you're looking just for local inference, your best bet is probably to buy a consumer GPU w/ 24GB of RAM (3090 is fine, 4090 has more performance potential), which can fit a 30B parameter 4-bit quantized model (roughly 15 GB of weights, leaving room for the KV cache) that can probably be fine-tuned to ChatGPT (3.5) level quality. If not, then you can probably add a second card later on.

Alternatively, if you have an Apple Silicon Mac, llama.cpp performs surprisingly well, it's easy to try for free: https://github.com/ggerganov/llama.cpp

Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box. ROCm still has no RDNA3 support (supposedly coming w/ 5.5, but no release date announced) and it's unclear how well it'll work - basically, unless you would rather be fighting w/ hardware than playing around w/ ML, it's probably best to avoid it (the older RDNA cards also don't have tensor cores, so perf would be hobbled even if you could get things running; lots of software has been written w/ CUDA-only in mind).

I can certainly imagine it after seeing https://github.com/ggerganov/llama.cpp

Still a couple years out but moving way faster than I would have expected.