What does HackerNews think of llama.cpp?

Port of Facebook's LLaMA model in C/C++

Language: C

Try to get something like tinygrad[1] running locally; that way you can tweak things a bit, run it again, and see how it performs. While doing this you'll pick up most of the concepts and get a feeling for how things work. Also, take a look at projects like llama.cpp[2]; you don't have to fully understand what's going on there, though.

You may need some intermediate knowledge of linear algebra, plus this thing called "data science" nowadays, which is pretty much knowing how to wrangle data and visualize it.

Try creating a small model on your own; it doesn't have to be super fancy, just make sure it does something you want it to do. And then ... you can probably go on from there on your own.

1: https://github.com/tinygrad/tinygrad

2: https://github.com/ggerganov/llama.cpp

> run their own LLMs on Linux and the unfortunate answer was always that the existing options were slightly complicated

What about https://github.com/ggerganov/llama.cpp ?

It compiles and runs easily on Linux.
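
For example (assuming git, make, and a C/C++ compiler are already installed):

  # clone and build with the default Makefile
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp && make

  # sanity check: list the available flags
  ./main --help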

Better privacy might be running Llama2 locally, offline.

If somebody hasn't tried running LLMs yet, here are some lines that do the job in Google Colab or locally. The !s are for Colab; remove them for a local terminal. The script downloads the ~8 GB model, but llama.cpp can run offline afterwards.

  ! git clone https://github.com/ggerganov/llama.cpp.git

  ! wget "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/resolve/main/codellama-7b.Q8_0.gguf" -P llama.cpp/models

  ! cd llama.cpp && make

  ! ./llama.cpp/main -m ./llama.cpp/models/codellama-7b.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8

Saving you some time: if you have a MacBook Pro M1/M2 with 32 GB of RAM (I presume a lot of HN folks do), you can comfortably run the `34B` models on CPU or GPU.
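
For instance, something like this should work once you have a quantized 34B GGUF on disk (the filename below is just a placeholder; `-ngl` offloads layers to the GPU via Metal, drop it to stay on CPU):

  # hypothetical model path - substitute whichever 34B .gguf you downloaded
  ./main -m ./models/your-34b-model.Q4_K_M.gguf -ngl 1 -c 2048 -ins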

And... if you'd like a more hands-on approach, here are the manual steps to get llama running locally:

    - https://github.com/ggerganov/llama.cpp 
    - follow instructions to build it (note the `METAL` flag; a build sketch follows this list)
    - https://huggingface.co/models?sort=trending&search=gguf
    - pick any `gguf` model that tickles your fancy; download instructions will be there
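
For the build step, the Metal-enabled build was something like this at the time of writing (the exact flag has moved around between llama.cpp versions, so check the current README):

    # Metal-enabled build on Apple Silicon (flag name may differ in newer versions)
    cd llama.cpp && LLAMA_METAL=1 make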

And a little script like this will get it running swimmingly:

   ./main -m ./models/<your-model>.gguf --color --keep -1 -n -1 -ngl 32 --repeat_penalty 1.1 -i -ins

Enjoy the next hours of digging through flags and the wonderful pit of time ahead of you.

NOTE: I'm new at this stuff, feedback welcome.

If you have an Apple Silicon machine, combine [0] with [1] for state-of-the-art local code completion and general Q/A.

[0] https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-...

[1] https://github.com/ggerganov/llama.cpp
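
If you go this route, the rough shape of it is a single run command (assuming you grab a pre-quantized GGUF conversion of [0] rather than the fp16 weights; the filename below is illustrative):

  # hypothetical filename - use whichever quantized GGUF of the model you downloaded
  ./main -m ./models/wizardcoder-python-34b-v1.0.Q4_K_M.gguf -ngl 1 -ins --temp 0.1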

I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.

  - I wouldn't use anything higher than a 7B model if you want decent speed.
  - Quantize to 4-bit to save RAM and run inference faster.
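
The 4-bit step looks roughly like this with llama.cpp's bundled `quantize` tool (paths are placeholders; check the project README for the exact invocation on your version):

  # convert an f16 model to 4-bit q4_0 (placeholder paths)
  ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
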
Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.

Probably quantizing or using base weights and this project https://github.com/ggerganov/llama.cpp on a CPU machine with AVX512 instructions.
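
On Linux you can check for AVX512 support with something like:

  # lists any avx512 feature flags your CPU advertises (empty output = no AVX512)
  grep -o 'avx512[^ ]*' /proc/cpuinfo | sort -u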

For those getting started, the easiest one-click installer I've used is Nomic.ai's gpt4all: https://gpt4all.io/

This runs with a simple GUI on Windows/Mac/Linux, leverages a fork of llama.cpp on the backend, supports GPU acceleration, and handles LLaMA, Falcon, MPT, and GPT-J models. It also has API/CLI bindings.

I just saw a slick new tool, https://ollama.ai/, that will let you install a llama2-7b with a single `ollama run llama2` command. It has a very simple 1-click installer for Apple Silicon Macs (but you need to build from source for anything else atm). It looks like it only supports llamas OOTB, but it also seems to use llama.cpp (via a Go adapter) on the backend - it seemed to be CPU-only on my MBA, but I didn't poke too much and it's brand new, so we'll see.

Anyone on HN should probably be looking at https://github.com/ggerganov/llama.cpp and https://github.com/ggerganov/ggml directly. If you have a high-end Nvidia consumer card (3090/4090), I'd highly recommend looking into https://github.com/turboderp/exllama

For those generally confused, the r/LocalLLaMA wiki is a good place to start: https://www.reddit.com/r/LocalLLaMA/wiki/guide/

I've also been porting my own notes into a single location that tracks models, evals, and has guides focused on local models: https://llm-tracker.info/

The gold standard of local-only model inference for LLaMA, Alpaca, and friends is llama.cpp (https://github.com/ggerganov/llama.cpp). No dependencies, no GPU needed; just point it to a model snapshot that you download separately over BitTorrent. Simple CLI tools that are usable (somewhat) from shell scripts.
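
For example, a one-shot, non-interactive call that you could drop into a script (the model path is a placeholder for whatever snapshot you converted):

  # generate up to 128 tokens for a fixed prompt and print them to stdout (load logs go to stderr)
  ./main -m ./models/7B/ggml-model-q4_0.bin -p "Q: What is the capital of France? A:" -n 128 2>/dev/null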

Hoping they add support for llama 2 soon!

It's not a community per se but there's a lot of research and discussion going on directly in the llama.cpp repo (https://github.com/ggerganov/llama.cpp) if you're interested in the more technical side of things.

> In comparison I could just type git clone https://github.com/ggerganov/llama.cpp and make . And it worked.

You're comparing a single, well-managed project that has put effort into user onboarding against all projects of a different language, and proclaiming that an entire language/ecosystem is crap.

The only real take away is that many projects, independent of language, put way too little effort towards onboarding users.

First link: https://github.com/ggerganov/llama.cpp

Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971

Is it really quicker to ask here than to just browse the content for a bit, skim some text, or even use Google for one minute?

To use it with llama.cpp on a CPU with 8 GB of RAM:

  # build llama.cpp and install the Python deps for the conversion script
  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
  python3 -m pip install -r requirements.txt

  # fetch the OpenLLaMA preview weights, then convert to ggml f16 and quantize to q5_0
  cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
  python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
  ./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0

  # run inference on the quantized model
  ./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock

For a general guide, I recommend: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

There's a subreddit r/LocalLLaMA that seems like the most active community focused on self-hosting LLMs. Here's a recent discussion on hardware: https://www.reddit.com/r/LocalLLaMA/comments/12lynw8/is_anyo...

If you're looking just for local inference, your best bet is probably to buy a consumer GPU w/ 24GB of RAM (3090 is fine, 4090 has more performance potential), which can fit a 30B parameter 4-bit quantized model (roughly 15 GB of weights, leaving room for the KV cache) that can probably be fine-tuned to ChatGPT (3.5) level quality. If not, then you can probably add a second card later on.

Alternatively, if you have an Apple Silicon Mac, llama.cpp performs surprisingly well, it's easy to try for free: https://github.com/ggerganov/llama.cpp

Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box. ROCm still has no RDNA3 support (supposedly coming w/ 5.5, but no release date announced) and it's unclear how well it'll work - basically, unless you would rather be fighting w/ hardware than playing around w/ ML, it's probably best to avoid it (the older RDNA cards also don't have tensor cores, so perf would be hobbled even if you could get things running; lots of software has been written w/ CUDA-only in mind).

I can certainly imagine it after seeing https://github.com/ggerganov/llama.cpp

Still a couple years out but moving way faster than I would have expected.