You may need some intermediate knowledge of linear algebra and this thing called "data science", which nowadays pretty much means knowing how to wrangle data and visualize it.
Try creating a small model on your own; it doesn't have to be super fancy, just make sure it does something you want it to do. After that ... you can probably go on from there on your own.
What about https://github.com/ggerganov/llama.cpp ?
It compiles and runs easily on Linux.
If anybody hasn't tried running LLMs yet, here are some lines that do the job in Google Colab or locally. The `!`s are for Colab; remove them for a local terminal. The script downloads the ca. 8 GB model, but llama.cpp can run offline afterwards.
! git clone https://github.com/ggerganov/llama.cpp.git
! wget "https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/resolve/main/codellama-7b.Q8_0.gguf" -P llama.cpp/models
! cd llama.cpp && make
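# interactive instruct mode (-ins); --ctx_size 2048 sets the context window, -n -1 removes the generation length limit, -t 8 is the CPU thread count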
! ./llama.cpp/main -m ./llama.cpp/models/codellama-7b.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8
And... if you'd like a more hands-on approach, here's how to get LLaMA running locally by hand:
- https://github.com/ggerganov/llama.cpp
- follow the instructions to build it (note the `METAL` flag; see the build sketch below)
- https://huggingface.co/models?sort=trending&search=gguf
- pick any `gguf` model that tickles your fancy, download instructions will be there
and a little command like this will get it running swimmingly:
./main -m ./models/<your-model>.gguf --color --keep -1 -n -1 -ngl 32 --repeat_penalty 1.1 -i -ins
Enjoy the next few hours of digging through flags and the wonderful pit of time ahead of you.
NOTE: I'm new at this stuff, feedback welcome.
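For the build step above, here's roughly what it looks like on an Apple Silicon Mac (a sketch from memory; check the repo's README, since the exact Makefile flags change from time to time):
# clone and build with Metal GPU offload enabled (assumes the LLAMA_METAL make flag)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_METAL=1 make
# drop your downloaded .gguf under ./models, then run ./main as above (-ngl offloads layers to the GPU)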
[0] https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-...
- I wouldn't use anything higher than a 7B model if you want decent speed.
- Quantize to 4-bit to save RAM and run inference faster.
Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.
Step 1: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/blob/ma...
Step 2: https://github.com/ggerganov/llama.cpp
Step 3: you're welcome
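For anyone who wants it spelled out, those three steps look something like this (a sketch; the exact 4-bit filename is a guess at the repo's naming, so grab whichever quantization it actually lists):
# step 1: download a 4-bit quantized GGML file (example filename)
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin
# step 2: build llama.cpp
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
# step 3: chat with it
./main -m ../llama-2-7b-chat.ggmlv3.q4_0.bin --ctx_size 2048 -ins -t 8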
This runs with a simple GUI on Windows/Mac/Linux, uses a fork of llama.cpp on the backend, supports GPU acceleration, and handles LLaMA, Falcon, MPT, and GPT-J models. It also has API/CLI bindings.
I just saw a slick new tool, https://ollama.ai/, that lets you run a llama2-7b with a single `ollama run llama2` command. It has a very simple 1-click installer for Apple Silicon Macs (but you need to build from source for anything else atm). It looks like it only supports llamas OOTB, but it also seems to use llama.cpp (via a Go adapter) on the backend. It seemed to be CPU-only on my MBA, but I didn't poke around too much and it's brand new, so we'll see.
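For reference, once the installer has put `ollama` on your PATH, the whole flow is reportedly just the one command:
# downloads the llama2 7B model on first use, then drops you into an interactive prompt
ollama run llama2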
Anyone on HN should probably be looking at https://github.com/ggerganov/llama.cpp and https://github.com/ggerganov/ggml directly. If you have a high-end Nvidia consumer card (3090/4090), I'd highly recommend looking into https://github.com/turboderp/exllama
For those generally confused, the r/LocalLLaMA wiki is a good place to start: https://www.reddit.com/r/LocalLLaMA/wiki/guide/
I've also been porting my own notes into a single location that tracks models and evals and has guides focused on local models: https://llm-tracker.info/
Hoping they add support for llama 2 soon!
You're comparing a single, well-managed project that has put effort into user onboarding against every project in a different language, and proclaiming that an entire language/ecosystem is crap.
The only real takeaway is that many projects, independent of language, put far too little effort into onboarding users.
Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971
Is it really quicker to ask here than to just browse the content for a bit, skim some text, or even use Google for one minute?
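# clone and build llama.cpp (the CMake build puts binaries in ./build/bin)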
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
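# install the Python dependencies needed by the conversion script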
python3 -m pip install -r requirements.txt
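# fetch the OpenLLaMA 7B (200B-token preview) weights into ./models (the large files come via git-lfs)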
cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
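# convert the PyTorch/transformers weights to ggml f16 (the trailing 1 selects f16 output)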
python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
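# quantize the f16 model down to 5-bit (q5_0) to shrink it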
./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
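# run the quantized model with a test prompt (--mlock keeps it pinned in RAM)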
./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock
There's a subreddit r/LocalLLaMA that seems like the most active community focused on self-hosting LLMs. Here's a recent discussion on hardware: https://www.reddit.com/r/LocalLLaMA/comments/12lynw8/is_anyo...
If you're looking just for local inference, your best bet is probably to buy a consumer GPU w/ 24GB of RAM (a 3090 is fine, a 4090 has more performance potential), which can fit a 30B-parameter 4-bit quantized model (30B params at 4 bits is roughly 15 GB of weights, leaving headroom for context) that can probably be fine-tuned to ChatGPT (3.5) level quality. If that turns out not to be enough, you can probably add a second card later on.
Alternatively, if you have an Apple Silicon Mac, llama.cpp performs surprisingly well and it's easy to try for free: https://github.com/ggerganov/llama.cpp
Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or the DirectML ports, but nothing will run out of the box. ROCm still has no RDNA3 support (supposedly coming w/ 5.5, but no release date has been announced) and it's unclear how well it'll work. Basically, unless you'd rather fight with hardware than play around with ML, it's probably best to avoid AMD for now (the older RDNA cards also don't have tensor cores, so perf would be hobbled even if you could get things running, and lots of software has been written with CUDA-only in mind).
Still a couple years out but moving way faster than I would have expected.