What does HackerNews think of koboldcpp?
A simple one-file way to run various GGML models with KoboldAI's UI
https://github.com/LostRuins/koboldcpp
The biggest catch is it doesn't support llama.cpp's continuous batching yet. Maybe soon?
- The AI Horde hosts a web app (Kobold Lite) geared towards LLM chat/instruct. It's mature, predating LLaMA and GPT-3.5, and was largely developed when the RP community was running GPT-J finetunes. There are desktop apps that can access this API as well.
- The user sets the chat syntax/format and picks an LLM host (or multiple hosts).
- These hosts run API endpoints on their own PCs/servers for Horde users to access. The backends du jour are koboldcpp (a frontend for llama.cpp that is excellent, portable, and literally one click) and KoboldAI (with the very fast and VRAM-efficient exllamav2 backend):
https://github.com/LostRuins/koboldcpp
https://github.com/henk717/KoboldAI
- Hosts pick a quantized community LLM to run, which is (IMO) the real magic of this system. Cloud services tend to run generic Llama chat/instruct models, OpenAI API models, or maybe a single proprietary finetune, but the Llama/Mistral finetuning community is red hot. New finetunes and crazy merges/hybrids that outperform llama-chat in specific tasks (mostly Chat/Storytelling/RP) come out every day, and each one has a different "flavor" and format:
https://huggingface.co/models?sort=modified&search=mistral+g...
https://huggingface.co/models?sort=modified&search=13b+exl2
https://huggingface.co/models?sort=modified&search=20b
- The horde workers then earn "kudos" for serving requests to clients. Anyone can use the kobold horde with no login, but when requests are queued, kudos earn hosts priority access to LLMs other hosts are running.
I really like this scheme. Perverse incentives for spamming and such are minimal. It's free and easy for users on old hardware, but excessive leechers are deprioritized through the kudos system. It encourages experimentation with hosting (and trying) new finetunes, and gives hosts access to models they normally wouldn't know about or couldn't run. The horde API seems relatively simple and cheap to host, since it's just text prompts/responses bouncing around. It's easy for me to host on a spare laptop, or in the background when I'm not stressing my desktop GPU.
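To give a concrete sense of how lightweight the Horde API is, here is a rough Python sketch of a client request cycle. The endpoint paths, payload fields, and anonymous key are my best understanding of the v2 async text API, so double-check them against the official docs before relying on them:

```
# Rough sketch of a Horde text request: submit a job, then poll for the result.
# Endpoint paths and payload fields are assumptions based on the v2 async text
# API; verify against https://aihorde.net/api before using.
import time
import requests

API = "https://aihorde.net/api/v2"
HEADERS = {"apikey": "0000000000"}  # anonymous key; registered keys earn/spend kudos

payload = {
    "prompt": "### Instruction:\nWrite a haiku about llamas.\n### Response:\n",
    "params": {"max_length": 120},
    # "models": ["..."],  # optionally pin specific community finetunes
}

job = requests.post(f"{API}/generate/text/async", json=payload, headers=HEADERS).json()
job_id = job["id"]

# Poll until a worker picks up the job and finishes it.
while True:
    status = requests.get(f"{API}/generate/text/status/{job_id}", headers=HEADERS).json()
    if status.get("done"):
        print(status["generations"][0]["text"])
        break
    time.sleep(2)
```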
There seems to be tons of interest in making LLMs "easy" to play with and giving access to GPU-poor users, but Kobold Horde has somehow flown under the radar.
The UI is relatively mature, as it predates LLaMA. It includes upstream llama.cpp PRs, integrated AI Horde support, lots of sampling tuning knobs, easy GPU/CPU offloading, and it's basically dependency-free.
* llama.cpp - https://github.com/ggerganov/llama.cpp
* KoboldCpp - https://github.com/LostRuins/koboldcpp
* GPT4All - https://gpt4all.io/index.html
llama.cpp will run LLMs that have been ported to the GGUF format. If you have enough RAM, you can even run the big 70-billion-parameter models. If you have a CUDA GPU, you can offload part of the model onto the GPU and have the CPU do the rest, which gives a partial performance benefit. The issue is that the big models run too slowly on a CPU to feel interactive. Without a GPU, you'll get much more reasonable performance running a smaller 7-billion-parameter model instead. The responses won't be as good as the larger models', but they may still be good enough to be worthwhile.
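If you'd rather drive this from Python than from the ./main binary, the llama-cpp-python bindings expose the same partial-offload knob. A minimal sketch, assuming a GPU-enabled build, with the model path and layer count as placeholders:

```
# Minimal llama-cpp-python sketch: offload part of a GGUF model to the GPU
# and run the rest on the CPU. n_gpu_layers is the knob to tune for your VRAM;
# the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",
    n_ctx=2048,        # context window
    n_gpu_layers=30,   # layers pushed to the GPU; 0 = pure CPU
    n_threads=8,       # CPU threads for the layers that stay on the CPU
)

out = llm(
    "### Instruction:\nExplain quantization in two sentences.\n### Response:\n",
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```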
Also, development in this space is still moving extremely rapidly, especially for specialized models like ones tuned for coding.
Grab KoboldCPP and a GGML model from TheBloke that fits in your RAM/VRAM and try it.
Make sure you follow the prompt structure shown on TheBloke's download page for the model (very important).
KoboldCPP: https://github.com/LostRuins/koboldcpp
TheBloke: https://huggingface.co/TheBloke
I would start with a 13B or 7B model quantized to 4 bits just to get the hang of it: some generic or storytelling model.
Just make sure you follow the prompt structure that the model card lists.
KoboldCPP is very easy to use. You just drag the model file onto the executable, wait till it loads and go to the web interface.
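The "prompt structure" point is worth spelling out. Many of TheBloke's model cards list an Alpaca-style instruct template like the one sketched below; treat it as an illustration only, since the exact header lines differ per finetune:

```
# Illustrative Alpaca-style instruct template. Some finetunes use
# "USER:/ASSISTANT:", ChatML, or other layouts instead, so always copy the
# template from the model card rather than this example.
def build_prompt(instruction: str) -> str:
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

print(build_prompt("Write a short story about llamas."))
```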
* arch linux has tons of `rocm` packages. I installed pretty much all of them: https://archlinux.org/packages/?sort=&q=rocm&maintainer=&fla...
* you also need this one package from AUR: https://aur.archlinux.org/packages/opencl-amd
* llama.cpp now has GPU support including "CLBlast", which is what we need for this, so compile with LLAMA_CLBLAST=ON
* now you can run any model llama.cpp supports, so grab some ggml models that fit on the card from https://huggingface.co/TheBloke.
* Test it out with: `./main -t 30 -ngl 128 -m huginnv1.2.ggmlv3.q6_K.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"`
* You should see `BLAS = 1` in the llama.cpp output and you should get maybe 5 tokens per second on a 13b 6bit quantized ggml model.
* You can compile llama-cpp-python with the same arguments and get text-generation-ui to work also, but there's a bit of dependency fighting to do it.
* koboldcpp might be better, I just haven't tried it yet
Hope that helps!
Edit: just tried https://github.com/LostRuins/koboldcpp and it also works great. I should have started here probably.
Compile with `make LLAMA_CLBLAST=1` and run with `python koboldcpp.py --useclblast 0 0 --gpulayers 128 huginnv1.2.ggmlv3.q6_K.bin`
- Download koboldcpp: https://github.com/LostRuins/koboldcpp
- Download your 70B ggml model of choice, for instance airoboros 70B Q3_K_L: https://huggingface.co/models?sort=modified&search=70b+ggml
- Run koboldcpp with OpenCL (or ROCm), offloading as many layers as you can manage to the GPU. If you use ROCm, you need to install the ROCm packages from your Linux distro (or directly from AMD on Windows).
- Access the UI over HTTP. Switch to instruct mode and copy in the correct prompt formatting from the model download page. (Koboldcpp also exposes a local HTTP API; see the sketch after these steps.)
- If you are feeling extra nice, get an AI Horde API key and contribute your idle time to the network, and try out models from other hosts: https://lite.koboldai.net/#
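If you'd rather script against the running instance than use the web UI, here is a minimal sketch of koboldcpp's KoboldAI-compatible HTTP API, assuming the default port 5001 and the /api/v1/generate route (check the URL koboldcpp prints at startup):

```
# Rough sketch of hitting a local koboldcpp instance over its KoboldAI-style
# HTTP API. The port (5001) and route are the defaults I've seen; verify them
# against the address koboldcpp prints when it starts.
import requests

payload = {
    "prompt": "### Instruction:\nSummarize why quantization helps on CPUs.\n### Response:\n",
    "max_length": 200,
    "temperature": 0.7,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```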
It should be much faster with llama.cpp. My old-ish laptop CPU (AMD 4900HS) can ingest a big prompt reasonably quickly and then stream text fast enough to (slowly) read.
If you have any kind of dGPU, even a small laptop one, prompt ingestion is dramatically faster.
Try the latest Kobold release: https://github.com/LostRuins/koboldcpp
But to answer your question: the GGML CPU implementation is very good, and actually generating the response is largely serial and more RAM-bandwidth bound than compute bound.
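A quick back-of-the-envelope calculation shows why generation ends up bandwidth bound; the model size and RAM bandwidth figures below are illustrative assumptions, not measurements:

```
# Back-of-the-envelope: each generated token has to stream essentially all of
# the quantized weights through the CPU, so RAM bandwidth caps tokens/sec.
# Both numbers below are illustrative assumptions, not measurements.
model_size_gb = 7.4        # roughly a 13B model at 4-bit quantization
ram_bandwidth_gbps = 45.0  # rough dual-channel DDR4 laptop figure

upper_bound_tps = ram_bandwidth_gbps / model_size_gb
print(f"~{upper_bound_tps:.1f} tokens/sec upper bound before any compute cost")
# -> roughly 6 tokens/sec, which matches the "readable but slow" experience.
```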
There may be even larger llama finetunes for Japanese... but Hugging Face is tricky to search.
I would suggest baking the LoRA into a GGML file and then running it with koboldcpp (with OpenCL offloading) for maximum ease of use.
https://github.com/LostRuins/koboldcpp
You could get even better results using a llama.cpp grammar for Japanese, but I have not been down that rabbit hole.
GGML 13B models at 4-bit (Q4_0) take somewhere around 9 GB of RAM, and Q5_K_M takes about 11 GB. GPU offloading support has also been added; I've been using 22 layers with CLBlast on my laptop's RTX 2070 Max-Q (8 GB VRAM). I get around 2-3 tokens per second with 13B models. In my experience, running 13B models is worth the extra time it takes to generate a response compared to 7B models. GPTQ is faster, I think, but I can't fit a quantized 13B model in VRAM, so I don't use it.
TheBloke [2] has been quantizing models and uploading them to HF, and will probably upload a quantized version of this model soon. His Discord server also has good guides to help you get going, linked in the model card of most of his models.
https://github.com/LostRuins/koboldcpp
https://huggingface.co/TheBloke
Edit: There's a bug with the newest Nvidia drivers that causes a slowdown at large context sizes. I downgraded and stayed on 531.61. The theory is that newer drivers change how CUDA out-of-memory handling works when trying to avoid OOM errors.
https://www.reddit.com/r/LocalLLaMA/comments/1461d1c/major_p...
Native fine tuning is still out of consumer reach for the foreseeable future, but there are people experimenting with QLoRAs. The pipeline is still relatively new though, and is a bit involved.
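For a sense of what that pipeline involves, here is a compressed sketch of the common Hugging Face QLoRA recipe (4-bit base model via bitsandbytes plus LoRA adapters via peft). The model name, target modules, and hyperparameters are placeholders, and a real run still needs a dataset, a training loop, and VRAM-dependent tuning:

```
# Compressed QLoRA sketch: load the base model in 4-bit with bitsandbytes,
# then attach small trainable LoRA adapters with peft. Model name, target
# modules, and hyperparameters are placeholders; a real run still needs a
# dataset and a Trainer/SFTTrainer loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters varies by model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```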
https://github.com/LostRuins/koboldcpp
It's a llama.cpp wrapper descended from the roleplaying community, but it works fine (and performs well) for question answering and such.
You will need to download the model from HF and quantize it yourself: https://github.com/ggerganov/llama.cpp#prepare-data--run