This covers three things: llama.cpp (Mac/Windows/Linux), Ollama (Mac), and MLC LLM (iOS/Android).

Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I'm sure there are other things that could be covered.
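For the GPU-server route, here's a minimal sketch of querying a locally running text-generation-inference server, assuming it's already up on port 8080 (e.g. via its Docker image); the prompt and generation parameters are just placeholders:

```python
import requests

# Query a local text-generation-inference server (assumed to be listening on :8080).
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=120,
)
print(resp.json()["generated_text"])
```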

Just a note that you need at least 12GB of VRAM for it to be worth even trying to use your GPU for LLaMA 2.

The 7B model quantized to 4 bits can fit in 8GB of VRAM with room for the context, but in my experience it's pretty useless for getting good results. 13B is better, but still nowhere near as good as the 70B, which would require >35GB of VRAM at 4-bit quantization.
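The back-of-the-envelope math behind those numbers (weights only, 4 bits per parameter; the KV cache and runtime overhead come on top):

```python
# Rough VRAM estimate for 4-bit quantized weights, ignoring context/KV cache
# and framework overhead, which add several more GB on top.
def approx_vram_gb(n_params_billion: float, bits: int = 4) -> float:
    bytes_per_param = bits / 8
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

for size in (7, 13, 70):
    print(f"{size}B @ 4-bit: ~{approx_vram_gb(size):.1f} GB for weights alone")
# 7B ~3.3 GB, 13B ~6.1 GB, 70B ~32.6 GB -- hence ">35GB" once the context is included.
```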

My solution for playing with this was just to upgrade my PC's RAM to 64GB. It's slower than running on a GPU, but it was way cheaper and I can run the 70B model easily.
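For reference, a minimal sketch of that CPU setup using llama.cpp's Python bindings; the GGUF filename and thread count are placeholders you'd swap for your own quantized download and core count:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Run a 4-bit quantized 70B chat model entirely on CPU (needs ~40GB+ of system RAM).
llm = Llama(
    model_path="./llama-2-70b-chat.Q4_K_M.gguf",  # placeholder: any 4-bit 70B chat GGUF
    n_ctx=2048,      # context window
    n_threads=8,     # tune to your physical core count
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```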

I have 2x 3090s. Do you know if it's feasible to use that 48GB total for running this?

Yes, it runs totally fine. I ran it in Oobabooga/text-generation-webui. The nice thing about it is that it auto-downloads all the necessary GPU binaries on its own and creates an isolated conda env. I asked the same questions on the official 70B demo and got the same answers. I even got better answers with ooba, since the demo cuts text off early.

Oobabooga: https://github.com/oobabooga/text-generation-webui

Model: TheBloke_Llama-2-70B-chat-GPTQ from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

ExLlama_HF loader, GPU split 20,22, context size 2048

On the Chat Settings tab, choose the Instruction template tab and pick Llama-v2 from the instruction template dropdown.

Demo: https://huggingface.co/blog/llama2#demo
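If you'd rather script it than click through the web UI, a rough equivalent of that 20,22 GPU split is to let accelerate shard the GPTQ checkpoint across the two cards with a max_memory cap. This is a generic transformers sketch (it needs accelerate plus auto-gptq/optimum installed), not the ExLlama_HF loader ooba actually uses:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-70B-chat-GPTQ"
tok = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" + max_memory shards the layers across the two 3090s,
# mirroring the "gpu split 20,22" setting in the web UI.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "20GiB", 1: "22GiB"},
)

# Llama-2-chat instruction format (the "Llama-v2" template in ooba).
prompt = "[INST] What can you tell me about llamas? [/INST]"
inputs = tok(prompt, return_tensors="pt").to("cuda:0")  # inputs go to the first shard
output = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(output[0], skip_special_tokens=True))
```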