What does HackerNews think of GPTQ-for-LLaMa?
4-bit quantization of LLaMA using GPTQ
Perplexity - model options
5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16
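For reference, perplexity here is exp of the average negative log-likelihood per token on a held-out text; lower is better, and the f16 row is the unquantized baseline the 4-bit variants are compared against. A minimal sketch of the computation, assuming you already have per-position logits and target tokens (the names below are illustrative, not taken from llama.cpp):

    import torch
    import torch.nn.functional as F

    def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
        """Perplexity = exp(mean negative log-likelihood per token).

        logits:  (num_tokens, vocab_size) raw scores at each position
        targets: (num_tokens,) the token that actually followed each position
        """
        nll = F.cross_entropy(logits, targets, reduction="mean")
        return float(torch.exp(nll))

    # Toy check: a uniform model over a 32k vocab has perplexity == vocab size.
    vocab = 32000
    logits = torch.zeros(10, vocab)           # uniform predictions
    targets = torch.randint(0, vocab, (10,))  # arbitrary "ground truth"
    print(perplexity(logits, targets))        # ~32000.0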
According to the repo[0], the difference is about 3% in their implementation with the right group size. If you'd like to know more, I think you should read the GPTQ paper[1].
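To make "right group size" concrete: group-wise quantization gives each block of, say, 128 weights its own scale and offset, so smaller groups track the local weight range more closely at the cost of storing more metadata. Below is a simplified round-to-nearest sketch of 4-bit group quantization; GPTQ itself goes further and compensates each column's rounding error using second-order (Hessian) information, which this deliberately omits.

    import numpy as np

    def quantize_4bit_groups(w, group_size=128):
        # Round-to-nearest 4-bit quantization, one scale/offset per group.
        # Illustration only: GPTQ quantizes column by column and adjusts
        # the not-yet-quantized weights to absorb the rounding error.
        w = w.reshape(-1, group_size)
        w_min = w.min(axis=1, keepdims=True)
        scale = (w.max(axis=1, keepdims=True) - w_min) / 15.0  # 16 levels for 4 bits
        q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
        return q, scale, w_min

    def dequantize_4bit_groups(q, scale, w_min):
        return q.astype(np.float32) * scale + w_min

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, scale, offset = quantize_4bit_groups(w)
    w_hat = dequantize_4bit_groups(q, scale, offset).reshape(w.shape)
    print("mean abs error:", float(np.abs(w_hat - w).mean()))

Shrinking the group size lowers the reconstruction error (each scale covers a narrower range of values) but adds more scale/offset entries to store, which is the trade-off behind the roughly 3% figure quoted above.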
With 30b-4bit on an RTX 4090, I'm seeing numbers like:
Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)
Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)
Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)
Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)
The smaller sizes (7B, 13B) are even faster, with lower memory use. A 16GB 3080 should be able to run the 13B at 4-bit just fine with reasonable (>1 token/s) latency.
LLaMA-13B, rivaling GPT-3 175B, requires only 10GB* of VRAM with 4bit GPTQ quantization.
LLaMA-30B fits on a 24GB* consumer video card with no output performance loss, beating GPT-3 175B.
Multi-GPU support[2] means LLaMA-65B, rivaling PaLM-540B, runs on 2x3090.
*Further improvements in active development (e.g. flash attention) will reduce VRAM requirements by another 30-40% with no performance loss.
[0] https://github.com/qwopqwop200/GPTQ-for-LLaMa
[1] https://arxiv.org/abs/2210.17323
[2] https://github.com/oobabooga/text-generation-webui/issues/14...
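The VRAM figures quoted above follow from a back-of-the-envelope estimate: at 4 bits a parameter takes half a byte, so 13B parameters is roughly 6.5 GB of weights and 30B roughly 15 GB, with the remaining budget going to activations, the KV cache, and any layers kept in higher precision. A rough sketch, where the overhead factor is an assumption rather than a measured value:

    def approx_vram_gb(params_billion, bits=4, overhead=1.2):
        # Weight bytes times a fudge factor for activations, the KV cache,
        # and layers kept in higher precision; overhead=1.2 is a guess.
        return params_billion * bits / 8 * overhead

    for n in (7, 13, 30, 65):
        print(f"LLaMA-{n}B at 4-bit: ~{approx_vram_gb(n):.1f} GB")
    # 13B lands around 8 GB (consistent with the ~10GB figure above),
    # 30B around 18 GB (fits a 24GB card), 65B around 39 GB (split over 2x3090).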
For LLaMA set-up instructions, including GPTQ 4bit, refer to this wiki article: https://github.com/oobabooga/text-generation-webui/wiki/LLaM...