What does HackerNews think of GPTQ-for-LLaMa?

4-bit quantization of LLaMA using GPTQ

Language: Python

I wonder where such a difference between llama.cpp and the [1] repo comes from. The f16 perplexity difference is 0.3 on the 7B model, which is not insignificant. ggml quirks definitely need to be fixed.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

For this specific implementation, here's the info from the llama.cpp repo:

Perplexity (lower is better) - model, options

5.5985 - 13B, q4_0

5.9565 - 7B, f16

6.3001 - 7B, q4_1

6.5949 - 7B, q4_0

6.5995 - 7B, q4_0, --memory_f16
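
For reference, the perplexity figures above are exp(average negative log-likelihood) over a held-out text; llama.cpp's perplexity tool measures this over wikitext-2 in 512-token windows. A rough sketch of the same measurement in Python with Hugging Face transformers, where the model path and eval file are placeholders and the chunked averaging is approximate:

    # Rough perplexity sketch: exp(mean negative log-likelihood) over a text file.
    # Model path and eval file are placeholders; llama.cpp's own perplexity tool
    # does the equivalent over wikitext-2 in 512-token windows.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/llama-7b"          # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"  # needs accelerate
    )
    model.eval()

    text = open("wiki.test.raw").read()      # placeholder eval corpus
    ids = tokenizer(text, return_tensors="pt").input_ids[0]

    chunk = 512                              # llama.cpp evaluates 512-token windows
    nll, count = 0.0, 0
    with torch.no_grad():
        for start in range(0, ids.numel() - chunk, chunk):
            window = ids[start : start + chunk].unsqueeze(0).to(model.device)
            # labels == inputs -> HF returns mean cross-entropy over the window
            loss = model(window, labels=window).loss
            nll += loss.item() * (chunk - 1)
            count += chunk - 1

    print(f"perplexity: {math.exp(nll / count):.4f}")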

According to this repo[1], the difference is about 3% in their implementation with the right group size. If you'd like to know more, I think you should read the GPTQ paper[2].

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

[2] https://arxiv.org/abs/2210.17323
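
For what it's worth, the "group size" being discussed is how many consecutive weights share one quantization scale and zero-point (128 is the commonly used setting with this repo). GPTQ itself goes further than plain rounding, compensating rounding error with second-order information (that's what the paper[2] covers), but a minimal round-to-nearest sketch shows what per-group 4-bit quantization means; everything below is illustrative, not the repo's actual code:

    # Minimal round-to-nearest 4-bit quantization with a per-group scale and
    # zero-point. Real GPTQ additionally applies error-compensating weight
    # updates (Hessian-based), which is why it loses so little accuracy.
    import torch

    def quantize_rtn_4bit(weight: torch.Tensor, group_size: int = 128):
        """weight: (out_features, in_features). Returns int codes, scales, zeros."""
        out_f, in_f = weight.shape
        w = weight.reshape(out_f, in_f // group_size, group_size)

        w_min = w.amin(dim=-1, keepdim=True)
        w_max = w.amax(dim=-1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4 bits -> 16 levels
        zero = torch.round(-w_min / scale)

        q = torch.clamp(torch.round(w / scale) + zero, 0, 15)
        return q.to(torch.uint8), scale, zero

    def dequantize(q, scale, zero, shape):
        return ((q.float() - zero) * scale).reshape(shape)

    w = torch.randn(4096, 4096)
    q, s, z = quantize_rtn_4bit(w, group_size=128)
    err = (dequantize(q, s, z, w.shape) - w).abs().mean().item()
    print(f"mean abs quantization error: {err:.5f}")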

The 4-bit GPTQ LLaMA models are the current top performers. This repo has done a lot of the heavy lifting: https://github.com/qwopqwop200/GPTQ-for-LLaMa

With the 30B model at 4-bit on an RTX 4090, I'm seeing numbers like:

Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)

Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)

Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)

Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)

The smaller sizes (7B, 13B) are even faster, with lower memory use. A 16GB 3080 should be able to run the 13B at 4-bit just fine at a reasonable speed (>1 token/s).
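
A quick back-of-envelope check on the memory side of that claim: at 4 bits a weight takes half a byte, so the 13B weights alone come to roughly 6 GiB, before group scales/zeros, the KV cache, and framework overhead. A small sketch of that arithmetic:

    # Rough VRAM estimate for LLaMA weights alone (ignores scales/zeros,
    # KV cache and framework overhead, so treat it as a lower bound).
    def weight_gib(params_billion: float, bits: int) -> float:
        bytes_total = params_billion * 1e9 * bits / 8
        return bytes_total / 2**30

    for size in (7, 13, 30, 65):
        print(f"LLaMA-{size}B: fp16 ~{weight_gib(size, 16):5.1f} GiB, "
              f"4-bit ~{weight_gib(size, 4):5.1f} GiB")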

text-generation-webui supports state-of-the-art 4-bit GPTQ quantization for LLaMA[0], reducing VRAM use by 75% with no loss in output quality compared to baseline fp16.[1]

LLaMA-13B, rivaling GPT-3 175B, requires only 10GB* of VRAM with 4-bit GPTQ quantization.

LLaMA-30B fits on a 24GB* consumer video card with no loss in output quality, beating GPT-3 175B.

Multi-GPU support[2] means LLaMA-65B, rivaling PaLM-540B, runs on 2x RTX 3090 (see the sharding sketch further down).

*Further improvements in active development will reduce VRAM requirements by another 30-40% with no performance loss (e.g. flash attention).

[0] https://github.com/qwopqwop200/GPTQ-for-LLaMa

[1] https://arxiv.org/abs/2210.17323

[2] https://github.com/oobabooga/text-generation-webui/issues/14...

For LLaMA set-up instructions, including 4-bit GPTQ, refer to this wiki article: https://github.com/oobabooga/text-generation-webui/wiki/LLaM...
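
On the multi-GPU point above, the usual approach is to shard layers across cards. The sketch below uses Hugging Face accelerate-style device maps to illustrate the idea; the model path and memory limits are placeholders, and text-generation-webui's own GPTQ multi-GPU loader works differently under the hood:

    # Sketch of layer-wise sharding across two GPUs via accelerate-style
    # device maps. This only illustrates the general "split layers across
    # devices" idea, not the GPTQ-specific loading path.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/llama-65b"                  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",                            # spread layers over visible GPUs
        max_memory={0: "22GiB", 1: "22GiB"},          # leave headroom on each 3090
    )

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))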

I'm running 4-bit quantized LLaMA models on torch/CUDA with https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing a significant tokens/second performance degradation compared to 8-bit bitsandbytes mode. I'm very new to this and understand very little of the detail, but I thought it would be faster?
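
One way to pin that comparison down is to measure tokens per second the same way the webui reports it: new tokens divided by wall-clock generation time. A minimal, backend-agnostic sketch, assuming `model` and `tokenizer` are already loaded for whichever path you're comparing:

    # Minimal tokens/s measurement for a loaded causal LM, usable with either
    # a 4-bit GPTQ model or an 8-bit bitsandbytes model to compare the two.
    import time
    import torch

    def tokens_per_second(model, tokenizer, prompt: str, max_new_tokens: int = 64):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
        return new_tokens / elapsed

    # print(tokens_per_second(model, tokenizer, "Once upon a time"))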