Has anyone done a comprehensive analysis of exactly how much quantization affects the quality of model output? I haven't seen anything more than people running it and being impressed (or not) by a few sample outputs.

I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.

Some results here: https://github.com/ggerganov/llama.cpp/discussions/406

tl;dr: quantizing the 13B model gives up about 30% of the perplexity improvement you get from moving from 7B to 13B, so a quantized 13B is still much better than an unquantized 7B. The results are similar for the larger models.
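In case it helps frame what those numbers measure: perplexity is just the exponential of the mean per-token negative log-likelihood over a held-out text, so a gap of a few tenths reflects a consistent per-token difference. A minimal sketch (the NLL values below are made up purely for illustration):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp of the mean per-token negative log-likelihood; lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs from evaluating the same text with an fp16 model
# and its 4-bit quantized counterpart (not real measurements).
nll_f16 = [2.31, 1.87, 3.02, 2.45]
nll_q4  = [2.36, 1.91, 3.10, 2.49]

print(f"f16 ppl: {perplexity(nll_f16):.3f}")
print(f"q4  ppl: {perplexity(nll_q4):.3f}")
```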

I wonder where such a difference between llama.cpp and the [1] repo comes from. The F16 difference in perplexity is 0.3 on the 7B model, which is not insignificant. The ggml quirks definitely need to be fixed.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa
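For what it's worth, some of that F16 gap could come from the evaluation protocol itself (tokenizer, context length, stride) rather than from ggml, since those choices shift the reported perplexity. Here is a rough sketch of the usual stride-based perplexity evaluation with Hugging Face transformers, assuming a hypothetical local HF-format LLaMA checkpoint and wiki.test.raw; this is only to show which knobs can move the number, not either repo's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-7b-hf"  # placeholder: any causal LM in HF format
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).eval()

ids = tokenizer(open("wiki.test.raw").read(), return_tensors="pt").input_ids

max_len, stride = 2048, 512  # changing these changes the reported perplexity
nlls, n_tokens = [], 0
for begin in range(0, ids.size(1), stride):
    end = min(begin + max_len, ids.size(1))
    trg_len = end - begin if begin == 0 else stride  # only score the newly added tokens
    input_ids = ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the context tokens out of the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)  # approximate total NLL over the scored window
    n_tokens += trg_len
    if end == ids.size(1):
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_tokens).item())
```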