Impact of quantization on LLMs
Quantization reduces model accuracy, but what does that actually mean?
When hosting an LLM, either locally or in the cloud, you can use either the full-sized version of the model or one of the often more manageable, smaller quantized versions. There are usually several quantized versions to choose between, so how do you know which one is best for you? Quantization "may reduce accuracy" in the model, but what does that really mean?
At the end of this article, you will know the answers to these questions and have the knowledge to pick the right LLM version for your hardware or hosting environment.
What is quantization and why is it used?
Quantization is a technique that strategically reduces the precision of numerical data to achieve a more compact representation. In practice, this means converting numbers stored at higher precision (using more bits and finer-grained values) into numbers stored at lower precision (using fewer bits and coarser values). In the context of Large Language Models (LLMs), quantization is used to significantly reduce memory usage and accelerate computation, making it possible to deploy and run these large models more efficiently in different environments and on consumer hardware.
What we want to answer is what the impact of quantization actually looks like. For example, an image that has been compressed with too much quantization becomes blocky, and color transitions, such as in a clear blue sky, become less smooth. It's very easy to see when an image has been compressed too much, but how do you "see" if a large language model has been compressed too much? And how do we balance memory usage and accelerated computation against loss of precision or accuracy?
Method of investigation
To answer these questions we will first construct a quantitative test, which provides insights into what happens to a model as it is subjected to increasing levels of quantization. Based on the insights gained, we will then run a number of qualitative tests (in part two of this article) to get deeper insights and a tangible understanding of what the impact of quantization really is, beyond "reduced accuracy".
Constructing the quantitative test
To provide any reliable insights, we need to construct a test with more than a handful of questions; we need to scale up. To do this efficiently, we will repurpose the well-known MMLU benchmark test.
This benchmark test consists of more than 14,000 questions, covering four different areas or categories, with subjects ranging from high school education to machine learning. The principle behind the benchmark is to present a question and four possible answers to the LLM and have it decide which answer is correct.
This makes it rather straightforward to implement a very basic test framework around the benchmark while hosting the LLM models locally using Ollama.
While there are various open-source test frameworks available, such as LM-Eval, I chose to implement the test framework myself, for convenience and simplicity when evaluating quantization. I used the default (zero-shot) prompt template for MMLU in LM-Eval, but implemented the evaluation of the LLM's responses differently.
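To make this concrete, here is a minimal sketch of what such a harness can look like, assuming the ollama Python client and a locally running Ollama instance. The prompt layout follows the common zero-shot MMLU template; the helper names are illustrative and simplified compared to the actual test code.

```python
# Minimal sketch: build a zero-shot MMLU prompt and query a locally hosted model via Ollama.
import ollama

CHOICE_LABELS = ["A", "B", "C", "D"]

def build_prompt(subject: str, question: str, choices: list[str]) -> str:
    # Zero-shot MMLU-style prompt: subject header, question, four labeled options, "Answer:"
    lines = [f"The following are multiple choice questions (with answers) about {subject}.", ""]
    lines.append(question)
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def ask_model(model: str, prompt: str) -> str:
    # Temperature 0 keeps the runs deterministic and comparable across quantizations.
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0},
    )
    return response["message"]["content"].strip()
```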
The response from the LLM is categorized as either Correct, Incorrect or Invalid. The answer needs to be a single letter (corresponding to the chosen option), or a single letter directly followed by a "."; anything else is considered invalid. This is important, as it helps us measure how well the LLM complies with the provided instructions. A valid answer is categorized as Correct if it matches the right answer according to the benchmark dataset.
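The classification rule can be sketched as follows (a simplified illustration of the logic described above):

```python
# Sketch of the answer classification: a reply is valid only if it is a single
# option letter, optionally followed by a ".", e.g. "B" or "B.".
import re

VALID_ANSWER = re.compile(r"^[ABCD]\.?$")

def classify(reply: str, correct_letter: str) -> str:
    reply = reply.strip()
    if not VALID_ANSWER.match(reply):
        return "Invalid"      # the model failed to follow the instructions
    return "Correct" if reply[0] == correct_letter else "Incorrect"
```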
Why MMLU as the test?
While metrics such as perplexity are sometimes used for measuring the impact of changes to a model, such as quantization, the actual perplexity values depend on several different parameters, making them difficult to compare across models and test setups. Perplexity primarily measures how well the model predicts the next word, which is still very abstract in terms of understanding how quantization changes the model's accuracy. MMLU, on the other hand, has been a dominant benchmark for LLMs, with scores that can be compared across models. The test is also more relatable, as it focuses on testing a model's breadth and depth of knowledge, reasoning and problem-solving abilities across a wide range of subjects.
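For reference, perplexity is essentially the exponentiated average negative log-likelihood the model assigns to a sequence of tokens:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

Since the value depends on the tokenizer, the evaluation text and the context length used, two perplexity numbers are only directly comparable when all of those match, which is rarely the case across models.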
Models and quantizations
With the MMLU benchmark ready, I picked two LLMs, the Qwen 2.5 7B and 14B Instruct models (1M context length versions), quantized by mradermacher, who provides a wide range of both static and weighted/iMatrix quantizations on HuggingFace.
This gives a total of 16 test runs, where each model and selected quantization runs through the MMLU benchmark of 14,000+ questions.
The tests were executed on a machine with the above specifications.
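In sketch form, the test matrix is simply a loop over model/quantization pairs, reusing the helpers sketched earlier. The model tags and dataset field names below are illustrative, not the exact identifiers used in the actual runs.

```python
# Sketch of the test matrix: each (model, quantization) pair runs through the full MMLU set.
from collections import Counter

MODELS = ["qwen2.5-7b-instruct-1m", "qwen2.5-14b-instruct-1m"]       # illustrative tags
QUANTS = ["Q8_0", "Q6_K", "Q4_K_M", "Q3_K_M", "IQ3_XXS", "IQ2_XXS"]  # subset of the levels tested

def run_benchmark(model_tag: str, questions: list[dict]) -> Counter:
    results = Counter()
    for q in questions:  # 14,000+ MMLU questions
        prompt = build_prompt(q["subject"], q["question"], q["choices"])
        reply = ask_model(model_tag, prompt)
        results[classify(reply, q["answer"])] += 1
    return results

# for model in MODELS:
#     for quant in QUANTS:
#         stats = run_benchmark(f"{model}:{quant}", mmlu_questions)
```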
Models and memory size
The models listed above range from 13 GB down to 3.2 GB; as the models are more heavily quantized, their size goes down.
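As a rough rule of thumb, the file size of a quantized model scales with the number of bits spent per weight (plus a small overhead for scaling factors and embeddings):

$$\text{model size in bytes} \approx \frac{n_{\text{params}} \times \text{bits per weight}}{8}$$

This is why each step down in quantization shaves gigabytes off the file, and why the 14B model shrinks by more in absolute terms than the 7B model.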
Results from quantitative testing
Once the test runs were completed, all the data were analysed using QlikView.
Accelerating inference
One aspect of quantization is to accelerate inference, making calculations go faster. To what extent is this true, and is more quantization always better for performance?
The chart above shows how long it took to run through all the questions in the MMLU test, with quantization along the X axis and time spent, in minutes, along the Y axis. The quantizations are sorted by size, with IQ1_S being the lowest precision and Q8_0 the highest.
Let's start with the red and blue lines; these are the 7B version of the model with static and weighted quantization.
At a quantization of Q8_0, it takes 41 minutes to run through all the questions, while Q6_K takes 25 minutes; quantization reduced inference time by around 40%. The reason for this is that the test is run on an Nvidia RTX 4060 GPU with 8 GB of VRAM, and only about 88% of the model fits within this VRAM at Q8_0. At Q6_K and lower, the entire model fits inside the available VRAM, and this is what makes the performance difference. Once the entire model fits within the GPU memory, no further performance is gained from additional quantization.
The same thing is true for the 14B model (green line): at Q6_K, 59% of the model fits within the available GPU memory, while smaller sizes fit 100%, bringing inference time down from 172 minutes to around 45 minutes.
While there are some variations in performance below 8-bit, once the model fits within the available VRAM there is no significant performance gain from even lower precision. This is because the GPU does not natively support 4-bit operations; anything smaller than 8 bits, like Q4_K_M, will still be treated as if it were an 8-bit number and take the same computational resources as a real 8-bit number.
Nvidia's new generation of GPUs (Blackwell) has native support for 4- and 6-bit numbers in addition to 8, 16 and 32; once these become mainstream, there will be further performance gains to be had with lower-precision quantization. Until then, the learning is:
For best performance, pick a quantization that reduces the size of the model to fit within your GPU's VRAM, but don't pick anything smaller, unless you need the VRAM for something else.
Alternatively, if performance really is an issue, go for a model with fewer parameters.
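As a sketch, the selection rule amounts to picking the highest-precision quantization whose file fits in the VRAM budget. The sizes and the overhead margin below are placeholder assumptions; check the actual GGUF file sizes for your model and leave room for the KV cache.

```python
# Sketch of the rule of thumb above: pick the highest-precision quantization whose
# file fits in VRAM, leaving headroom for the KV cache and other inference buffers.
def pick_quantization(quant_sizes_gb: dict[str, float], vram_gb: float,
                      overhead_gb: float = 1.5) -> str:
    budget = vram_gb - overhead_gb   # overhead_gb is a rough assumption, tune per setup
    fitting = {quant: size for quant, size in quant_sizes_gb.items() if size <= budget}
    if not fitting:
        raise ValueError("Nothing fits; consider a model with fewer parameters")
    # The largest file that still fits is the highest-precision option available.
    return max(fitting, key=fitting.get)

# Example with placeholder file sizes (GB) on an 8 GB GPU:
# pick_quantization({"Q8_0": 8.1, "Q6_K": 6.3, "Q4_K_M": 4.7}, vram_gb=8.0)
```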
Ability to provide correct answers
Next, we look at the model's ability to recall its training data and answer the MMLU questions correctly.
One of the first things we see here is that static quantization (red) performs noticeably worse than weighted/iMatrix quantization (blue) at the same level of quantization.
This first learning is easy:
Weighted/iMatrix quantization is preferred over static quantization, as more accuracy is preserved at the same level of quantization.
What is also clear is that quantization has a noticeable impact on the model's ability to recall its training data and provide correct answers. The 7B model at Q8_0 (static) and Q6_K (weighted) achieves around 70% correct answers, while the very same model at IQ2_XXS achieves only 51%; that's a 26% relative decline in accuracy, due to quantization. (IQ1_S performed so badly that the test kept failing and had to be restarted several times, and even then it was only possible to get through 30% of the questions.)
The table above lists the percentage of correct answers from the 7B model at each quantization, along with the accuracy loss relative to Q8_0. The table makes it clear that going with any lower precision than IQ3_XXS or Q3_K_M will introduce a significant loss of accuracy.
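(Accuracy loss relative to Q8_0 here means the relative drop,

$$\text{accuracy loss} = \frac{\text{acc}_{Q8\_0} - \text{acc}_{q}}{\text{acc}_{Q8\_0}} \times 100\%,$$

so the 70% versus 51% example above corresponds to losing roughly a quarter of the baseline accuracy.)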
The same pattern is repeated for the 14B model; the table above shows a significant increase in accuracy loss at quantizations with lower precision than IQ3_XXS.
The model's ability to recall, understand and correctly answer questions decreases significantly as quantization goes below Q3, and this will have a measurable impact on the model's ability to perform.
Ability to follow instructions and answer
In relation to accuracy, measuring the model's ability to provide any answer at all, be it correct or incorrect, is just as important as measuring the number of correct answers.
The number of invalid answers can be used as an indication of the model's ability, or lack thereof, to follow instructions and provide an answer when prompted to do so.
What can be seen is that both the 7B and 14B models are able to follow instructions and answer questions without issues at Q3 precision and higher, but start to have problems around Q2.
IQ1_S was kept out of the above chart, as the model was unable to answer 70% of the questions presented to it.
While 1-2% invalid answers might seem like little in the chart, a closer look at the absolute numbers in the table above shows that it is still quite a lot of questions the LLM is unable to answer at Q2 precision (1% of 14,000+ questions is roughly 140 questions).
There is at least one question in the MMLU dataset that is incorrect, where the provided options don't match the question; that is one of the situations where the LLM might legitimately object and refuse to provide a valid answer. In the case of the failures above, however, the behaviour is more like a malfunction, where the LLM is unable to provide any answer at all other than garbage.
LLMs are never 100% predictable, but the problem with invalid answers is that they introduce failure behaviour that makes it harder to reliably integrate the LLM into other functionality.
The learning from this is therefore:
At a precision lower than Q3, the model starts to have issues following instructions and answering questions, making the model more unpredictable and difficult to use.
Summary of quantitative testing
Already with the information gained from the quantitative testing, we can clearly see a measurable impact of quantization on model accuracy and performance.
The impact of quantization on accuracy is twofold: first, questions that the model used to answer correctly start being answered incorrectly, and second, the model starts to fail to answer at all.
The transition point from small accuracy loss to large happens when going below Q3 quantization.
We have also shown that on current Nvidia hardware, which does not natively support 4-bit calculations, there is no performance gain to be had from quantization below 8 bits, other than reducing memory usage.
When deciding which model and quantization to pick for self-hosting, the following will provide guidance:
For best overall accuracy and performance, pick a quantization level that reduces the LLM model size enough for it to fit within your GPU's VRAM,* but don't pick a smaller size than that, as there are no performance gains to be had, only additional accuracy loss. Most importantly, don't go below Q3 quantization; if performance is an issue, consider a smaller model with fewer parameters instead.
We now know, on a statistical or numerical level, what the impact of quantization is on an LLM. In part two of this article, we will run a number of qualitative tests to make the impact on the model's behaviour even more tangible.
(*VRAM is used for several things during LLM inference, such as the data being processed. See Bringing K/V Context Quantisation to Ollama for a neat estimation tool.)
In part two of this article, we define a set of questions specifically designed to test different aspects of an LLM, and compare the results across quantization levels to determine what the practical quality difference really is.