Summary of part one
In part one of this article, we tried to answer what accuracy loss from quantization really means, using a quantitative testing approach based on the MMLU benchmark. We found that as precision is reduced, an LLM starts to answer questions incorrectly, even questions it answered correctly at higher precision. The model can also start to have trouble producing an answer at all.
Most importantly, there seems to be a tipping point when going below Q3 precision, where accuracy drops sharply and lower precision quickly loses its appeal. If you have not read part one, I highly recommend starting there.
Based on the learnings from the quantitative testing, we now focus the testing effort on the 7B model alone and on three quantization levels: Q6_K, which is the highest available precision with weighted/iMatrix quantization (equivalent to Q8_0 static); IQ3_XXS, which is close to the tipping point; and finally IQ2_XXS, which is past the tipping point.
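As a side note, the test setup itself is straightforward. Below is a minimal sketch of how the three quantization levels can be loaded with llama-cpp-python; the GGUF file names are hypothetical placeholders, not the actual files used in the tests.

```python
# Minimal sketch: loading the three GGUF quantization levels with
# llama-cpp-python. The file names below are hypothetical placeholders.
from llama_cpp import Llama

QUANT_FILES = {
    "Q6_K":    "model-7b.Q6_K.gguf",
    "IQ3_XXS": "model-7b.IQ3_XXS.gguf",
    "IQ2_XXS": "model-7b.IQ2_XXS.gguf",
}

models = {
    name: Llama(model_path=path, n_ctx=4096, verbose=False)
    for name, path in QUANT_FILES.items()
}
```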
Test strategy for qualitative testing
While quantitative metrics like MMLU (with its 14,000+ questions) show measurable accuracy loss, qualitative testing reveals how quantization affects real-world usability, like empathy in dialogs or nuance in translations. Instead of a large number of test questions, we will therefore focus on fewer questions and look more closely at the models' responses at different levels of quantization.
To do this, I've defined a set of tasks designed to test different aspects of an LLM's capabilities, which will give us a deeper understanding of what accuracy loss looks like in practice.
Summarization
For the summarization task, we use the English abstract of a research paper, 410 words long, and ask the LLM to provide a three-sentence summary highlighting key facts. Each quantization level is asked to generate the summary three times. The answers are then evaluated on the number of key facts included, with one point awarded per key fact. A total of 11 key facts were manually identified in the abstract.
(Full text with abstract and identified key facts will be made available on GitHub.)
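The scoring itself was done manually, but the principle is easy to automate. Here is a rough sketch, assuming each key fact can be reduced to a marker phrase; the phrases below are placeholders, not the actual key facts.

```python
# Rough sketch of key-fact scoring: one point per key fact whose marker
# phrase appears in the summary. Placeholder phrases, not the real 11 facts.
KEY_FACT_MARKERS = [
    "first key fact phrase",
    "second key fact phrase",
    # ... one marker per identified key fact, 11 in total
]

def score_summary(summary: str) -> int:
    text = summary.lower()
    return sum(marker.lower() in text for marker in KEY_FACT_MARKERS)
```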
Summarization Results
All three quantization levels reduce the abstract to roughly 20% of its original length, and on average 60-70% of the key facts are identified and included in the answer.
To show how the summaries look and compare, one answer from each quantization level is presented below with the key facts highlighted.
Q6_K summary:
IQ3_XXS summary:
IQ2_XXS summary:
The three quantization levels perform quite similarly on this task; if anything, IQ2_XXS performs slightly better, which turns out to be something of an outlier once we get to the other tasks. Relevance (the number of identified key facts) is even, and all answers are coherent summaries that are easy to read.
The verdict for the summarization task is therefore that all tested quantization levels perform equally well.
Dialog Generation
For this task we prompt the LLM with "I'm feeling really stressed about my upcoming exam.", three times for each quantization level. We then compare the answers and try to determine whether there are differences in naturalness, contextual consistency and empathy, and whether the advice returned is more or less helpful.
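As a sketch, the three responses per level can be collected like this, assuming the `models` dict from the loading sketch above and a non-zero temperature so the runs actually differ:

```python
# Sketch: collect three dialog responses per quantization level.
# Assumes the `models` dict from the earlier loading sketch.
PROMPT = "I'm feeling really stressed about my upcoming exam."

responses = {
    name: [
        llm.create_chat_completion(
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0.8,  # non-zero so the three runs differ
        )["choices"][0]["message"]["content"]
        for _ in range(3)
    ]
    for name, llm in models.items()
}
```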
Dialog Results
All three quantization levels respond with an acknowledgement that feeling stressed is normal, followed by a list of tips that might help manage the stress, and conclude with another acknowledgement. The basic format of the answers is the same, but the number of tips and their articulation vary. Based on the three answers per quantization level, IQ2_XXS seems to provide shorter answers than Q6_K and IQ3_XXS, which could indicate less helpful answers.
To get an overview and compare the helpfulness of the quantization levels, we map all of the advice in the answers to categories, and then for each category we rank the quantization levels from most to least helpful, taking unique pieces of information and overall impression into account. The most helpful quantization level gets 3 points, the second 2 points, and the last 1 point.
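For illustration, the 3/2/1 tallying can be expressed in a few lines of code; the rankings in the usage comment are hypothetical, not the actual per-category results.

```python
# Sketch of the 3/2/1 point tally. Each ranking lists the quantization
# levels for one advice category, from most to least helpful.
def tally(rankings: list[list[str]]) -> dict[str, int]:
    totals: dict[str, int] = {}
    for ranking in rankings:
        for points, level in zip((3, 2, 1), ranking):
            totals[level] = totals.get(level, 0) + points
    return totals

# Hypothetical example with two categories:
# tally([["Q6_K", "IQ3_XXS", "IQ2_XXS"], ["IQ3_XXS", "Q6_K", "IQ2_XXS"]])
# -> {"Q6_K": 5, "IQ3_XXS": 5, "IQ2_XXS": 2}
```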
Overall, Q6_K and IQ3_XXS provided more advice than IQ2_XXS, and more unique pieces of information per piece of advice. The total scores of Q6_K and IQ3_XXS are almost the same, at 19 and 18 points, while IQ2_XXS scored only 11 points (about 40% less). While there is a subjective element to the scoring, it is still clear that IQ2_XXS provides shorter answers with less information.
Dialog Conclusion
The conclusion for this task is therefore that Q6_K and IQ3_XXS perform equally well, while IQ2_XXS still performs the task but with reduced helpfulness.
Long-Context Coherence, Translation and the overall conclusion follow in part three.