KV Cache Quantization Benchmarks for Long Context
Tests on Qwen 3.6 27B show why TurboQuant is overrated but saved by TCQ, why q5 deserves more attention, and why symmetric q8 might be a waste of VRAM.
Disclaimer: these are just the results of a few basic KV cache benchmarks, not the whole truth. Longer context should make the differences more pronounced, but that still doesn't guarantee a 1:1 match with real usage. Maybe TurboQuant magic works better at actually getting tool calls right at extra-long context in agentic coding, or maybe it's the other way around. I tried some other tests too, but in terms of meaningful results PPL and KLD are all I've got.
This benchmark started with a narrow question: does TurboQuant-style KV cache compression have any defensible niche in BeeLlama's (my llama.cpp fork) recommendation set, or is q4/q5 strictly better where it fits?
Hardware: one RTX 3090 (24 GB VRAM), Ryzen 7 5700X3D, 32 GB RAM. Model: Qwen 3.6 27B. Q5_K_S weights plus a 64k bf16 KV cache fit on the card; 128k does not, because the weights alone fill too much VRAM. IQ4_XS weights are smaller, so IQ4_XS plus a 128k bf16 KV cache fits. Each context length is capped at the maximum where bf16 KV still runs, because bf16 is the reference every quantized mode is measured against. IQ4_XS at 64k is also included to compare directly against Q5_K_S at the same 64k context, isolating the effect of weight quant on cache quality (§9).
The tok/s column in the benchmark data tables is prefill throughput (batch size 2048, ubatch 256), not generation speed. GPU power was lowered during the runs. These numbers are useful only for relative comparison between cache modes on the same hardware; the absolute values do not represent real-world generation performance.
The tests ran on BeeLlama.cpp v0.1.2, which keeps the normal llama.cpp flow but adds DFlash plus TurboQuant/TCQ KV-cache compression. The TurboQuant implementation being evaluated comes from TheTom's llama-cpp-turboquant, while the TCQ variants come from buun's llama.cpp fork and the accompanying TCQ paper and codebooks.
The PPL numbers look flat across the board. Even turbo modes barely move the average:
| Cache | Size vs bf16 | Q5_K_S 64k | IQ4_XS 64k | IQ4_XS 128k |
|---|---|---|---|---|
| bf16 | 100% | 5.4800 | 5.5169 | 5.2724 |
| q8_0 | 53.1% | 5.4774 | 5.5157 | 5.2716 |
| q5_1 | 37.5% | 5.4777 | 5.5181 | 5.2723 |
| q5_0 | 34.4% | 5.4802 | 5.5175 | 5.2738 |
| q4_1 | 31.3% | 5.4808 | 5.5237 | 5.2772 |
| q4_0 | 28.1% | 5.4877 | 5.5251 | 5.2803 |
| turbo4 | 25.8% | 5.4841 | 5.5277 | 5.2822 |
| turbo3_tcq | 20.3% | 5.5054 | 5.5426 | 5.2985 |
| turbo3 | 19.5% | 5.5149 | 5.5561 | 5.3084 |
| turbo2_tcq | 14.1% | 5.5705 | 5.6085 | 5.3513 |
| turbo2 | 13.3% | 5.6403 | 5.6823 | 5.4287 |
Full PPL results for all cache modes and configurations are in §11.1.
q5_0 uses 34.4% of bf16 KV, a 65.6% memory saving; q4_0 uses 28.1%. Through q4_0, the entire PPL range is about 0.01 in every configuration. Even turbo3_tcq at 20.3% adds only about 0.025 over bf16. turbo2 at 13.3% shows a visible hit, but even that is under 0.17 PPL absolute. If PPL were the whole story, the recommendation would be simple: compress aggressively and move on.
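The size percentages for the stock llama.cpp cache types can be reproduced from the ggml block layouts. This is a minimal sketch, assuming the standard 32-value blocks (an fp16 scale, an optional fp16 minimum, packed quants, plus a 4-byte high-bit field for the q5 types) and a 2-byte bf16 reference. The turbo modes are left out because their layouts are not described here.

```python
# Bytes per 32-value block for the stock ggml cache types (assumed layouts).
BLOCK_VALUES = 32
BLOCK_BYTES = {
    "q4_0": 2 + 16,          # fp16 scale + 32 x 4-bit quants
    "q4_1": 2 + 2 + 16,      # fp16 scale + fp16 min + 32 x 4-bit quants
    "q5_0": 2 + 4 + 16,      # fp16 scale + 32 high bits + 32 x 4 low bits
    "q5_1": 2 + 2 + 4 + 16,  # as q5_0, plus an fp16 min
    "q8_0": 2 + 32,          # fp16 scale + 32 x int8 quants
}
BF16_BYTES_PER_VALUE = 2


def size_vs_bf16(cache_type: str) -> float:
    """Return the cache size as a percentage of a bf16 cache."""
    bytes_per_value = BLOCK_BYTES[cache_type] / BLOCK_VALUES
    return 100.0 * bytes_per_value / BF16_BYTES_PER_VALUE


for name in ("q8_0", "q5_1", "q5_0", "q4_1", "q4_0"):
    pct = size_vs_bf16(name)
    print(f"{name}: {pct:.3f}% of bf16, saves {100 - pct:.3f}%")
```

The per-block scale is why q4_0 costs 4.5 bits per value, not 4, and why the table's percentages sit slightly above the nominal bit width divided by 16.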
PPL averages over every token equally, so one position that destroys a JSON key or hallucinates a closing brace gets diluted by thousands of unremarkable tokens. The metric that picks up that tail damage is KL divergence against the bf16 baseline, specifically the 99.9% KLD: the 99.9th percentile of per-position KLD, which characterizes the worst 0.1% of positions. The 99.9% precision column converts it to a percentage: 100 * exp(-(quantKLD - bf16KLD)).
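As a quick check of the precision formula, the sketch below applies it to a few 99.9% KLD values copied from the Q5_K_S 64k table. The function name is illustrative and not part of any tool.

```python
import math


def kld_precision(quant_kld: float, bf16_kld: float) -> float:
    """Precision in percent: 100 * exp(-(quantKLD - bf16KLD))."""
    return 100.0 * math.exp(-(quant_kld - bf16_kld))


BF16_P999 = 0.023258  # bf16 row, 99.9% KLD, Q5_K_S 64k

for name, kld in (("q8_0", 0.078709), ("q4_0", 0.130419), ("turbo2", 0.903576)):
    print(f"{name}: {kld_precision(kld, BF16_P999):.1f}%")
```

Subtracting the bf16 row's own KLD means the baseline scores exactly 100%, so the column reports only the extra divergence a cache mode adds.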
The Q5_K_S 64k KL table shows what PPL hides. Mean precision stays above 99% for almost every mode, but the 99.9% column tells a different story:
| Cache | Size vs bf16 | Mean KLD | 99% KLD | 99.9% KLD | 99.9% prec. | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 0.005234 | 0.023258 | 100.0% | 850.81 |
| q8_0 | 53.1% | 0.002328 | 0.019669 | 0.078709 | 94.6% | 851.11 |
| q8_0-q5_1 | 45.3% | 0.002529 | 0.020346 | 0.082880 | 94.2% | 828.63 |
| q8_0-q5_0 | 43.8% | 0.002656 | 0.021039 | 0.088486 | 93.7% | 847.33 |
| q8_0-q4_1 | 42.2% | 0.003080 | 0.023390 | 0.099080 | 92.7% | 786.54 |
| q8_0-q4_0 | 40.6% | 0.003316 | 0.024892 | 0.104680 | 92.2% | 849.37 |
| q5_1 | 37.5% | 0.002911 | 0.022604 | 0.098354 | 92.8% | 841.65 |
| q8_0-turbo3_tcq | 36.7% | 0.005090 | 0.037056 | 0.149387 | 88.2% | 817.57 |
| q5_0 | 34.4% | 0.003206 | 0.022759 | 0.099073 | 92.7% | 849.79 |
| q5_1-q4_1 | 34.4% | 0.003380 | 0.025886 | 0.095011 | 93.1% | 846.27 |
| q5_0-q4_1 | 32.8% | 0.003471 | 0.025829 | 0.099618 | 92.6% | 847.59 |
| q5_1-q4_0 | 32.8% | 0.003626 | 0.025668 | 0.108649 | 91.8% | 846.91 |
| q4_1 | 31.3% | 0.004476 | 0.031166 | 0.141813 | 88.8% | 854.33 |
| q5_0-q4_0 | 31.3% | 0.003581 | 0.027423 | 0.113332 | 91.4% | 847.64 |
| q5_1-turbo3_tcq | 28.9% | 0.005594 | 0.038175 | 0.144591 | 88.6% | 816.05 |
| q4_0 | 28.1% | 0.004711 | 0.033663 | 0.130419 | 89.8% | 855.08 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 0.038214 | 0.158514 | 87.3% | 815.80 |
| q5_0-turbo3 | 27.0% | 0.007097 | 0.048761 | 0.192428 | 84.4% | 837.90 |
| q4_1-turbo3_tcq | 25.8% | 0.006184 | 0.042997 | 0.174831 | 85.9% | 816.95 |
| turbo4 | 25.8% | 0.004760 | 0.035205 | 0.138370 | 89.1% | 705.32 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 0.045421 | 0.186572 | 84.9% | 821.89 |
| q4_0-turbo3 | 23.8% | 0.008235 | 0.056527 | 0.222154 | 82.0% | 839.29 |
| q4_0-turbo2_tcq | 21.1% | 0.015168 | 0.105461 | 0.395244 | 68.9% | 826.07 |
| turbo3_tcq | 20.3% | 0.007978 | 0.058286 | 0.227104 | 81.6% | 795.20 |
| turbo3 | 19.5% | 0.011181 | 0.082015 | 0.296060 | 76.1% | 836.75 |
| turbo3_tcq-turbo2_tcq | 17.2% | 0.016386 | 0.115072 | 0.437043 | 66.1% | 796.16 |
| turbo3-turbo2 | 16.4% | 0.023985 | 0.168258 | 0.605087 | 55.9% | 831.88 |
| turbo2_tcq | 14.1% | 0.023073 | 0.170350 | 0.632401 | 54.4% | 807.25 |
| turbo2 | 13.3% | 0.036230 | 0.276438 | 0.903576 | 41.5% | 842.29 |
Full KL tables for all three configurations are in §11.2.
The mean KLD column still looks reasonable: most modes stay below 0.01. But the 99.9% column diverges sharply. q5_0 at 34.4% of bf16 KV has a 99.9% KLD of 0.099, which is 31× its mean. q4_0 at 28.1% jumps to 0.130, which looks small as a number but is a 32% increase over q5_0 in the tail, the kind of shift that can plausibly break tool calls. Below q4_0, turbo modes fall off a cliff: turbo3_tcq at 20.3% of bf16 KV reaches 0.227 in the tail, and turbo2 at 13.3% hits 0.904, so roughly one full nat of divergence at the worst one-in-a-thousand positions.
The size column makes the comparison direct. q4_0 saves 71.9% of bf16 KV and keeps 89.8% of 99.9% precision. turbo3_tcq saves 79.7% and keeps 81.6%. turbo2 saves 86.7% and collapses to 41.5%. The PPL table suggested turbo3_tcq was fine; the tail says it is not, unless you genuinely need the memory and can accept the risk. turbo2 is only viable for tasks where exact structure does not matter.
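To see how a mean can stay small while the tail blows up, here is a toy sketch with synthetic logits. The data is random, not taken from the benchmark, and the damage pattern (heavy noise at 1 in 500 positions) is purely an assumption for illustration.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


rng = random.Random(0)
VOCAB, POSITIONS = 16, 5000
klds = []
for pos in range(POSITIONS):
    ref = [rng.gauss(0.0, 2.0) for _ in range(VOCAB)]
    # Synthetic damage: tiny noise almost everywhere, heavy noise at 1 in 500 positions.
    noise = 1.5 if pos % 500 == 0 else 0.02
    quant = [v + rng.gauss(0.0, noise) for v in ref]
    klds.append(kl_divergence(softmax(ref), softmax(quant)))

mean_kld = sum(klds) / len(klds)
p999_kld = percentile(klds, 99.9)
print(f"mean={mean_kld:.6f}  99.9%={p999_kld:.6f}  ratio={p999_kld / mean_kld:.0f}x")
```

The handful of damaged positions barely moves the mean, yet they define the 99.9th percentile, which is the same shape the benchmark tables show.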
The q4/q5/q8 modes in these results are not naive scalar quantization. llama.cpp applies a random rotation to KV cache vectors before quantizing them — the same basic trick TurboQuant uses. The rotation spreads outlier energy across coordinates, so the scalar quantizer faces a more uniform distribution and makes fewer catastrophic rounding decisions on extreme values. The difference is what happens after rotation: q4/q5 quantize each value independently to 4–5 bits with a scalar codebook, while TurboQuant quantizes to 2–3 bits with a scalar codebook and optionally adds a QJL residual for inner-product estimation. TCQ (§4) constrains index sequences through a trellis instead.
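The effect of rotating before scalar quantization can be shown on a single block. This is a toy sketch: it uses a sign-flipped Walsh-Hadamard transform and a plain 4-bit absmax quantizer as stand-ins, and neither is claimed to match the actual llama.cpp or TurboQuant kernels.

```python
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return [x / n ** 0.5 for x in v]


def quant_dequant(v, bits=4):
    """Toy absmax scalar quantizer: one shared scale per block."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / qmax
    return [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in v]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
block = [rng.choice((-1.0, 1.0)) for _ in range(32)]
block[5] = 15.0  # a single outlier dominates the block's scale

# Random sign flips followed by the Hadamard transform act as the rotation.
signs = [rng.choice((-1.0, 1.0)) for _ in range(32)]

plain = quant_dequant(block)

rotated = fwht([s * x for s, x in zip(signs, block)])
# The orthonormal transform is its own inverse, so undo it and then the signs.
restored = [s * x for s, x in zip(signs, fwht(quant_dequant(rotated)))]

plain_mse = mse(block, plain)
rotated_mse = mse(block, restored)
print(f"plain MSE={plain_mse:.4f}  rotated MSE={rotated_mse:.4f}")
```

Without rotation the outlier sets the scale and every ordinary value rounds to zero. After rotation the outlier's energy is spread across all 32 coordinates, so the shared scale shrinks and the ordinary values survive.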
TurboQuant is also slower. turbo4 runs at ~705 tok/s (prefill) versus ~850 tok/s for q4_0 on the Q5_K_S config, a 17% throughput penalty. turbo3_tcq runs at ~647–844 tok/s across the three configs, slower than q4_0 in every case. The rotation and QJL stages add compute that scalar quantization avoids.
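The relative penalty is simple arithmetic on the prefill numbers. The sketch below uses the tok/s values from the full PPL tables; the helper name is illustrative.

```python
def slowdown_pct(mode_tps: float, baseline_tps: float) -> float:
    """Throughput lost relative to the baseline, in percent."""
    return 100.0 * (1.0 - mode_tps / baseline_tps)


# Prefill tok/s copied from the full PPL tables.
RUNS = {
    "Q5_K_S 64k": {"q4_0": 853.50, "turbo4": 705.06, "turbo3_tcq": 794.21},
    "IQ4_XS 64k": {"q4_0": 912.54, "turbo4": 746.00, "turbo3_tcq": 844.33},
    "IQ4_XS 128k": {"q4_0": 709.40, "turbo4": 520.46, "turbo3_tcq": 647.40},
}

for config, tps in RUNS.items():
    for mode in ("turbo4", "turbo3_tcq"):
        loss = slowdown_pct(tps[mode], tps["q4_0"])
        print(f"{config}: {mode} is {loss:.1f}% slower than q4_0")
```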
The rotation overlap means TurboQuant is not competing against a helpless baseline. It is competing against q4/q5 modes that already benefit from the same outlier smoothing. At 2–3 bits, the scalar codebook is starved enough that rotation alone cannot save you, and TurboQuant's extra structure matters. At 4 bits, turbo4 has no quality advantage over q4_0, saves almost no memory (25.8% vs 28.1% of bf16 KV), and runs slower. The value in TurboQuant is at the low bit rates where q4/q5 cannot reach at all.
TCQ changes the low-bit side of the table. buun's paper ("Closing the Gap: Trellis-Coded Quantization for KV Cache at 2–3 Bits") describes TCQ as the first application of trellis-coded quantization to LLM KV cache compression. Instead of choosing each scalar index independently, TCQ constrains index sequences through a finite-state trellis, which gives a much larger effective codebook at the same bit rate. The encoder uses Viterbi to find a globally optimal assignment, while the decode path stays parallel: each value can be decoded in O(1), which is why the method can still fit GPU flash attention kernels.
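The encode and decode asymmetry can be sketched with a generic bitshift-style trellis, where the state is simply the last few emitted bits. This is an illustration under that assumption only: the codebook, state width, and helper names are hypothetical, and nothing here reproduces buun's codebooks or kernels.

```python
def step(state, symbol, state_bits, k):
    """Shift k new bits into the state register."""
    return ((state << k) | symbol) & ((1 << state_bits) - 1)


def decode(symbols, codebook, state_bits, k):
    """Sequential reference decoder, starting from state 0."""
    state, out = 0, []
    for sym in symbols:
        state = step(state, sym, state_bits, k)
        out.append(codebook[state])
    return out


def decode_at(symbols, t, codebook, state_bits, k):
    """Decode value t alone from a fixed-size window of symbols."""
    state = 0
    for sym in symbols[max(0, t + 1 - state_bits // k): t + 1]:
        state = step(state, sym, state_bits, k)
    return codebook[state]


def viterbi_encode(x, codebook, state_bits, k):
    """Find the symbol sequence with the lowest total squared error."""
    n_states = 1 << state_bits
    inf = float("inf")
    cost = [inf] * n_states
    cost[0] = 0.0  # encoder and decoder both start in state 0
    history = []
    for xt in x:
        new_cost, prev = [inf] * n_states, [0] * n_states
        for s in range(n_states):
            if cost[s] == inf:
                continue
            for sym in range(1 << k):
                ns = step(s, sym, state_bits, k)
                c = cost[s] + (xt - codebook[ns]) ** 2
                if c < new_cost[ns]:
                    new_cost[ns], prev[ns] = c, s
        history.append(prev)
        cost = new_cost
    state = min(range(n_states), key=lambda s: cost[s])
    total, symbols = cost[state], []
    for prev in reversed(history):
        symbols.append(state & ((1 << k) - 1))  # low k bits are the emitted symbol
        state = prev[state]
    symbols.reverse()
    return symbols, total


def greedy_encode(x, codebook, state_bits, k):
    """Pick the locally best symbol at each step (a valid but suboptimal path)."""
    state, symbols, total = 0, [], 0.0
    for xt in x:
        sym = min(range(1 << k),
                  key=lambda b: (xt - codebook[step(state, b, state_bits, k)]) ** 2)
        state = step(state, sym, state_bits, k)
        symbols.append(sym)
        total += (xt - codebook[state]) ** 2
    return symbols, total


STATE_BITS, K = 4, 2  # 2 bits stored per value, 16 reachable codebook entries
CODEBOOK = [((7 * s) % 16 - 7.5) / 4.0 for s in range(16)]  # hypothetical values
x = [0.9, -1.3, 0.2, 1.7, -0.4, -1.8, 0.6, 0.1, -0.9, 1.2]

symbols, total = viterbi_encode(x, CODEBOOK, STATE_BITS, K)
_, greedy_total = greedy_encode(x, CODEBOOK, STATE_BITS, K)
recon = decode(symbols, CODEBOOK, STATE_BITS, K)
print(f"viterbi error={total:.4f}  greedy error={greedy_total:.4f}")
```

Each value stores only 2 bits but is reconstructed from a 16-entry codebook, and `decode_at` needs just a two-symbol window, so every position can be decoded independently.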
The reported TCQ results claim 10–44% KL-divergence reduction over scalar quantization at 2–3 bits per value, with context-adaptive norm scaling and FWHT rotation plus random sign flips. That matches what this BeeLlama bench sees directionally: turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than plain turbo2, especially in the 99.9% tail.
There is no turbo4_tcq in this table, and that makes sense. TCQ is most valuable where independent scalar codes are starved for bits. At 2–3 bits, the larger effective codebook can close a visible gap. At 4 bits, the ordinary scalar codebook already has enough resolution, and the extra trellis decode overhead is not worth paying for marginal quality gains on top of a mode that is already uncompetitive with q4_0.
The asymmetric rows are the most useful part of the final report. They confirm the K-first intuition, but with a limit.
At 31.3% of bf16 KV, q5_0-q4_0 (K at q5_0, V at q4_0) has the same size as symmetric q4_1, yet it beats q4_1 across all three 99.9% precision tables: 91.4% vs 88.8% in Q5_K_S 64k, 96.1% vs 93.7% in IQ4_XS 64k, and 96.2% vs 95.0% in IQ4_XS 128k. Spending the same budget of bits on a stronger K and weaker V outperforms splitting them evenly.
q5_0-q4_0 is the clean trade at that size: same footprint as q4_1, but better tail behavior from spending bits asymmetrically toward K. One step up, q5_0-q4_1 costs 32.8% of bf16 KV, only 1.5 points more than q5_0-q4_0, and improves the 99.9% tail to 92.6%, 96.2%, and 96.5% across the three KL tables. That is not a huge jump, but it is cheap.
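The asymmetric sizes are just the average of the two symmetric sizes. The sketch below assumes K and V hold the same number of elements, which the tables are consistent with, and uses the symmetric MiB values from the Q5_K_S 64k runs.

```python
# Symmetric KV cache sizes in MiB at 64k context, copied from the tables.
SYMMETRIC_MIB = {
    "bf16": 4096, "q8_0": 2176, "q5_1": 1536, "q5_0": 1408, "q4_1": 1280,
    "q4_0": 1152, "turbo3_tcq": 832, "turbo3": 800, "turbo2_tcq": 576, "turbo2": 544,
}


def kv_size_mib(k_type: str, v_type: str) -> float:
    """Size of a K/V pair, assuming K and V each take half of the cache."""
    return (SYMMETRIC_MIB[k_type] + SYMMETRIC_MIB[v_type]) / 2


for k_type, v_type in (("q5_0", "q4_0"), ("q5_0", "q4_1"), ("q8_0", "q5_0")):
    size = kv_size_mib(k_type, v_type)
    share = 100.0 * size / SYMMETRIC_MIB["bf16"]
    print(f"{k_type}-{v_type}: {size:.0f} MiB ({share:.2f}% of bf16)")
```

This is why q5_0-q4_0 and symmetric q4_1 land on the same 1280 MiB: half a bit more per K value is paid for by half a bit less per V value.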
The direct comparison between q5_0-q4_1 and q5_1-q4_0 answers the next-bit question. Both are the same size. q5_0-q4_1 is better in Q5_K_S and IQ4_XS 128k, while IQ4_XS 64k is effectively tied in 99.9% precision. After K reaches q5_0, the next useful bit appears to go to V, not to q5_1 K.
The K-first ladder that falls out of these comparisons, for VRAM-constrained setups in particular:
| Role | K / V |
|---|---|
| Reference | bf16 / bf16 |
| Highest fidelity | q8_0 / q5_0 |
| High quality | q5_0 / q5_0 |
| Balanced | q5_0 / q4_1 |
| Good default | q5_0 / q4_0 |
| Memory saver | q4_0 / q4_0 |
q5_1 remains useful as an extra-conservative option, but it is not where the marginal value is.
The bench is harsh on turbo modes, but it does not invalidate them. I'd say it just clearly defines their niche, albeit one that is much narrower than I expected.
turbo4 is the easiest mode to drop from recommendations. At IQ4_XS 128k, it saves only 192 MiB over q4_0, 2112 MiB vs 2304 MiB, while throughput drops from 710.51 tok/s to 519.76 tok/s and 99.9% precision falls from 94.3% to 93.1%. That is not a good exchange.
Plain turbo3 and turbo2 are also hard to recommend when TCQ variants are available. turbo3_tcq beats turbo3 clearly in the tail, and turbo2_tcq is much better than turbo2. Hardware support matters here: if a backend does not support the TCQ variants, users have to choose between paying more VRAM for q4/q5 or accepting the quality loss from the plain turbo modes. But that should be a documented fallback, not the main path.
The valuable turbo result is turbo3_tcq. It lands at 20.3% of bf16 KV, with 99.9% precision of 81.6%, 84.5%, and 86.1% across the three KL tables. That is not q4 quality, and it should not be sold as such. But it is a real precision/VRAM/speed compromise for users who need the context to fit.
turbo2_tcq is the last resort. It keeps broad PPL better than the tail suggests, but the 99.9% precision values are 54.4%, 58.6%, and 63.5%. That is a mode for rough summarization, long-context reading where exactness is not the job, or setups that otherwise cannot run. It is not a tool-call, JSON, code-edit, or math mode.
There is also a small middle rung: q4_0-turbo3_tcq. It uses 24.2% of bf16 KV and sits between q4_0 and symmetric turbo3_tcq in the 99.9% tail. It too is not a precision preset, but it's useful for users who need something smaller than q4 and cleaner than full turbo3_tcq.
The added q8 asymmetric rows do not bring q8 back as the normal recommendation. They create a proper fidelity tier.
q8_0-q5_0 at 43.8% of bf16 KV keeps 99.9% precision at 93.7%, 98.2%, and 98.0% across the three KL tables. q8_0-q5_1 is a little better at 45.3%, but the gain is small: 94.2%, 98.3%, and 98.1%. Full q8_0/q8_0 sits at 53.1% and is mostly useful for validation and blame isolation.
q8_0-q4_1 is more interesting technically than practically. Its 99.9% precision is close to q5_1 in the table, which reinforces that K precision matters more than V. But it is bigger than q5_1, and its Q5_K_S speed is anomalous at 786.54 tok/s. Other asymmetric pairs at similar sizes show normal throughput, which points to an external issue, but the IQ4_XS 64k run shows the same dip for this pair (818.78 tok/s against roughly 906 tok/s for its neighbors), so a slowdown specific to this K/V combination cannot be ruled out.
The old recommendation set treated TurboQuant too broadly. The new one should separate ordinary precision modes, fidelity modes, and long-context survival modes.
| Preset | K / V | What it is for |
|---|---|---|
| reference | bf16 / bf16 | Baseline comparison |
| fidelity-max | q8_0 / q8_0 | Maximum compressed fidelity |
| fidelity-safe | q8_0 / q5_1 | Slightly more conservative fidelity |
| fidelity | q8_0 / q5_0 | Very low KV impact without full q8 size |
| quality | q5_0 / q5_0 | Great precision/size preset |
| balanced-safe | q5_0 / q4_1 | Best near-quality size trade |
| balanced | q5_0 / q4_0 | Serious default |
| memory | q4_0 / q4_0 | Normal memory saver |
| long-safe | q4_0 / turbo3_tcq | Smaller than q4, cleaner than turbo3_tcq |
| long | turbo3_tcq / turbo3_tcq | Compact long-context mode |
| emergency | turbo2_tcq / turbo2_tcq | Last resort |
The modes I would not present as first-class recommendations are q4_1/q4_1, q5_1/q4_1, q5_1/q4_0, q8_0/q4_0, q8_0/turbo3_tcq, q5_0/turbo3_tcq, q5_1/turbo3_tcq, q4_1/turbo3_tcq, q4_0/turbo2_tcq, plain turbo3, plain turbo2, and turbo4 unless its speed/implementation profile changes.
That is not a rejection of TurboQuant, but a better boundary. If q5 or q4 fits, use q5 or q4. If the job is strict and tool-heavy, prefer q5_0. If the job is memory-bound but still ordinary, use q5_0/q4_0 or q5_0/q4_1. If the job cannot fit without deeper KV compression, that is where turbo3_tcq earns its place. If even that does not fit, turbo2_tcq exists, but the user should know what they are giving up.
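The "use the best mode that fits" rule can be written as a small lookup. This is a sketch only: the size fractions come from the tables above, the function name is hypothetical, and it deliberately ignores the workload side of the decision (for example, preferring q5_0 for strict tool-heavy jobs).

```python
# (preset name, KV size as a fraction of bf16), ordered from highest fidelity down.
PRESETS = [
    ("reference", 1.000), ("fidelity-max", 0.531), ("fidelity-safe", 0.453),
    ("fidelity", 0.438), ("quality", 0.344), ("balanced-safe", 0.328),
    ("balanced", 0.313), ("memory", 0.281), ("long-safe", 0.242),
    ("long", 0.203), ("emergency", 0.141),
]


def pick_preset(bf16_kv_mib: float, budget_mib: float):
    """Return the highest-fidelity preset whose KV cache fits the budget, or None."""
    for name, fraction in PRESETS:
        if bf16_kv_mib * fraction <= budget_mib:
            return name
    return None


# 128k context on this model needs 8192 MiB of bf16 KV; suppose 2400 MiB is free.
print(pick_preset(8192, 2400))
```

A `None` result means even turbo2_tcq does not fit, at which point the only options left are a shorter context or smaller weights.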
The benchmark does not show "Q beats TQ" or "TQ beats Q." It shows that the line between them has moved. Rotation made ordinary q4/q5 quite decent, TCQ makes very low-bit turbo modes much more viable than plain turbo2/turbo3, and the user wins because there's a somewhat clear preset ladder with obvious trade-offs.
The bench ran two weight quantizations: Q5_K_S and IQ4_XS. Same model, same context lengths where they overlap, same cache modes. That makes weight quantization a controlled variable, one that most KV cache benchmarks do not isolate. They should: weight precision changes how much cache quantization costs in the tail.
At 64k context, almost every symmetric cache mode lands roughly 3–5 percentage points higher in 99.9% precision on IQ4_XS than on Q5_K_S; turbo2 is the only exception:
| Cache | Q5_K_S 99.9% prec. | IQ4_XS 99.9% prec. | Gap |
|---|---|---|---|
| q8_0 | 94.6% | 98.7% | +4.1 |
| q5_0 | 92.7% | 97.3% | +4.6 |
| q4_0 | 89.8% | 93.0% | +3.2 |
| turbo4 | 89.1% | 93.0% | +3.9 |
| turbo3_tcq | 81.6% | 84.5% | +2.9 |
| turbo3 | 76.1% | 79.3% | +3.2 |
| turbo2_tcq | 54.4% | 58.6% | +4.2 |
| turbo2 | 41.5% | 41.0% | -0.5 |
IQ4_XS is consistently less damaged by the same cache quant. The gap exists across almost every mode and only collapses at turbo2, where both weight quants are already on the floor.
The raw KLD numbers show the same thing more sharply. Q5_K_S q8_0 has 99.9% KLD of 0.078709; IQ4_XS q8_0 has 0.017372. That is a 4.5× difference. At q4_0 the ratio drops to 1.7× (0.130419 vs 0.076663). The gap compresses as cache quant gets more aggressive, because the cache quantization noise starts to dominate over the weight quantization difference.
This looks backwards at first glance. Q5_K_S preserves more weight precision. The attention computation is more accurate. Shouldn't stricter weights make the model more robust to cache quantization, not less?
The answer is in what KL divergence measures here: the shift in output distribution when you change only the KV cache, holding everything else constant. Q5_K_S produces richer KV distributions. The attention scores carry more fine-grained detail because the weight matrices they multiply through are more precise. Those richer KV vectors have more tail-relevant information: structurally important tokens, sharp probability differences, outlier activations that carry real signal. When you quantize those KV values, you lose more because there is more to lose. The perturbation is larger relative to the information content of the uncompressed cache.
IQ4_XS already injects weight quantization noise into the attention path. The KV distributions it produces are smoother, carry less fine detail, and have shallower tails because the weight quantization already smoothed out some of the sharper features. Cache quantization on top of that does less incremental damage because there is less fragile detail left to damage.
This is not "loose weights + loose cache is better than strict weights + loose cache." The absolute output quality of Q5_K_S with bf16 KV is better than IQ4_XS with bf16 KV. But KL divergence is measured against each model's own bf16 baseline, not against an external ground truth. The metric captures incremental cost, not absolute quality. A Q5_K_S model at q4_0 cache has moved further from its own potential than an IQ4_XS model at q4_0 cache has moved from its (lower) potential.
The practical resolution: the same cache preset is tail-safer on a lower-weight-precision model than on a higher-weight-precision model. If you are running Q5_K_S, you should lean harder toward q5_0 over q4_0 than you would on IQ4_XS. The preset ladder in §8 does not account for this interaction, and it should: Q5_K_S users should treat balanced (q5_0/q4_0) as their floor, while IQ4_XS users can safely step down to memory (q4_0/q4_0) without incurring the same tail cost.
Wikitext PPL and KL divergence are the only benchmarks that made it into this article, but several other tests were tried and abandoned because they could not distinguish between cache quantization modes.
Multiple-choice benchmarks (ARC-Challenge, ARC-Easy, HellaSwag, MMLU) all run at short context, well under 10K tokens. The KV cache is tiny at those lengths, and every cache mode scores within noise of every other mode. They measure whether the model knows things, not whether quantized cache degrades retrieval or reasoning. The spread across cache types was indistinguishable from run-to-run variance.
Perplexity-based HellaSwag and Winogrande have the same problem: they do not exercise the KV cache enough to show a difference. They confirm that the model still speaks English after cache quantization, which was never in doubt.
Synthetic passkey and needle-in-a-haystack retrieval were tried at 32K context. Every cache type, including turbo2, scored 100%. The model can regurgitate a hidden string just fine even with aggressively quantized cache, because retrieval of a single attended token is a different failure mode than slowly diverging output distributions over thousands of tokens. A 30-question passkey test does not have the statistical power to catch what KL divergence catches.
JSON schema generation (structured output constrained by a schema) also hit 100% for all cache types. Cache quantization does not break template compliance at the scale tested.
Native passkey via llama-passkey was excluded because it requires the model's native context length of 262144 tokens, which OOMs on 24 GB VRAM and has no --ctx-size override.
AIME (30 mathematics competition problems) is the only discarded benchmark that could show a difference. The problem is cost: on an RTX 3090, one pass through the 30 problems for a single cache configuration would take roughly 3–4 days of non-stop inference. I committed to it anyway and ran bf16, q8_0, and q4_0 before realizing that a single pass is nearly useless: 30 questions is wildly noisy for a binary measure of correctness, with q8_0 and q4_0 getting the same result. You need dozens of runs to average out the variance, and by the time that finishes, AGI will have already been invented and the whole topic of local inference optimization will be moot.
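The noise problem can be put in numbers with a normal approximation to the binomial. This is a rough sketch, assuming each question is an independent pass/fail trial at the worst-case accuracy of 50% and that repeated passes count as fresh samples; the helper names are illustrative.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile


def half_width(n_questions: int, accuracy: float = 0.5) -> float:
    """95% confidence half-width for an accuracy measured on n questions."""
    return Z_95 * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def passes_needed(target_half_width: float, questions_per_pass: int = 30,
                  accuracy: float = 0.5) -> int:
    """Passes required to shrink the 95% half-width to the target."""
    total = (Z_95 / target_half_width) ** 2 * accuracy * (1.0 - accuracy)
    return math.ceil(total / questions_per_pass)


print(f"one pass of 30 questions: +/-{100 * half_width(30):.1f} points")
print(f"passes for +/-5 points: {passes_needed(0.05)}")
print(f"passes for +/-3 points: {passes_needed(0.03)}")
```

Under these assumptions a single pass cannot separate two cache modes unless they differ by well over ten accuracy points, which is consistent with q8_0 and q4_0 landing on the same score.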
PPL results, Q5_K_S weights, 64k context:

| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 5.4800 | 100.0% | 0.03465 | 851.75 | 326.30 |
| q8_0 | 2176.00 | 53.1% | 5.4774 | 100.0% | 0.03465 | 851.57 | 331.27 |
| q5_1 | 1536.00 | 37.5% | 5.4777 | 100.0% | 0.03464 | 848.27 | 332.64 |
| q5_0 | 1408.00 | 34.4% | 5.4802 | 100.0% | 0.03466 | 848.36 | 332.45 |
| q4_1 | 1280.00 | 31.3% | 5.4808 | 100.0% | 0.03467 | 853.49 | 330.43 |
| q4_0 | 1152.00 | 28.1% | 5.4877 | 99.9% | 0.03473 | 853.50 | 330.76 |
| turbo4 | 1056.00 | 25.8% | 5.4841 | 99.9% | 0.03468 | 705.06 | 395.16 |
| turbo3_tcq | 832.00 | 20.3% | 5.5054 | 99.5% | 0.03480 | 794.21 | 353.31 |
| turbo3 | 800.00 | 19.5% | 5.5149 | 99.4% | 0.03493 | 802.71 | 344.83 |
| turbo2_tcq | 576.00 | 14.1% | 5.5705 | 98.4% | 0.03566 | 805.25 | 348.78 |
| turbo2 | 544.00 | 13.3% | 5.6403 | 97.2% | 0.03581 | 840.34 | 335.07 |
PPL results, IQ4_XS weights, 64k context:

| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 5.5169 | 100.0% | 0.03497 | 909.83 | 336.43 |
| q8_0 | 2176.00 | 53.1% | 5.5157 | 100.0% | 0.03499 | 910.43 | 309.49 |
| q5_1 | 1536.00 | 37.5% | 5.5181 | 100.0% | 0.03501 | 906.38 | 310.47 |
| q5_0 | 1408.00 | 34.4% | 5.5175 | 100.0% | 0.03500 | 906.88 | 310.27 |
| q4_1 | 1280.00 | 31.3% | 5.5237 | 99.9% | 0.03505 | 911.35 | 308.85 |
| q4_0 | 1152.00 | 28.1% | 5.5251 | 99.9% | 0.03505 | 912.54 | 308.61 |
| turbo4 | 1056.00 | 25.8% | 5.5277 | 99.8% | 0.03508 | 746.00 | 372.98 |
| turbo3_tcq | 832.00 | 20.3% | 5.5426 | 99.5% | 0.03513 | 844.33 | 331.64 |
| turbo3 | 800.00 | 19.5% | 5.5561 | 99.3% | 0.03533 | 882.76 | 318.29 |
| turbo2_tcq | 576.00 | 14.1% | 5.6085 | 98.4% | 0.03599 | 858.66 | 326.24 |
| turbo2 | 544.00 | 13.3% | 5.6823 | 97.1% | 0.03621 | 898.66 | 312.75 |
PPL results, IQ4_XS weights, 128k context:

| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|
| bf16 | 8192.00 | 100.0% | 5.2724 | 100.0% | 0.03269 | 703.61 | 389.94 |
| q8_0 | 4352.00 | 53.1% | 5.2716 | 100.0% | 0.03271 | 707.02 | 387.84 |
| q5_1 | 3072.00 | 37.5% | 5.2723 | 100.0% | 0.03271 | 702.53 | 390.33 |
| q5_0 | 2816.00 | 34.4% | 5.2738 | 100.0% | 0.03272 | 703.18 | 390.05 |
| q4_1 | 2560.00 | 31.3% | 5.2772 | 99.9% | 0.03276 | 708.53 | 387.23 |
| q4_0 | 2304.00 | 28.1% | 5.2803 | 99.9% | 0.03276 | 709.40 | 386.58 |
| turbo4 | 2112.00 | 25.8% | 5.2822 | 99.8% | 0.03281 | 520.46 | 520.82 |
| turbo3_tcq | 1664.00 | 20.3% | 5.2985 | 99.5% | 0.03281 | 647.40 | 421.95 |
| turbo3 | 1600.00 | 19.5% | 5.3084 | 99.3% | 0.03301 | 689.73 | 396.86 |
| turbo2_tcq | 1152.00 | 14.1% | 5.3513 | 98.5% | 0.03363 | 655.44 | 416.90 |
| turbo2 | 1088.00 | 13.3% | 5.4287 | 97.1% | 0.03386 | 696.53 | 393.24 |
KL precision uses 100 * exp(-(quantKLD - bf16KLD)). The 99.9% precision column applies the same formula to 99.9% KLD. Symmetric and asymmetric cache configurations are combined in each table; asymmetric cache names are K-V.
KL divergence results, Q5_K_S weights, 64k context:

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 0.000375 | 100.0% | 0.000058 | 0.000568 | 0.001693 | 0.005234 | 0.023258 | 100.0% | 7.374046 | 850.81 | 337.59 |
| q8_0 | 2176.00 | 53.1% | 0.002328 | 99.8% | 0.000125 | 0.004233 | 0.006656 | 0.019669 | 0.078709 | 94.6% | 14.355996 | 851.11 | 337.66 |
| q8_0-q5_1 | 1856.00 | 45.3% | 0.002529 | 99.8% | 0.000143 | 0.004557 | 0.007108 | 0.020346 | 0.082880 | 94.2% | 15.367683 | 828.63 | 346.71 |
| q8_0-q5_0 | 1792.00 | 43.8% | 0.002656 | 99.8% | 0.000168 | 0.004673 | 0.007348 | 0.021039 | 0.088486 | 93.7% | 17.987650 | 847.33 | 338.90 |
| q8_0-q4_1 | 1728.00 | 42.2% | 0.003080 | 99.7% | 0.000115 | 0.005645 | 0.008587 | 0.023390 | 0.099080 | 92.7% | 8.073231 | 786.54 | 364.58 |
| q8_0-q4_0 | 1664.00 | 40.6% | 0.003316 | 99.7% | 0.000165 | 0.005976 | 0.009075 | 0.024892 | 0.104680 | 92.2% | 13.481506 | 849.37 | 338.13 |
| q5_1 | 1536.00 | 37.5% | 0.002911 | 99.7% | 0.000167 | 0.005045 | 0.007916 | 0.022604 | 0.098354 | 92.8% | 13.397068 | 841.65 | 341.63 |
| q8_0-turbo3_tcq | 1504.00 | 36.7% | 0.005090 | 99.5% | 0.000188 | 0.009736 | 0.014401 | 0.037056 | 0.149387 | 88.2% | 20.128752 | 817.57 | 350.23 |
| q5_0 | 1408.00 | 34.4% | 0.003206 | 99.7% | 0.000286 | 0.005232 | 0.008194 | 0.022759 | 0.099073 | 92.7% | 22.619892 | 849.79 | 338.00 |
| q5_1-q4_1 | 1408.00 | 34.4% | 0.003380 | 99.7% | 0.000195 | 0.006140 | 0.009479 | 0.025886 | 0.095011 | 93.1% | 21.394011 | 846.27 | 339.25 |
| q5_0-q4_1 | 1344.00 | 32.8% | 0.003471 | 99.7% | 0.000206 | 0.006310 | 0.009582 | 0.025829 | 0.099618 | 92.6% | 21.863117 | 847.59 | 339.65 |
| q5_1-q4_0 | 1344.00 | 32.8% | 0.003626 | 99.7% | 0.000212 | 0.006441 | 0.009773 | 0.025668 | 0.108649 | 91.8% | 15.809726 | 846.91 | 339.23 |
| q4_1 | 1280.00 | 31.3% | 0.004476 | 99.6% | 0.000267 | 0.007716 | 0.011901 | 0.031166 | 0.141813 | 88.8% | 18.150869 | 854.33 | 336.49 |
| q5_0-q4_0 | 1280.00 | 31.3% | 0.003581 | 99.7% | 0.000174 | 0.006600 | 0.010058 | 0.027423 | 0.113332 | 91.4% | 14.938599 | 847.64 | 338.79 |
| q5_1-turbo3_tcq | 1184.00 | 28.9% | 0.005594 | 99.5% | 0.000291 | 0.010264 | 0.015324 | 0.038175 | 0.144591 | 88.6% | 24.684429 | 816.05 | 350.73 |
| q4_0 | 1152.00 | 28.1% | 0.004711 | 99.6% | 0.000301 | 0.008439 | 0.012949 | 0.033663 | 0.130419 | 89.8% | 21.636135 | 855.08 | 336.11 |
| q5_0-turbo3_tcq | 1120.00 | 27.3% | 0.005471 | 99.5% | 0.000265 | 0.010259 | 0.015229 | 0.038214 | 0.158514 | 87.3% | 22.268801 | 815.80 | 350.94 |
| q5_0-turbo3 | 1104.00 | 27.0% | 0.007097 | 99.3% | 0.000259 | 0.013747 | 0.020259 | 0.048761 | 0.192428 | 84.4% | 18.094296 | 837.90 | 342.47 |
| q4_1-turbo3_tcq | 1056.00 | 25.8% | 0.006184 | 99.4% | 0.000292 | 0.011652 | 0.017320 | 0.042997 | 0.174831 | 85.9% | 25.079035 | 816.95 | 350.43 |
| turbo4 | 1056.00 | 25.8% | 0.004760 | 99.6% | 0.000201 | 0.009046 | 0.013692 | 0.035205 | 0.138370 | 89.1% | 13.967494 | 705.32 | 401.18 |
| q4_0-turbo3_tcq | 992.00 | 24.2% | 0.006269 | 99.4% | 0.000270 | 0.012220 | 0.018173 | 0.045421 | 0.186572 | 84.9% | 23.157375 | 821.89 | 349.67 |
| q4_0-turbo3 | 976.00 | 23.8% | 0.008235 | 99.2% | 0.000336 | 0.015576 | 0.022828 | 0.056527 | 0.222154 | 82.0% | 24.353268 | 839.29 | 341.78 |
| q4_0-turbo2_tcq | 864.00 | 21.1% | 0.015168 | 98.5% | 0.000288 | 0.031826 | 0.045569 | 0.105461 | 0.395244 | 68.9% | 20.743238 | 826.07 | 347.04 |
| turbo3_tcq | 832.00 | 20.3% | 0.007978 | 99.2% | 0.000267 | 0.015663 | 0.023628 | 0.058286 | 0.227104 | 81.6% | 20.517471 | 795.20 | 359.09 |
| turbo3 | 800.00 | 19.5% | 0.011181 | 98.9% | 0.000304 | 0.022805 | 0.034209 | 0.082015 | 0.296060 | 76.1% | 22.977211 | 836.75 | 342.73 |
| turbo3_tcq-turbo2_tcq | 704.00 | 17.2% | 0.016386 | 98.4% | 0.000283 | 0.034186 | 0.049133 | 0.115072 | 0.437043 | 66.1% | 18.275532 | 796.16 | 358.86 |
| turbo3-turbo2 | 672.00 | 16.4% | 0.023985 | 97.7% | 0.000403 | 0.050100 | 0.072850 | 0.168258 | 0.605087 | 55.9% | 20.812553 | 831.88 | 344.85 |
| turbo2_tcq | 576.00 | 14.1% | 0.023073 | 97.8% | 0.000420 | 0.048777 | 0.071865 | 0.170350 | 0.632401 | 54.4% | 24.771320 | 807.25 | 354.12 |
| turbo2 | 544.00 | 13.3% | 0.036230 | 96.5% | 0.000465 | 0.078942 | 0.117545 | 0.276438 | 0.903576 | 41.5% | 26.508263 | 842.29 | 340.66 |
KL divergence results, IQ4_XS weights, 64k context:

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 0.000097 | 100.0% | 0.000020 | 0.000186 | 0.000398 | 0.001062 | 0.004152 | 100.0% | 2.345056 | 909.80 | 315.80 |
| q8_0 | 2176.00 | 53.1% | 0.000577 | 100.0% | 0.000073 | 0.000933 | 0.001428 | 0.004050 | 0.017372 | 98.7% | 8.130807 | 912.71 | 314.76 |
| q8_0-q5_1 | 1856.00 | 45.3% | 0.000836 | 99.9% | 0.000110 | 0.001291 | 0.001929 | 0.005272 | 0.021544 | 98.3% | 11.861989 | 895.23 | 320.91 |
| q8_0-q5_0 | 1792.00 | 43.8% | 0.000881 | 99.9% | 0.000107 | 0.001392 | 0.002092 | 0.005574 | 0.022435 | 98.2% | 10.867614 | 906.00 | 316.81 |
| q8_0-q4_1 | 1728.00 | 42.2% | 0.001317 | 99.9% | 0.000122 | 0.002436 | 0.003572 | 0.008974 | 0.034706 | 97.0% | 6.675264 | 818.78 | 346.30 |
| q8_0-q4_0 | 1664.00 | 40.6% | 0.001606 | 99.8% | 0.000118 | 0.002793 | 0.004084 | 0.009969 | 0.039299 | 96.5% | 8.993986 | 908.09 | 316.08 |
| q5_1 | 1536.00 | 37.5% | 0.001019 | 99.9% | 0.000075 | 0.001787 | 0.002724 | 0.007262 | 0.029854 | 97.5% | 6.707493 | 907.45 | 316.44 |
| q8_0-turbo3_tcq | 1504.00 | 36.7% | 0.003336 | 99.7% | 0.000119 | 0.006580 | 0.009451 | 0.022374 | 0.084818 | 92.3% | 11.130499 | 871.88 | 328.10 |
| q5_0 | 1408.00 | 34.4% | 0.001135 | 99.9% | 0.000088 | 0.002028 | 0.003113 | 0.008113 | 0.031348 | 97.3% | 8.001913 | 908.72 | 315.93 |
| q5_1-q4_1 | 1408.00 | 34.4% | 0.001683 | 99.8% | 0.000140 | 0.002928 | 0.004329 | 0.010956 | 0.038976 | 96.6% | 11.203689 | 906.39 | 316.62 |
| q5_0-q4_1 | 1344.00 | 32.8% | 0.001529 | 99.9% | 0.000037 | 0.003073 | 0.004593 | 0.011795 | 0.042828 | 96.2% | 2.328933 | 905.93 | 316.98 |
| q5_1-q4_0 | 1344.00 | 32.8% | 0.001813 | 99.8% | 0.000163 | 0.003236 | 0.004782 | 0.011742 | 0.042893 | 96.2% | 18.077213 | 905.51 | 317.08 |
| q4_1 | 1280.00 | 31.3% | 0.002316 | 99.8% | 0.000104 | 0.004441 | 0.006776 | 0.016734 | 0.068858 | 93.7% | 8.874204 | 913.32 | 314.50 |
| q5_0-q4_0 | 1280.00 | 31.3% | 0.001936 | 99.8% | 0.000147 | 0.003368 | 0.005013 | 0.012419 | 0.044393 | 96.1% | 14.364779 | 906.57 | 316.61 |
| q5_1-turbo3_tcq | 1184.00 | 28.9% | 0.003560 | 99.7% | 0.000130 | 0.007025 | 0.010176 | 0.024077 | 0.088706 | 91.9% | 10.154081 | 870.78 | 328.52 |
| q4_0 | 1152.00 | 28.1% | 0.002759 | 99.7% | 0.000141 | 0.005219 | 0.007950 | 0.019862 | 0.076663 | 93.0% | 10.045764 | 914.36 | 314.20 |
| q5_0-turbo3_tcq | 1120.00 | 27.3% | 0.003600 | 99.7% | 0.000099 | 0.007198 | 0.010485 | 0.024752 | 0.102109 | 90.7% | 9.033820 | 869.81 | 328.94 |
| q5_0-turbo3 | 1104.00 | 27.0% | 0.005209 | 99.5% | 0.000150 | 0.010602 | 0.015245 | 0.036580 | 0.134359 | 87.8% | 11.506174 | 894.86 | 320.42 |
| q4_1-turbo3_tcq | 1056.00 | 25.8% | 0.004226 | 99.6% | 0.000107 | 0.008465 | 0.012513 | 0.029618 | 0.121854 | 88.9% | 8.723818 | 871.93 | 328.15 |
| turbo4 | 1056.00 | 25.8% | 0.002988 | 99.7% | 0.000130 | 0.005881 | 0.008839 | 0.021868 | 0.076363 | 93.0% | 9.168183 | 744.85 | 379.35 |
| q4_0-turbo3_tcq | 992.00 | 24.2% | 0.004466 | 99.6% | 0.000123 | 0.009097 | 0.013470 | 0.032288 | 0.108662 | 90.1% | 9.383696 | 871.33 | 328.57 |
| q4_0-turbo3 | 976.00 | 23.8% | 0.006007 | 99.4% | 0.000136 | 0.012352 | 0.018044 | 0.042428 | 0.161644 | 85.4% | 11.092446 | 897.34 | 319.56 |
| q4_0-turbo2_tcq | 864.00 | 21.1% | 0.013595 | 98.7% | 0.000204 | 0.028687 | 0.041321 | 0.095393 | 0.367825 | 69.5% | 14.222522 | 881.14 | 325.07 |
| turbo3_tcq | 832.00 | 20.3% | 0.006038 | 99.4% | 0.000127 | 0.012714 | 0.019058 | 0.046219 | 0.172480 | 84.5% | 9.415985 | 845.55 | 337.45 |
| turbo3 | 800.00 | 19.5% | 0.009102 | 99.1% | 0.000164 | 0.019502 | 0.029311 | 0.068266 | 0.236472 | 79.3% | 10.847077 | 894.11 | 320.59 |
| turbo3_tcq-turbo2_tcq | 704.00 | 17.2% | 0.014461 | 98.6% | 0.000165 | 0.031014 | 0.045330 | 0.104559 | 0.374854 | 69.0% | 10.150832 | 847.59 | 336.76 |
| turbo3-turbo2 | 672.00 | 16.4% | 0.022168 | 97.8% | 0.000271 | 0.046698 | 0.068008 | 0.160327 | 0.602649 | 55.0% | 18.985191 | 884.98 | 323.44 |
| turbo2_tcq | 576.00 | 14.1% | 0.020739 | 98.0% | 0.000230 | 0.045497 | 0.068026 | 0.161256 | 0.538190 | 58.6% | 15.582352 | 861.17 | 331.75 |
| turbo2 | 544.00 | 13.3% | 0.034380 | 96.6% | 0.000340 | 0.075876 | 0.113734 | 0.265535 | 0.895385 | 41.0% | 19.482079 | 901.01 | 318.44 |

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bf16 | 8192.00 | 100.0% | 0.000000 | 100.0% | 0.000000 | 0.000015 | 0.000023 | 0.000037 | 0.000051 | 100.0% | 0.000067 | 702.50 | 400.97 |
| q8_0 | 4352.00 | 53.1% | 0.000482 | 100.0% | 0.000007 | 0.000983 | 0.001508 | 0.004061 | 0.014951 | 98.5% | 0.478175 | 708.31 | 397.81 |
| q8_0-q5_1 | 3712.00 | 45.3% | 0.000651 | 99.9% | 0.000012 | 0.001335 | 0.002010 | 0.005161 | 0.018918 | 98.1% | 1.270212 | 694.87 | 405.20 |
| q8_0-q5_0 | 3584.00 | 43.8% | 0.000703 | 99.9% | 0.000013 | 0.001433 | 0.002141 | 0.005523 | 0.020360 | 98.0% | 1.166986 | 702.31 | 400.84 |
| q8_0-q4_1 | 3456.00 | 42.2% | 0.001149 | 99.9% | 0.000012 | 0.002453 | 0.003582 | 0.008568 | 0.029970 | 97.1% | 0.964733 | 637.52 | 440.27 |
| q8_0-q4_0 | 3328.00 | 40.6% | 0.001295 | 99.9% | 0.000016 | 0.002765 | 0.003998 | 0.009587 | 0.035741 | 96.5% | 1.614931 | 706.17 | 398.89 |
| q5_1 | 3072.00 | 37.5% | 0.000827 | 99.9% | 0.000008 | 0.001792 | 0.002687 | 0.006764 | 0.023291 | 97.7% | 0.496846 | 702.81 | 400.71 |
| q8_0-turbo3_tcq | 3008.00 | 36.7% | 0.003167 | 99.7% | 0.000029 | 0.006791 | 0.009860 | 0.022935 | 0.081350 | 92.2% | 2.329764 | 672.90 | 417.21 |
| q5_0 | 2816.00 | 34.4% | 0.000926 | 99.9% | 0.000007 | 0.002037 | 0.003067 | 0.007468 | 0.027410 | 97.3% | 0.427949 | 704.01 | 400.14 |
| q5_1-q4_1 | 2816.00 | 34.4% | 0.001335 | 99.9% | 0.000013 | 0.002884 | 0.004221 | 0.010047 | 0.035062 | 96.6% | 0.850439 | 703.75 | 400.05 |
| q5_0-q4_1 | 2688.00 | 32.8% | 0.001387 | 99.9% | 0.000013 | 0.003017 | 0.004437 | 0.010411 | 0.035706 | 96.5% | 0.971476 | 702.90 | 400.58 |
| q5_1-q4_0 | 2688.00 | 32.8% | 0.001485 | 99.9% | 0.000014 | 0.003200 | 0.004746 | 0.011342 | 0.040530 | 96.0% | 0.976738 | 702.58 | 400.94 |
| q4_1 | 2560.00 | 31.3% | 0.001933 | 99.8% | 0.000013 | 0.004318 | 0.006595 | 0.016046 | 0.050918 | 95.0% | 0.435837 | 709.38 | 397.11 |
| q5_0-q4_0 | 2560.00 | 31.3% | 0.001529 | 99.8% | 0.000016 | 0.003316 | 0.004868 | 0.011640 | 0.039033 | 96.2% | 1.116606 | 704.22 | 399.88 |
| q5_1-turbo3_tcq | 2368.00 | 28.9% | 0.003360 | 99.7% | 0.000029 | 0.007229 | 0.010496 | 0.025104 | 0.089474 | 91.4% | 2.237170 | 670.63 | 418.53 |
| q4_0 | 2304.00 | 28.1% | 0.002259 | 99.8% | 0.000017 | 0.005058 | 0.007697 | 0.018505 | 0.058301 | 94.3% | 1.074671 | 710.51 | 396.57 |
| q5_0-turbo3_tcq | 2240.00 | 27.3% | 0.003391 | 99.7% | 0.000030 | 0.007321 | 0.010567 | 0.024422 | 0.090901 | 91.3% | 2.252987 | 670.54 | 418.68 |
| q5_0-turbo3 | 2208.00 | 27.0% | 0.004728 | 99.5% | 0.000035 | 0.010375 | 0.014767 | 0.034198 | 0.121340 | 88.6% | 1.809964 | 693.49 | 405.71 |
| q4_1-turbo3_tcq | 2112.00 | 25.8% | 0.003981 | 99.6% | 0.000034 | 0.008612 | 0.012701 | 0.030111 | 0.112812 | 89.3% | 2.193686 | 672.84 | 417.30 |
| turbo4 | 2112.00 | 25.8% | 0.002605 | 99.7% | 0.000024 | 0.005701 | 0.008653 | 0.020367 | 0.071902 | 93.1% | 1.179263 | 519.76 | 531.96 |
| q4_0-turbo3_tcq | 1984.00 | 24.2% | 0.004131 | 99.6% | 0.000032 | 0.009062 | 0.013401 | 0.031435 | 0.112297 | 89.4% | 2.139016 | 671.63 | 418.13 |
| q4_0-turbo3 | 1952.00 | 23.8% | 0.005488 | 99.5% | 0.000040 | 0.012073 | 0.017644 | 0.040047 | 0.145158 | 86.5% | 1.545795 | 695.78 | 404.37 |
| q4_0-turbo2_tcq | 1728.00 | 21.1% | 0.013329 | 98.7% | 0.000090 | 0.029805 | 0.042793 | 0.096811 | 0.306630 | 73.6% | 6.779726 | 678.10 | 414.30 |
| turbo3_tcq | 1664.00 | 20.3% | 0.005708 | 99.4% | 0.000045 | 0.012748 | 0.019441 | 0.045320 | 0.150144 | 86.1% | 2.079510 | 647.62 | 432.43 |
| turbo3 | 1600.00 | 19.5% | 0.008334 | 99.2% | 0.000057 | 0.019023 | 0.028596 | 0.066157 | 0.207468 | 81.3% | 2.454834 | 691.41 | 406.63 |
| turbo3_tcq-turbo2_tcq | 1408.00 | 17.2% | 0.014344 | 98.6% | 0.000086 | 0.032243 | 0.046737 | 0.105269 | 0.343951 | 70.9% | 4.010866 | 648.06 | 432.26 |
| turbo3-turbo2 | 1344.00 | 16.4% | 0.020468 | 98.0% | 0.000154 | 0.045514 | 0.065976 | 0.147745 | 0.474415 | 62.2% | 11.938387 | 686.57 | 409.17 |
| turbo2_tcq | 1152.00 | 14.1% | 0.019857 | 98.0% | 0.000122 | 0.045491 | 0.068398 | 0.158766 | 0.453761 | 63.5% | 4.370085 | 656.84 | 426.67 |
| turbo2 | 1088.00 | 13.3% | 0.032631 | 96.8% | 0.000203 | 0.073833 | 0.111765 | 0.261607 | 0.838113 | 43.3% | 4.735642 | 698.21 | 402.96 |
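
One way to read a table like the one above is to ask which cache modes are dominated, meaning some other mode is no larger and has a lower mean KLD. The sketch below does that for a subset of rows copied from the table. Using KV cache size and mean KLD as the only two axes is an assumption made for illustration; the tail percentiles or prefill throughput could change the picture. The `pareto_front` helper is a hypothetical name, not part of any benchmark tooling.

```python
# (cache mode, KV cache MiB, mean KLD), a subset of rows copied from the table above
rows = [
    ("q8_0", 4352.00, 0.000482),
    ("q8_0-q5_1", 3712.00, 0.000651),
    ("q5_1", 3072.00, 0.000827),
    ("q8_0-turbo3_tcq", 3008.00, 0.003167),
    ("q5_0", 2816.00, 0.000926),
    ("q5_1-q4_1", 2816.00, 0.001335),
    ("q4_1", 2560.00, 0.001933),
    ("q5_0-q4_0", 2560.00, 0.001529),
    ("q4_0", 2304.00, 0.002259),
    ("turbo4", 2112.00, 0.002605),
    ("turbo3_tcq", 1664.00, 0.005708),
    ("turbo3", 1600.00, 0.008334),
    ("turbo2_tcq", 1152.00, 0.019857),
    ("turbo2", 1088.00, 0.032631),
]


def pareto_front(rows):
    """Return the rows that no other row beats on both size and mean KLD."""
    front = []
    best_kld = float("inf")
    # Walk from the smallest cache to the largest; ties in size are broken
    # by KLD so the better of two equal-size modes is seen first.
    for name, size, kld in sorted(rows, key=lambda r: (r[1], r[2])):
        # A larger cache only earns its place if it lowers the KLD further.
        if kld < best_kld:
            front.append((name, size, kld))
            best_kld = kld
    return front


if __name__ == "__main__":
    kept = {name for name, _, _ in pareto_front(rows)}
    for name, size, kld in rows:
        status = "front" if name in kept else "dominated"
        print(f"{name:<18} {size:>8.2f} MiB  KLD {kld:.6f}  {status}")
```

On this subset, `q4_1`, `q5_1-q4_1` and `q8_0-turbo3_tcq` drop out: each has a same-size or smaller alternative with a lower mean KLD. Swapping the third tuple field for the 99.9% KLD column would rank the modes by tail behaviour instead.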