Anbeeld


KV Cache Quantization Benchmarks for Long Context

Tests on Qwen 3.6 27B show why TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM.

  1. The Setup
  2. PPL Hides the Tail, KLD Exposes It
  3. Rotation Changed the Playing Field
  4. What TCQ Adds
  5. Asymmetric KV: Where the Bits Should Go
  6. TurboQuant's Narrowed Niche
  7. The q8 Fidelity Tier
  8. The Preset Ladder
  9. Weight Quant Changes the Cost of Cache Quant
  10. What Else Was Tried
  11. Benchmark Data
    1. Perplexity
    2. KL Divergence
  12. q6_0 Follow-Up Benchmarks
    1. Perplexity
    2. KL Divergence
  13. q6_0 Follow-Up Analysis
    1. The q6 Slot Is q6 K + q5 V
    2. The q5 V Range Is Cheap, But Not Free
    3. turbo4 Still Loses the Tie-Breaker
    4. Updated Preset Ladder
  14. bf16-K Follow-Up Benchmarks
  15. bf16-K Follow-Up Analysis
    1. What The bf16-K Rows Test
    2. q8 V Does Not Earn Full K
    3. q6 V Is The High-End Ceiling
    4. q5 V Is The Practical Middle
    5. q4 V Is The Drop Full K Cannot Hide
    6. turbo V Stays Below The Normal Ladder
    7. What This Changes
  16. bf16 vs f16 Follow-Up Benchmarks
    1. Perplexity
    2. KL Divergence
  17. bf16 vs f16 Follow-Up Analysis
    1. Why The f32 Baseline Matters
    2. bf16 Beats f16 At The Same Size
    3. Precision Wins, Speed Is Mostly A Wash
    4. What This Changes

Disclaimer: these are just the results of a few basic KV cache benchmarks, not the whole truth. Longer context should make the differences more pronounced, but that still doesn't guarantee a 1:1 match with real usage. Maybe TurboQuant magic is better at actually getting tool calls right at extra-long context in agentic coding, or maybe it's the other way around. I tried some other tests too, but in terms of meaningful results, PPL and KLD are all I've got.

1. The Setup

This benchmark started with a narrow question: does TurboQuant-style KV cache compression have any defensible niche in the recommendation set of BeeLlama (my llama.cpp fork), or is q4/q5 strictly better where it fits?

Hardware: one RTX 3090 (24 GB VRAM), Ryzen 7 5700X3D, 32 GB RAM. Model: Qwen 3.6 27B. Q5_K_S weights plus a 64k bf16 KV cache fit on the card; 128k does not, because the weights alone fill too much VRAM. IQ4_XS weights are smaller, so IQ4_XS plus a 128k bf16 KV cache fits. Each context length is capped at the maximum where bf16 KV still runs, because bf16 is the reference every quantized mode is measured against. IQ4_XS at 64k is also included to compare directly against Q5_K_S at the same 64k context, isolating the effect of weight quant on cache quality (§9).

The tok/s column in the benchmark data tables is prefill throughput (batch size 2048, ubatch 256), not generation speed. GPU power was lowered during the runs. These numbers are useful only for relative comparison between cache modes on the same hardware; the absolute values do not represent real-world generation performance.

The tests ran on BeeLlama.cpp v0.1.2, which keeps the normal llama.cpp flow but adds DFlash plus TurboQuant/TCQ KV-cache compression. The TurboQuant implementation being evaluated comes from TheTom's llama-cpp-turboquant, while the TCQ variants come from buun's llama.cpp fork and the accompanying TCQ paper and codebooks.

2. PPL Hides the Tail, KLD Exposes It

The PPL numbers look flat across the board. Even turbo modes barely move the average:

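| Cache | Size vs bf16 | Q5_K_S 64k | IQ4_XS 64k | IQ4_XS 128k |
|---|---|---|---|---|
| bf16 | 100% | 5.4800 | 5.5169 | 5.2724 |
| q8_0 | 53.1% | 5.4774 | 5.5157 | 5.2716 |
| q5_1 | 37.5% | 5.4777 | 5.5181 | 5.2723 |
| q5_0 | 34.4% | 5.4802 | 5.5175 | 5.2738 |
| q4_1 | 31.3% | 5.4808 | 5.5237 | 5.2772 |
| q4_0 | 28.1% | 5.4877 | 5.5251 | 5.2803 |
| turbo4 | 25.8% | 5.4841 | 5.5277 | 5.2822 |
| turbo3_tcq | 20.3% | 5.5054 | 5.5426 | 5.2985 |
| turbo3 | 19.5% | 5.5149 | 5.5561 | 5.3084 |
| turbo2_tcq | 14.1% | 5.5705 | 5.6085 | 5.3513 |
| turbo2 | 13.3% | 5.6403 | 5.6823 | 5.4287 |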

Full PPL results for all cache modes and configurations are in §11.1.

q5_0 uses 34.4% of bf16 KV, a 65.6% memory saving; q4_0 uses 28.1%. Through q4_0, the entire PPL range is about 0.01 in every column. Even turbo3_tcq at 20.3% only adds about 0.026 over bf16. turbo2 at 13.3% shows a visible hit, but even that is under 0.17 PPL absolute. If PPL were the whole story, the recommendation would be simple: compress aggressively and move on.

PPL averages over every token equally, so one position that destroys a JSON key or hallucinates a closing brace gets diluted by thousands of unremarkable tokens. The metric that picks up that tail damage is KL divergence against the bf16 baseline, specifically the 99.9% KLD: the 99.9th-percentile per-position divergence, which marks where the worst 0.1% of positions begin. The 99.9% precision column converts it to a percentage: 100 * exp(-(quantKLD - bf16KLD)).
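As a quick check on that formula, here is a minimal sketch that applies it to a few 99.9% KLD values copied from the Q5_K_S 64k table. The function name `kld_precision` is made up for this example; only the formula itself comes from the article.

```python
import math

def kld_precision(quant_kld: float, bf16_kld: float) -> float:
    """Precision percentage as defined above: 100 * exp(-(quantKLD - bf16KLD))."""
    return 100.0 * math.exp(-(quant_kld - bf16_kld))

# 99.9% KLD values from the Q5_K_S 64k KL table
BF16_TAIL = 0.023258
TAILS = {"q8_0": 0.078709, "q5_0": 0.099073, "q4_0": 0.130419, "turbo2": 0.903576}

for name, tail in TAILS.items():
    print(f"{name:8s} {kld_precision(tail, BF16_TAIL):6.2f}%")
```

Because the bf16 row's own KLD is subtracted first, bf16 maps to exactly 100% and every other mode is scored only on the divergence it adds on top of that baseline.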

The Q5_K_S 64k KL table shows what PPL hides. Mean precision stays above 99% for almost every mode, but the 99.9% column tells a different story:

| Cache | Size vs bf16 | Mean KLD | 99% KLD | 99.9% KLD | 99.9% prec. | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 0.005234 | 0.023258 | 100.00% | 850.81 |
| q8_0 | 53.1% | 0.002328 | 0.019669 | 0.078709 | 94.61% | 851.11 |
| q8_0-q5_1 | 45.3% | 0.002529 | 0.020346 | 0.082880 | 94.21% | 828.63 |
| q8_0-q5_0 | 43.8% | 0.002656 | 0.021039 | 0.088486 | 93.69% | 847.33 |
| q8_0-q4_1 | 42.2% | 0.003080 | 0.023390 | 0.099080 | 92.70% | 786.54 |
| q8_0-q4_0 | 40.6% | 0.003316 | 0.024892 | 0.104680 | 92.18% | 849.37 |
| q5_1 | 37.5% | 0.002911 | 0.022604 | 0.098354 | 92.77% | 841.65 |
| q8_0-turbo3_tcq | 36.7% | 0.005090 | 0.037056 | 0.149387 | 88.15% | 817.57 |
| q5_0 | 34.4% | 0.003206 | 0.022759 | 0.099073 | 92.70% | 849.79 |
| q5_1-q4_1 | 34.4% | 0.003380 | 0.025886 | 0.095011 | 93.08% | 846.27 |
| q5_0-q4_1 | 32.8% | 0.003471 | 0.025829 | 0.099618 | 92.65% | 847.59 |
| q5_1-q4_0 | 32.8% | 0.003626 | 0.025668 | 0.108649 | 91.82% | 846.91 |
| q4_1 | 31.3% | 0.004476 | 0.031166 | 0.141813 | 88.82% | 854.33 |
| q5_0-q4_0 | 31.3% | 0.003581 | 0.027423 | 0.113332 | 91.39% | 847.64 |
| q5_1-turbo3_tcq | 28.9% | 0.005594 | 0.038175 | 0.144591 | 88.57% | 816.05 |
| q4_0 | 28.1% | 0.004711 | 0.033663 | 0.130419 | 89.84% | 855.08 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 0.038214 | 0.158514 | 87.35% | 815.80 |
| q5_0-turbo3 | 27.0% | 0.007097 | 0.048761 | 0.192428 | 84.44% | 837.90 |
| q4_1-turbo3_tcq | 25.8% | 0.006184 | 0.042997 | 0.174831 | 85.94% | 816.95 |
| turbo4 | 25.8% | 0.004760 | 0.035205 | 0.138370 | 89.13% | 705.32 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 0.045421 | 0.186572 | 84.93% | 821.89 |
| q4_0-turbo3 | 23.8% | 0.008235 | 0.056527 | 0.222154 | 81.96% | 839.29 |
| q4_0-turbo2_tcq | 21.1% | 0.015168 | 0.105461 | 0.395244 | 68.94% | 826.07 |
| turbo3_tcq | 20.3% | 0.007978 | 0.058286 | 0.227104 | 81.56% | 795.20 |
| turbo3 | 19.5% | 0.011181 | 0.082015 | 0.296060 | 76.12% | 836.75 |
| turbo3_tcq-turbo2_tcq | 17.2% | 0.016386 | 0.115072 | 0.437043 | 66.11% | 796.16 |
| turbo3-turbo2 | 16.4% | 0.023985 | 0.168258 | 0.605087 | 55.89% | 831.88 |
| turbo2_tcq | 14.1% | 0.023073 | 0.170350 | 0.632401 | 54.38% | 807.25 |
| turbo2 | 13.3% | 0.036230 | 0.276438 | 0.903576 | 41.47% | 842.29 |

Full KL tables for all three configurations are in §11.2.

The mean KLD column still looks reasonable: most modes stay below 0.01. But the 99.9% column diverges sharply. q5_0 at 34.4% of bf16 KV has a 99.9% KLD of 0.099, which is 31× its mean. q4_0 at 28.1% jumps to 0.130, which looks small as a number but is a 32% increase over q5_0 in the tail, the kind of shift that can break tool calls. Below q4_0, turbo modes fall off a cliff: turbo3_tcq at 20.3% of bf16 KV reaches 0.227 in the tail, and turbo2 at 13.3% hits 0.904, roughly one full nat of divergence at the worst one-in-a-thousand positions.

The size column makes the comparison direct. q4_0 saves 71.9% of bf16 KV and keeps 89.84% of 99.9% precision. turbo3_tcq saves 79.7% and keeps 81.56%. turbo2 saves 86.7% and collapses to 41.47%. The PPL table suggested turbo3_tcq was fine; the tail says it is not, unless you genuinely need the memory and can accept the risk. And then turbo2 is only viable for tasks where exact structure does not matter.
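The gap between the mean and the 99.9% column is easy to reproduce on toy data. The sketch below computes a per-position KL divergence from two sets of logits and then summarizes a synthetic run in which 20 out of 10,000 positions flip their top token. The 3-token vocabulary, the logit values, and the nearest-rank percentile are all assumptions made for illustration; the benchmark's own tooling may interpolate percentiles differently.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    lp, lq = log_softmax(ref_logits), log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def percentile(values, pct):
    """Nearest-rank percentile (an assumption; other tools may interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]

# Hypothetical logits: reference, a barely perturbed copy, and a top-token flip.
ref = [2.0, 0.5, -1.0]
slightly_off = [2.0, 0.55, -1.0]
badly_off = [0.0, 2.0, -1.0]

klds = [kl_divergence(ref, slightly_off)] * 9980 + [kl_divergence(ref, badly_off)] * 20
mean_kld = sum(klds) / len(klds)
tail_kld = percentile(klds, 99.9)

print(f"mean KLD  = {mean_kld:.6f}")
print(f"99.9% KLD = {tail_kld:.6f}")
```

The mean stays in the same small range the tables show for well-behaved modes, while the 99.9th percentile lands directly on the flipped positions. An averaged metric such as PPL smooths over exactly this pattern.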

3. Rotation Changed the Playing Field

The q4/q5/q8 modes in these results are not naive scalar quantization. llama.cpp applies a random rotation to KV cache vectors before quantizing them — the same trick TurboQuant uses. The rotation spreads outlier energy across coordinates, so the scalar quantizer faces a more uniform distribution and makes fewer catastrophic rounding decisions on extreme values. The difference is what happens after rotation: q4/q5 quantize each value independently to 4–5 bits with a scalar codebook, while TurboQuant quantizes to 2–3 bits with a scalar codebook and optionally adds a QJL residual for inner-product estimation. TCQ (§4) constrains index sequences through a trellis instead.

TurboQuant is also slower. turbo4 runs at ~705 tok/s (prefill) versus ~850 tok/s for q4_0 on the Q5_K_S config, a 17% throughput penalty. turbo3_tcq runs at ~794–844 tok/s on the two 64k configs and ~647 tok/s at 128k, slower than q4_0 in every case. The rotation and QJL stages add compute that scalar quantization avoids.

The rotation overlap means TurboQuant is not competing against a helpless baseline. It is competing against q4/q5 modes that already benefit from the same outlier smoothing. At 2–3 bits, the scalar codebook is starved enough that rotation alone cannot save you, and TurboQuant's extra structure matters. At 4 bits, turbo4 has no quality advantage over q4_0, saves almost no memory (25.8% vs 28.1% of bf16 KV), and runs slower. The value in TurboQuant is at the low bit rates where q4/q5 cannot reach at all.
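To make the outlier-smoothing effect concrete, here is a toy sketch: one vector with a single large outlier is quantized to 4 bits directly, then again after a sign-flipped Walsh-Hadamard rotation. Everything here is illustrative. The single shared scale is a simplification of real block formats, the vector and outlier are synthetic, and using an FWHT for the rotation is an assumption borrowed from the TCQ description in §4, not a statement about how llama.cpp implements it.

```python
import math
import random

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    a = list(values)
    n = len(a)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in a]

def quantize_absmax(values, bits=4):
    """Symmetric scalar quantizer with one shared scale, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in values) or 1.0
    scale = amax / qmax
    return [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
n = 64
vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
vec[7] = 50.0                                   # one synthetic outlier channel
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

# Direct quantization: the outlier sets the scale and flattens everything else.
plain_err = mse(vec, quantize_absmax(vec))

# Rotate (sign flip + FWHT), quantize, then undo the rotation.
rotated = fwht([s * x for s, x in zip(signs, vec)])
restored = [s * y for s, y in zip(signs, fwht(quantize_absmax(rotated)))]
rotated_err = mse(vec, restored)

print(f"max |x| before rotation: {max(abs(x) for x in vec):.2f}")
print(f"max |x| after rotation:  {max(abs(x) for x in rotated):.2f}")
print(f"MSE direct:  {plain_err:.4f}")
print(f"MSE rotated: {rotated_err:.4f}")
```

The normalized transform is its own inverse, so decoding costs the same as encoding. The point of the sketch is only that any quantizer placed after the rotation sees a much smaller dynamic range, which is why rotated q4/q5 are a stronger baseline than naive scalar quantization.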

4. What TCQ Adds

TCQ changes the low-bit side of the table. buun's paper ("Closing the Gap: Trellis-Coded Quantization for KV Cache at 2–3 Bits") describes TCQ as the first application of trellis-coded quantization to LLM KV cache compression. Instead of choosing each scalar index independently, TCQ constrains index sequences through a finite-state trellis, which gives a much larger effective codebook at the same bit rate. The encoder uses Viterbi to find a globally optimal assignment, while the decode path stays parallel: each value can be decoded in O(1), which is why the method can still fit GPU flash attention kernels.
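The encode/decode asymmetry can be sketched with a bitshift trellis, one construction in which the state is simply the last few bits of the bitstream, so any value decodes by reading a fixed-width window. This is a toy under stated assumptions: the random Gaussian codebook, the 8-bit state, and the trellis layout are all hypothetical and are not taken from buun's paper or codebooks.

```python
import random

def viterbi_encode(x, codebook, state_bits, k):
    """Minimum-squared-error state path through a bitshift trellis (k bits per value)."""
    n_states, mask = 1 << state_bits, (1 << state_bits) - 1
    cost = [(x[0] - codebook[s]) ** 2 for s in range(n_states)]  # any start state
    backptrs = []
    for sample in x[1:]:
        new_cost = [float("inf")] * n_states
        ptr = [0] * n_states
        for prev in range(n_states):
            for b in range(1 << k):
                s = ((prev << k) | b) & mask      # shift k fresh bits into the window
                c = cost[prev] + (sample - codebook[s]) ** 2
                if c < new_cost[s]:
                    new_cost[s], ptr[s] = c, prev
        cost = new_cost
        backptrs.append(ptr)
    state = min(range(n_states), key=cost.__getitem__)
    total = cost[state]
    path = [state]
    for ptr in reversed(backptrs):                # walk the back-pointers
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, total

def greedy_encode(x, codebook, state_bits, k):
    """Baseline with no lookahead: take the locally best branch at every step."""
    n_states, mask = 1 << state_bits, (1 << state_bits) - 1
    state = min(range(n_states), key=lambda s: (x[0] - codebook[s]) ** 2)
    total = (x[0] - codebook[state]) ** 2
    for sample in x[1:]:
        branches = [((state << k) | b) & mask for b in range(1 << k)]
        state = min(branches, key=lambda s: (sample - codebook[s]) ** 2)
        total += (sample - codebook[state]) ** 2
    return total

def pack(path, k):
    """Bitstream = first state, then only the k new bits of each later state."""
    bits = path[0]
    for s in path[1:]:
        bits = (bits << k) | (s & ((1 << k) - 1))
    return bits

def decode_at(bits, t, n, codebook, state_bits, k):
    """O(1) decode of value t: one shift, one mask, one table lookup."""
    window = (bits >> ((n - 1 - t) * k)) & ((1 << state_bits) - 1)
    return codebook[window]

rng = random.Random(0)
STATE_BITS, K, N = 8, 2, 64                       # hypothetical toy parameters
codebook = [rng.gauss(0.0, 1.0) for _ in range(1 << STATE_BITS)]
x = [rng.gauss(0.0, 1.0) for _ in range(N)]

path, tcq_cost = viterbi_encode(x, codebook, STATE_BITS, K)
bits = pack(path, K)
decoded = [decode_at(bits, t, N, codebook, STATE_BITS, K) for t in range(N)]
greedy_cost = greedy_encode(x, codebook, STATE_BITS, K)

print(f"Viterbi squared error: {tcq_cost:.4f}")
print(f"Greedy squared error:  {greedy_cost:.4f}")
```

Each value costs 2 bits in the stream but is reconstructed from a 256-entry table, which is the "larger effective codebook at the same bit rate" idea. The search cost sits entirely in the encoder, and the decoder never has to replay the sequence.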

The reported TCQ results claim 10–44% KL-divergence reduction over scalar quantization at 2–3 bits per value, with context-adaptive norm scaling and FWHT rotation plus random sign flips. That matches what this BeeLlama bench sees directionally: turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than plain turbo2, especially in the 99.9% tail.

There is no turbo4_tcq in this table, and that makes sense. TCQ is most valuable where independent scalar codes are starved for bits. At 2–3 bits, the larger effective codebook can close a visible gap. At 4 bits, the ordinary scalar codebook already has enough resolution, and the extra trellis decode overhead is not worth paying for marginal quality gains on top of a mode that is already uncompetitive with q4_0.

5. Asymmetric KV: Where the Bits Should Go

The asymmetric rows are the most useful part of the final report. They confirm the K-first intuition, but with a limit.

At 31.3% of bf16 KV, q5_0-q4_0 has the same size as symmetric q4_1, yet it beats q4_1 across all three 99.9% precision tables: 91.39% vs 88.82% in Q5_K_S 64k, 96.06% vs 93.73% in IQ4_XS 64k, and 96.18% vs 95.04% in IQ4_XS 128k. Spending the same budget of bits on a stronger K and weaker V outperforms splitting them evenly.

q5_0-q4_0 is the clean trade at that size: same footprint as q4_1, but better tail behavior from spending bits asymmetrically toward K. One step up, q5_0-q4_1 costs 32.8% of bf16 KV, only 1.5 points more than q5_0-q4_0, and improves the 99.9% tail to 92.65%, 96.21%, and 96.50% across the three KL tables. Not a huge jump, but it is cheap.

The direct comparison between q5_0-q4_1 and q5_1-q4_0 answers the next-bit question. Both are the same size. q5_0-q4_1 is better in Q5_K_S and IQ4_XS 128k, while IQ4_XS 64k is effectively tied in 99.9% precision. After K reaches q5_0, the next useful bit appears to go to V, not to q5_1 K.

q5_1 remains useful as an extra-conservative option, but it is not where the marginal value is.
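The size column for asymmetric pairs follows from simple bits-per-value arithmetic, sketched below. The q* figures come from ggml's block layouts (for example, q4_0 packs 32 four-bit values plus one fp16 scale into 18 bytes). The turbo* figures are not from any spec: they are inferred from the size column of the tables in this article. The sketch also assumes K and V hold the same number of values.

```python
# Bits stored per cached value. turbo* entries are inferred from the article's
# "Size vs bf16" column, not taken from any format specification.
BITS_PER_VALUE = {
    "bf16": 16.0,
    "q8_0": 8.5, "q5_1": 6.0, "q5_0": 5.5, "q4_1": 5.0, "q4_0": 4.5,
    "turbo4": 4.125, "turbo3_tcq": 3.25, "turbo3": 3.125,
    "turbo2_tcq": 2.25, "turbo2": 2.125,
}

def size_vs_bf16(k_type, v_type=None):
    """Percent of bf16 KV size, assuming K and V hold equal numbers of values."""
    v_type = v_type or k_type
    avg_bits = (BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type]) / 2
    return 100.0 * avg_bits / BITS_PER_VALUE["bf16"]

def kv_mib(bf16_mib, k_type, v_type=None):
    """KV cache size in MiB given the bf16 size at the same context length."""
    return bf16_mib * size_vs_bf16(k_type, v_type) / 100.0

# 4096 MiB is the bf16 KV size at 64k from the benchmark tables.
for k, v in [("q4_1", "q4_1"), ("q5_0", "q4_0"), ("q5_0", "q4_1"), ("q5_1", "q4_0")]:
    print(f"{k}-{v}: {kv_mib(4096, k, v):.0f} MiB")
```

q5_0-q4_0 averages (5.5 + 4.5) / 2 = 5.0 bits per value, exactly the q4_1 budget, which is why the two share a row size in the tables while differing in the tail.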

6. TurboQuant's Narrowed Niche

The bench is harsh on turbo modes, but it does not invalidate them. I'd say it just clearly defines their niche, albeit a much narrower one than I expected.

turbo4 is the easiest mode to drop from recommendations. At IQ4_XS 128k, it saves only 192 MiB over q4_0, 2112 MiB vs 2304 MiB, while throughput drops from 710.51 tok/s to 519.76 tok/s and 99.9% precision falls from 94.34% to 93.07%. That is not a good exchange.

Plain turbo3 and turbo2 are also hard to recommend when TCQ variants are available. turbo3_tcq beats turbo3 clearly in the tail, and turbo2_tcq is much better than turbo2. Hardware support matters here: if a backend does not support the TCQ variants, users have to choose between paying more VRAM for q4/q5 or accepting the quality loss from the plain turbo modes. But that should be a documented fallback, not the main path.

The valuable turbo result is turbo3_tcq. It lands at 20.3% of bf16 KV, with 99.9% precision of 81.56%, 84.51%, and 86.06% across the three KL tables. That is not q4 quality, and it should not be sold as such. But it is a real precision/VRAM/speed compromise for users who need the context to fit.

turbo2_tcq is the last resort. It keeps broad PPL better than the tail suggests, but the 99.9% precision values are 54.38%, 58.62%, and 63.53%. That is a mode for rough summarization, long-context reading where exactness is not the job, or setups that otherwise cannot run. It is not a tool-call, JSON, code-edit, or math mode.

There is also a small middle rung: q4_0-turbo3_tcq. It uses 24.2% of bf16 KV and sits between q4_0 and symmetric turbo3_tcq in the 99.9% tail. It too is not a precision preset, but it's useful for users who need something smaller than q4 and cleaner than full turbo3_tcq.

7. The q8 Fidelity Tier

Symmetric q8_0 at 53.1% of bf16 KV sits at the top of the size range, and its numbers show it. Mean precision is 99.80% on Q5_K_S, 99.95% on IQ4_XS, and PPL precision actually exceeds bf16 (100.05% on Q5_K_S, 100.02% on IQ4_XS): the quantization noise is small enough that median perplexity changes by less than run-to-run variance. The 99.9% tail tells a slightly different story: 94.61% on Q5_K_S, 98.69% on IQ4_XS 64k, 98.52% on 128k.

Those are the highest tail numbers in the bench, but they are not perfect, and the Q5_K_S gap from 100% is visible. Combined with VRAM costs, this makes full q8_0-q8_0 a validation and blame-isolation mode, not a practical default.

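The asymmetric pairs with q8_0 K walk the V down one tier at a time, and each step has a cost that mirrors the V-tier structure the rest of the bench already found. On Q5_K_S 64k (99.9% precision from the KL table in §2, step cost in parentheses):

  q8_0 / q8_0: 94.61%
  q8_0 / q5_1: 94.21% (-0.40)
  q8_0 / q5_0: 93.69% (-0.52)
  q8_0 / q4_1: 92.70% (-0.99)
  q8_0 / q4_0: 92.18% (-0.52)
  q8_0 / turbo3_tcq: 88.15% (-4.03)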

IQ4_XS shifts the absolute numbers but preserves the step structure. V steps inside q5 cost 0.09–0.14 points, the q5-to-q4 cliff costs 0.94–1.20 points, and the q4_0-to-turbo3_tcq drop is 4.30 points on both 64k and 128k. The same V-tier boundaries that the symmetric and q5-asymmetric modes already showed also govern the q8-tier walkdown.

That walkdown makes the fidelity tier's practical boundaries clear. At q8_0 K, V values inside q5 cost about half a point per tier (0.40 and 0.52 on Q5_K_S) and stay above 93.6% tail precision; crossing below q5 V costs a full point. The two useful presets that balance VRAM and precision well are q8_0-q5_1 and q8_0-q5_0.

q8_0-q5_1 at 45.3% of bf16 KV is the best precision-per-size trade at the high end. It gives up only 0.40 points of 99.9% tail precision versus symmetric q8_0 (94.21% vs 94.61%) while shrinking the cache by 7.8 percentage points. On IQ4_XS, the loss is just 0.41 points on 64k (98.28% vs 98.69%) and 0.39 on 128k. Mean precision stays at 99.78% or higher on both weight quants. That is the fidelity preset: the smallest q8-tier config that stays above 94% tail precision, and the one listed as "Best precision/size at high end" in §8.

q8_0-q5_0 at 43.8% saves another 1.5% of bf16 KV but costs 0.52 more tail points (93.69% vs 94.21%). It exists as a fallback if q8_0-q5_1 does not fit by a narrow margin, not as a peer recommendation. Where dropping V from q8_0 to q5_1 costs 0.40 points for a 7.8% size reduction, dropping from q5_1 to q5_0 costs 0.52 points for only 1.5%.

Below q5 V, the returns collapse. q8_0-q4_1 at 42.2% loses 0.99 points from q8_0-q5_0 for 1.6% less size. q8_0-q4_0 at 40.6% loses another 0.52 points. Neither beats symmetric q5_0 (34.4%, 92.70%) in the tail despite being 6–8% larger, because the q5-to-q4 V cliff erases the q8_0 K advantage. The comparison with symmetric q5_1 makes this concrete: q8_0-q4_1 is 4.7% larger yet nearly tied with q5_1 at 92.70% vs 92.77%.

The q8_0 K tier buys back nearly all of the q4_1 V penalty, confirming that K precision dominates the tail, but the net result is no better than a fully symmetric pair that costs less. The same pattern holds on IQ4_XS (q8_0-q4_1 at 96.99% is 0.47 points below q5_1 at 97.46% on 64k, 0.65 on 128k). q8_0-q4_1 is more interesting technically than practically, as it clearly shows the limits of K-vs-V trade-offs. (Its Q5_K_S speed of 786.54 tok/s is anomalous; other asymmetric pairs at similar sizes and the IQ4_XS re-run both show normal throughput.)
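The diminishing returns in this section can be put on one scale by dividing the tail precision lost at each V step by the size saved. The sketch below does that with the Q5_K_S 64k values from the KL table; the "points lost per point of size saved" ratio is a convenience measure introduced here, not a metric from the benchmark.

```python
# (K-V pair, size as % of bf16 KV, 99.9% precision) from the Q5_K_S 64k KL table
WALKDOWN = [
    ("q8_0-q8_0", 53.1, 94.61),
    ("q8_0-q5_1", 45.3, 94.21),
    ("q8_0-q5_0", 43.8, 93.69),
    ("q8_0-q4_1", 42.2, 92.70),
    ("q8_0-q4_0", 40.6, 92.18),
    ("q8_0-turbo3_tcq", 36.7, 88.15),
]

def step_costs(rows):
    """For each V step: (name, size points saved, tail points lost, lost per saved)."""
    steps = []
    for (_, size_a, prec_a), (name_b, size_b, prec_b) in zip(rows, rows[1:]):
        saved = size_a - size_b
        lost = prec_a - prec_b
        steps.append((name_b, saved, lost, lost / saved))
    return steps

for name, saved, lost, ratio in step_costs(WALKDOWN):
    print(f"{name:16s} saves {saved:4.1f} pts, loses {lost:4.2f} pts, {ratio:5.3f} lost per pt saved")
```

The first step (V from q8_0 to q5_1) is by far the cheapest per point of size saved; every later step costs several times more per point, and the drop to turbo3_tcq V is the most expensive of all.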

8. The Preset Ladder

The old recommendation set treated TurboQuant too broadly. The new one should separate ordinary precision modes, fidelity modes, and long-context survival modes.

| K / V | % of bf16 size | 99.9% precision | What it is for |
|---|---|---|---|
| bf16 / bf16 | 100 | 100.00% | Preserving full quality |
| q8_0 / q8_0 | 53.1 | 94.61% | Compression with minimal losses |
| q8_0 / q5_1 | 45.3 | 94.21% | Best precision/size at high end |
| q8_0 / q5_0 | 43.8 | 93.69% | If q8_0 / q5_1 doesn't fit just a bit |
| q5_0 / q5_0 | 34.4 | 92.70% | Good precision/size, relatively safe |
| q5_0 / q4_1 | 32.8 | 92.65% | Best default if VRAM-constrained |
| q5_0 / q4_0 | 31.3 | 91.39% | If q5_0 / q4_1 doesn't fit just a bit |
| q4_0 / q4_0 | 28.1 | 89.84% | Memory saving with precision loss |
| q4_0 / turbo3_tcq | 24.2 | 84.93% | Smaller than q4, cleaner than turbo3_tcq |
| turbo3_tcq / turbo3_tcq | 20.3 | 81.56% | Viable as extreme compression |
| turbo2_tcq / turbo2_tcq | 14.1 | 54.38% | Last resort: not for code and tool calls |

The modes I would not present as first-class recommendations are q4_1/q4_1, q5_1/q4_1, q5_1/q4_0, q8_0/q4_0, q8_0/turbo3_tcq, q5_0/turbo3_tcq, q5_1/turbo3_tcq, q4_1/turbo3_tcq, q4_0/turbo2_tcq, plain turbo3, plain turbo2, and turbo4 unless its speed/implementation profile changes.

That is not a rejection of TurboQuant, but a better boundary. If q5 or q4 fits, use q5 or q4. If the job is strict and tool-heavy, prefer q5_0. If the job is memory-bound but still ordinary, use q5_0/q4_0 or q5_0/q4_1. If the job cannot fit without deeper KV compression, that is where turbo3_tcq earns its place. If even that does not fit, turbo2_tcq exists, but the user should know what they are giving up.

The benchmark does not show "Q beats TQ" or "TQ beats Q", but that the line between them has moved. Rotation made ordinary q4/q5 quite decent, TCQ makes very low-bit turbo modes much more viable than plain turbo2/turbo3, and the user wins because there's a somewhat clear preset ladder with obvious trade-offs.
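The ladder reads naturally as "take the first rung that fits". The sketch below encodes that rule using the KV sizes at 64k and the Q5_K_S tail precision from the tables, scaling size linearly with context as the 64k and 128k tables suggest. `pick_preset` and `cache_flags` are hypothetical helpers. `--cache-type-k` and `--cache-type-v` are existing llama.cpp options, but whether a given build accepts the turbo type names is an assumption that depends on the fork.

```python
# (K type, V type, KV MiB at 64k, 99.9% precision on Q5_K_S), ordered top to bottom
LADDER = [
    ("bf16", "bf16", 4096, 100.00),
    ("q8_0", "q8_0", 2176, 94.61),
    ("q8_0", "q5_1", 1856, 94.21),
    ("q8_0", "q5_0", 1792, 93.69),
    ("q5_0", "q5_0", 1408, 92.70),
    ("q5_0", "q4_1", 1344, 92.65),
    ("q5_0", "q4_0", 1280, 91.39),
    ("q4_0", "q4_0", 1152, 89.84),
    ("q4_0", "turbo3_tcq", 992, 84.93),
    ("turbo3_tcq", "turbo3_tcq", 832, 81.56),
    ("turbo2_tcq", "turbo2_tcq", 576, 54.38),
]

def pick_preset(kv_budget_mib, context_k=64):
    """Return the highest rung whose KV cache fits the budget, or None."""
    scale = context_k / 64          # KV size grows linearly with context length
    for k_type, v_type, mib_at_64k, _precision in LADDER:
        if mib_at_64k * scale <= kv_budget_mib:
            return k_type, v_type
    return None

def cache_flags(k_type, v_type):
    """Command-line fragment for the chosen pair (turbo names are fork-specific)."""
    return ["--cache-type-k", k_type, "--cache-type-v", v_type]

choice = pick_preset(kv_budget_mib=1300, context_k=64)
print(choice, cache_flags(*choice))
```

The KV budget here is whatever VRAM remains after weights and compute buffers, which this sketch does not try to estimate. It also ignores the task-specific advice above, such as preferring q5_0 for strict, tool-heavy work even when a smaller rung would fit.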

9. Weight Quant Changes the Cost of Cache Quant

The bench ran two weight quantizations: Q5_K_S and IQ4_XS. Same model, same context lengths where they overlap, same cache modes. The comparison is a controlled variable that most KV cache benchmarks do not isolate. It should be isolated, because weight precision changes how much cache quantization costs in the tail.

At 64k context, most symmetric cache modes land 3–5 points higher in 99.9% precision on IQ4_XS than on Q5_K_S (turbo3_tcq is just under 3 points, and turbo2 has already collapsed on both):

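| Cache | Q5_K_S 99.9% prec. | IQ4_XS 99.9% prec. | Gap |
|---|---|---|---|
| q8_0 | 94.61% | 98.69% | +4.08 |
| q5_0 | 92.70% | 97.32% | +4.62 |
| q4_0 | 89.84% | 93.01% | +3.17 |
| turbo4 | 89.13% | 93.03% | +3.90 |
| turbo3_tcq | 81.56% | 84.51% | +2.95 |
| turbo3 | 76.12% | 79.27% | +3.15 |
| turbo2_tcq | 54.38% | 58.62% | +4.24 |
| turbo2 | 41.47% | 41.01% | -0.46 |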

IQ4_XS is consistently less damaged by the same cache quant. The gap exists across almost every mode and only collapses at turbo2, where both weight quants are already on the floor.

The raw KLD numbers make the gap sharper. Q5_K_S q8_0 has a 99.9% KLD of 0.078709; IQ4_XS q8_0 has 0.017372. Same cache mode, 4.5× the tail damage on the higher-precision weights. At q4_0 the ratio drops to 1.7× (0.130419 vs 0.076663): the cache quantization noise starts to dominate over the weight quantization the deeper you compress.

The likely reason is in what those KV values carry. Higher-precision weights such as Q5_K_S should produce richer KV distributions: structurally important tokens, sharp probability differences, outlier activations with real signal. IQ4_XS already injects weight noise into the attention path, so its KV distributions are probably smoother and carry less fine detail.

When you quantize those KV values, you lose more from Q5_K_S because there is more to lose. The absolute output quality of Q5_K_S with bf16 KV is still better, but KL divergence is measured against each model's own bf16 baseline, so a Q5_K_S model at q4_0 cache has moved further from its own potential than an IQ4_XS model at q4_0 has from its (lower) potential.

The practical resolution: the same cache preset is tail-safer on a lower-weight-precision model than on a higher-weight-precision model. If you are running Q5_K_S, you should lean harder toward q5_0 over q4_0 than you would on IQ4_XS.

10. What Else Was Tried

Wikitext PPL and KL divergence are the only benchmarks that made it into this article, but several other tests were tried and abandoned because they could not distinguish between cache quantization modes.

Multiple-choice benchmarks (ARC-Challenge, ARC-Easy, HellaSwag, MMLU) all run at short context, well under 10K tokens. The KV cache is tiny at those lengths, and every cache mode scores within noise of every other mode. They measure whether the model knows things, not whether quantized cache degrades retrieval or reasoning. The spread across cache types was indistinguishable from run-to-run variance.

Perplexity-based HellaSwag and Winogrande have the same problem: they do not exercise the KV cache enough to show a difference. They confirm that the model still speaks English after cache quantization, which was never in doubt.

Synthetic passkey and needle-in-a-haystack retrieval were tried at 32K context. Every cache type, including turbo2, scored 100%. The model can regurgitate a hidden string just fine even with aggressively quantized cache, because retrieval of a single attended token is a different failure mode than slowly diverging output distributions over thousands of tokens. A 30-question passkey test does not have the statistical power to catch what KL divergence catches.

JSON schema generation (structured output constrained by a schema) also hit 100% for all cache types. Cache quantization does not break template compliance at the scale tested.

Native passkey via llama-passkey was excluded because it runs at the model's native context length of 262144 tokens, which OOMs on 24 GB VRAM, and it has no --ctx-size override.

AIME (30 mathematics competition problems) is the only discarded benchmark that could show a difference. The problem is cost: on an RTX 3090, one pass through the 30 problems for a single cache configuration would take roughly 3–4 days of non-stop inference. I committed to it anyway and ran bf16, q8_0, and q4_0 before realizing that a single pass is nearly useless: 30 questions is wildly noisy for a binary measure of correctness, with q8_0 and q4_0 getting the same result. You need dozens of runs to average out the variance, and by the time that finishes, AGI will have already been invented and the whole topic of local inference optimization will be moot.
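The noise problem can be put in numbers with a binomial approximation. The sketch below assumes independent questions and a true accuracy of 50%, which is the worst case for variance; the article does not report the actual AIME scores, so both the 50% and the 5-point gap are hypothetical inputs.

```python
import math

def accuracy_std_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent pass/fail questions."""
    return math.sqrt(p * (1 - p) / n)

def questions_needed(p: float, gap: float, z: float = 1.96) -> int:
    """Questions per config so a `gap` between two configs exceeds z standard errors.

    The difference of two independent accuracies has variance 2 * p * (1 - p) / n.
    """
    return math.ceil(2 * p * (1 - p) * (z / gap) ** 2)

se = accuracy_std_error(0.5, 30)
print(f"Std. error of one 30-question pass: +/- {100 * se:.1f} points")
print(f"Questions per config to resolve a 5-point gap: {questions_needed(0.5, 0.05)}")
```

Under these assumptions a single 30-question pass carries roughly nine points of standard error, while the cache modes in this article differ by far less than that on any accuracy-style metric. That is why identical q8_0 and q4_0 scores say almost nothing.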


11. Benchmark Data

11.1 Perplexity

Q5_K_S 64k

| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Same top p | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 5.4800 | 100.00% | 0.03465 | 99.647% +/- 0.016% | 851.75 | 326.30 |
| q8_0 | 2176.00 | 53.1% | 5.4774 | 100.05% | 0.03465 | 97.942% +/- 0.039% | 851.57 | 331.27 |
| q5_1 | 1536.00 | 37.5% | 5.4777 | 100.04% | 0.03464 | 97.787% +/- 0.041% | 848.27 | 332.64 |
| q5_0 | 1408.00 | 34.4% | 5.4802 | 100.00% | 0.03466 | 97.707% +/- 0.041% | 848.36 | 332.45 |
| q4_1 | 1280.00 | 31.3% | 5.4808 | 99.99% | 0.03467 | 97.259% +/- 0.045% | 853.49 | 330.43 |
| q4_0 | 1152.00 | 28.1% | 5.4877 | 99.86% | 0.03473 | 97.179% +/- 0.046% | 853.50 | 330.76 |
| turbo4 | 1056.00 | 25.8% | 5.4841 | 99.93% | 0.03468 | 97.037% +/- 0.047% | 705.06 | 395.16 |
| turbo3_tcq | 832.00 | 20.3% | 5.5054 | 99.54% | 0.03480 | 96.265% +/- 0.052% | 794.21 | 353.31 |
| turbo3 | 800.00 | 19.5% | 5.5149 | 99.37% | 0.03493 | 95.517% +/- 0.057% | 802.71 | 344.83 |
| turbo2_tcq | 576.00 | 14.1% | 5.5705 | 98.38% | 0.03566 | 93.456% +/- 0.068% | 805.25 | 348.78 |
| turbo2 | 544.00 | 13.3% | 5.6403 | 97.16% | 0.03581 | 91.646% +/- 0.076% | 840.34 | 335.07 |

IQ4_XS 64k

| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Same top p | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 5.5169 | 100.00% | 0.03497 | 99.776% +/- 0.013% | 909.83 | 336.43 |
| q8_0 | 2176.00 | 53.1% | 5.5157 | 100.02% | 0.03499 | 98.950% +/- 0.028% | 910.43 | 309.49 |
| q5_1 | 1536.00 | 37.5% | 5.5181 | 99.98% | 0.03501 | 98.618% +/- 0.032% | 906.38 | 310.47 |
| q5_0 | 1408.00 | 34.4% | 5.5175 | 99.99% | 0.03500 | 98.553% +/- 0.033% | 906.88 | 310.27 |
| q4_1 | 1280.00 | 31.3% | 5.5237 | 99.88% | 0.03505 | 97.880% +/- 0.040% | 911.35 | 308.85 |
| q4_0 | 1152.00 | 28.1% | 5.5251 | 99.85% | 0.03505 | 97.793% +/- 0.041% | 912.54 | 308.61 |
| turbo4 | 1056.00 | 25.8% | 5.5277 | 99.80% | 0.03508 | 97.652% +/- 0.042% | 746.00 | 372.98 |
| turbo3_tcq | 832.00 | 20.3% | 5.5426 | 99.54% | 0.03513 | 96.569% +/- 0.050% | 844.33 | 331.64 |
| turbo3 | 800.00 | 19.5% | 5.5561 | 99.29% | 0.03533 | 95.746% +/- 0.056% | 882.76 | 318.29 |
| turbo2_tcq | 576.00 | 14.1% | 5.6085 | 98.37% | 0.03599 | 93.669% +/- 0.067% | 858.66 | 326.24 |
| turbo2 | 544.00 | 13.3% | 5.6823 | 97.09% | 0.03621 | 91.865% +/- 0.076% | 898.66 | 312.75 |

IQ4_XS 128k

| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Same top p | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|
| bf16 | 8192.00 | 100.0% | 5.2724 | 100.00% | 0.03269 | 99.995% +/- 0.002% | 703.61 | 389.94 |
| q8_0 | 4352.00 | 53.1% | 5.2716 | 100.02% | 0.03271 | 98.950% +/- 0.028% | 707.02 | 387.84 |
| q5_1 | 3072.00 | 37.5% | 5.2723 | 100.00% | 0.03271 | 98.603% +/- 0.032% | 702.53 | 390.33 |
| q5_0 | 2816.00 | 34.4% | 5.2738 | 99.97% | 0.03272 | 98.543% +/- 0.033% | 703.18 | 390.05 |
| q4_1 | 2560.00 | 31.3% | 5.2772 | 99.91% | 0.03276 | 97.961% +/- 0.039% | 708.53 | 387.23 |
| q4_0 | 2304.00 | 28.1% | 5.2803 | 99.85% | 0.03276 | 97.793% +/- 0.041% | 709.40 | 386.58 |
| turbo4 | 2112.00 | 25.8% | 5.2822 | 99.81% | 0.03281 | 97.639% +/- 0.042% | 520.46 | 520.82 |
| turbo3_tcq | 1664.00 | 20.3% | 5.2985 | 99.51% | 0.03281 | 96.591% +/- 0.050% | 647.40 | 421.95 |
| turbo3 | 1600.00 | 19.5% | 5.3084 | 99.32% | 0.03301 | 95.861% +/- 0.055% | 689.73 | 396.86 |
| turbo2_tcq | 1152.00 | 14.1% | 5.3513 | 98.53% | 0.03363 | 93.807% +/- 0.067% | 655.44 | 416.90 |
| turbo2 | 1088.00 | 13.3% | 5.4287 | 97.12% | 0.03386 | 92.033% +/- 0.075% | 696.53 | 393.24 |

11.2 KL Divergence

KL precision uses 100 * exp(-(quantKLD - bf16KLD)). The 99.9% precision column applies the same formula to 99.9% KLD. Symmetric and asymmetric cache configurations are combined in each table; asymmetric cache names are K-V.

Q5_K_S 64k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 0.000375 | 100.00% | 0.000058 | 0.000568 | 0.001693 | 0.005234 | 0.023258 | 100.00% | 7.374046 | 99.647% +/- 0.016% | 850.81 | 337.59 |
| q8_0 | 2176.00 | 53.1% | 0.002328 | 99.80% | 0.000125 | 0.004233 | 0.006656 | 0.019669 | 0.078709 | 94.61% | 14.355996 | 97.942% +/- 0.039% | 851.11 | 337.66 |
| q8_0-q5_1 | 1856.00 | 45.3% | 0.002529 | 99.78% | 0.000143 | 0.004557 | 0.007108 | 0.020346 | 0.082880 | 94.21% | 15.367683 | 97.884% +/- 0.040% | 828.63 | 346.71 |
| q8_0-q5_0 | 1792.00 | 43.8% | 0.002656 | 99.77% | 0.000168 | 0.004673 | 0.007348 | 0.021039 | 0.088486 | 93.69% | 17.987650 | 97.826% +/- 0.040% | 847.33 | 338.90 |
| q8_0-q4_1 | 1728.00 | 42.2% | 0.003080 | 99.73% | 0.000115 | 0.005645 | 0.008587 | 0.023390 | 0.099080 | 92.70% | 8.073231 | 97.655% +/- 0.042% | 786.54 | 364.58 |
| q8_0-q4_0 | 1664.00 | 40.6% | 0.003316 | 99.71% | 0.000165 | 0.005976 | 0.009075 | 0.024892 | 0.104680 | 92.18% | 13.481506 | 97.532% +/- 0.043% | 849.37 | 338.13 |
| q5_1 | 1536.00 | 37.5% | 0.002911 | 99.75% | 0.000167 | 0.005045 | 0.007916 | 0.022604 | 0.098354 | 92.77% | 13.397068 | 97.787% +/- 0.041% | 841.65 | 341.63 |
| q8_0-turbo3_tcq | 1504.00 | 36.7% | 0.005090 | 99.53% | 0.000188 | 0.009736 | 0.014401 | 0.037056 | 0.149387 | 88.15% | 20.128752 | 96.899% +/- 0.048% | 817.57 | 350.23 |
| q5_0 | 1408.00 | 34.4% | 0.003206 | 99.72% | 0.000286 | 0.005232 | 0.008194 | 0.022759 | 0.099073 | 92.70% | 22.619892 | 97.707% +/- 0.041% | 849.79 | 338.00 |
| q5_1-q4_1 | 1408.00 | 34.4% | 0.003380 | 99.70% | 0.000195 | 0.006140 | 0.009479 | 0.025886 | 0.095011 | 93.08% | 21.394011 | 97.529% +/- 0.043% | 846.27 | 339.25 |
| q5_0-q4_1 | 1344.00 | 32.8% | 0.003471 | 99.69% | 0.000206 | 0.006310 | 0.009582 | 0.025829 | 0.099618 | 92.65% | 21.863117 | 97.539% +/- 0.043% | 847.59 | 339.65 |
| q5_1-q4_0 | 1344.00 | 32.8% | 0.003626 | 99.68% | 0.000212 | 0.006441 | 0.009773 | 0.025668 | 0.108649 | 91.82% | 15.809726 | 97.515% +/- 0.043% | 846.91 | 339.23 |
| q4_1 | 1280.00 | 31.3% | 0.004476 | 99.59% | 0.000267 | 0.007716 | 0.011901 | 0.031166 | 0.141813 | 88.82% | 18.150869 | 97.259% +/- 0.045% | 854.33 | 336.49 |
| q5_0-q4_0 | 1280.00 | 31.3% | 0.003581 | 99.68% | 0.000174 | 0.006600 | 0.010058 | 0.027423 | 0.113332 | 91.39% | 14.938599 | 97.437% +/- 0.044% | 847.64 | 338.79 |
| q5_1-turbo3_tcq | 1184.00 | 28.9% | 0.005594 | 99.48% | 0.000291 | 0.010264 | 0.015324 | 0.038175 | 0.144591 | 88.57% | 24.684429 | 96.878% +/- 0.048% | 816.05 | 350.73 |
| q4_0 | 1152.00 | 28.1% | 0.004711 | 99.57% | 0.000301 | 0.008439 | 0.012949 | 0.033663 | 0.130419 | 89.84% | 21.636135 | 97.179% +/- 0.046% | 855.08 | 336.11 |
| q5_0-turbo3_tcq | 1120.00 | 27.3% | 0.005471 | 99.49% | 0.000265 | 0.010259 | 0.015229 | 0.038214 | 0.158514 | 87.35% | 22.268801 | 96.865% +/- 0.048% | 815.80 | 350.94 |
| q5_0-turbo3 | 1104.00 | 27.0% | 0.007097 | 99.33% | 0.000259 | 0.013747 | 0.020259 | 0.048761 | 0.192428 | 84.44% | 18.094296 | 96.331% +/- 0.052% | 837.90 | 342.47 |
| q4_1-turbo3_tcq | 1056.00 | 25.8% | 0.006184 | 99.42% | 0.000292 | 0.011652 | 0.017320 | 0.042997 | 0.174831 | 85.94% | 25.079035 | 96.663% +/- 0.050% | 816.95 | 350.43 |
| turbo4 | 1056.00 | 25.8% | 0.004760 | 99.55% | 0.000201 | 0.009046 | 0.013692 | 0.035205 | 0.138370 | 89.13% | 13.967494 | 97.037% +/- 0.047% | 705.32 | 401.18 |
| q4_0-turbo3_tcq | 992.00 | 24.2% | 0.006269 | 99.41% | 0.000270 | 0.012220 | 0.018173 | 0.045421 | 0.186572 | 84.93% | 23.157375 | 96.622% +/- 0.050% | 821.89 | 349.67 |
| q4_0-turbo3 | 976.00 | 23.8% | 0.008235 | 99.22% | 0.000336 | 0.015576 | 0.022828 | 0.056527 | 0.222154 | 81.96% | 24.353268 | 96.075% +/- 0.054% | 839.29 | 341.78 |
| q4_0-turbo2_tcq | 864.00 | 21.1% | 0.015168 | 98.53% | 0.000288 | 0.031826 | 0.045569 | 0.105461 | 0.395244 | 68.94% | 20.743238 | 94.591% +/- 0.062% | 826.07 | 347.04 |
| turbo3_tcq | 832.00 | 20.3% | 0.007978 | 99.24% | 0.000267 | 0.015663 | 0.023628 | 0.058286 | 0.227104 | 81.56% | 20.517471 | 96.265% +/- 0.052% | 795.20 | 359.09 |
| turbo3 | 800.00 | 19.5% | 0.011181 | 98.93% | 0.000304 | 0.022805 | 0.034209 | 0.082015 | 0.296060 | 76.12% | 22.977211 | 95.517% +/- 0.057% | 836.75 | 342.73 |
| turbo3_tcq-turbo2_tcq | 704.00 | 17.2% | 0.016386 | 98.41% | 0.000283 | 0.034186 | 0.049133 | 0.115072 | 0.437043 | 66.11% | 18.275532 | 94.379% +/- 0.064% | 796.16 | 358.86 |
| turbo3-turbo2 | 672.00 | 16.4% | 0.023985 | 97.67% | 0.000403 | 0.050100 | 0.072850 | 0.168258 | 0.605087 | 55.89% | 20.812553 | 93.154% +/- 0.070% | 831.88 | 344.85 |
| turbo2_tcq | 576.00 | 14.1% | 0.023073 | 97.76% | 0.000420 | 0.048777 | 0.071865 | 0.170350 | 0.632401 | 54.38% | 24.771320 | 93.456% +/- 0.068% | 807.25 | 354.12 |
| turbo2 | 544.00 | 13.3% | 0.036230 | 96.48% | 0.000465 | 0.078942 | 0.117545 | 0.276438 | 0.903576 | 41.47% | 26.508263 | 91.646% +/- 0.076% | 842.29 | 340.66 |

IQ4_XS 64k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bf16 | 4096.00 | 100.0% | 0.000097 | 100.00% | 0.000020 | 0.000186 | 0.000398 | 0.001062 | 0.004152 | 100.00% | 2.345056 | 99.776% +/- 0.013% | 909.80 | 315.80 |
| q8_0 | 2176.00 | 53.1% | 0.000577 | 99.95% | 0.000073 | 0.000933 | 0.001428 | 0.004050 | 0.017372 | 98.69% | 8.130807 | 98.950% +/- 0.028% | 912.71 | 314.76 |
| q8_0-q5_1 | 1856.00 | 45.3% | 0.000836 | 99.93% | 0.000110 | 0.001291 | 0.001929 | 0.005272 | 0.021544 | 98.28% | 11.861989 | 98.814% +/- 0.030% | 895.23 | 320.91 |
| q8_0-q5_0 | 1792.00 | 43.8% | 0.000881 | 99.92% | 0.000107 | 0.001392 | 0.002092 | 0.005574 | 0.022435 | 98.19% | 10.867614 | 98.714% +/- 0.031% | 906.00 | 316.81 |
| q8_0-q4_1 | 1728.00 | 42.2% | 0.001317 | 99.88% | 0.000122 | 0.002436 | 0.003572 | 0.008974 | 0.034706 | 96.99% | 6.675264 | 98.357% +/- 0.035% | 818.78 | 346.30 |
| q8_0-q4_0 | 1664.00 | 40.6% | 0.001606 | 99.85% | 0.000118 | 0.002793 | 0.004084 | 0.009969 | 0.039299 | 96.55% | 8.993986 | 98.309% +/- 0.036% | 908.09 | 316.08 |
| q5_1 | 1536.00 | 37.5% | 0.001019 | 99.91% | 0.000075 | 0.001787 | 0.002724 | 0.007262 | 0.029854 | 97.46% | 6.707493 | 98.618% +/- 0.032% | 907.45 | 316.44 |
| q8_0-turbo3_tcq | 1504.00 | 36.7% | 0.003336 | 99.68% | 0.000119 | 0.006580 | 0.009451 | 0.022374 | 0.084818 | 92.25% | 11.130499 | 97.411% +/- 0.044% | 871.88 | 328.10 |
| q5_0 | 1408.00 | 34.4% | 0.001135 | 99.90% | 0.000088 | 0.002028 | 0.003113 | 0.008113 | 0.031348 | 97.32% | 8.001913 | 98.553% +/- 0.033% | 908.72 | 315.93 |
| q5_1-q4_1 | 1408.00 | 34.4% | 0.001683 | 99.84% | 0.000140 | 0.002928 | 0.004329 | 0.010956 | 0.038976 | 96.58% | 11.203689 | 98.302% +/- 0.036% | 906.39 | 316.62 |
| q5_0-q4_1 | 1344.00 | 32.8% | 0.001529 | 99.86% | 0.000037 | 0.003073 | 0.004593 | 0.011795 | 0.042828 | 96.21% | 2.328933 | 98.227% +/- 0.036% | 905.93 | 316.98 |
| q5_1-q4_0 | 1344.00 | 32.8% | 0.001813 | 99.83% | 0.000163 | 0.003236 | 0.004782 | 0.011742 | 0.042893 | 96.20% | 18.077213 | 98.160% +/- 0.037% | 905.51 | 317.08 |
| q4_1 | 1280.00 | 31.3% | 0.002316 | 99.78% | 0.000104 | 0.004441 | 0.006776 | 0.016734 | 0.068858 | 93.73% | 8.874204 | 97.880% +/- 0.040% | 913.32 | 314.50 |
| q5_0-q4_0 | 1280.00 | 31.3% | 0.001936 | 99.82% | 0.000147 | 0.003368 | 0.005013 | 0.012419 | 0.044393 | 96.06% | 14.364779 | 98.125% +/- 0.037% | 906.57 | 316.61 |
| q5_1-turbo3_tcq | 1184.00 | 28.9% | 0.003560 | 99.65% | 0.000130 | 0.007025 | 0.010176 | 0.024077 | 0.088706 | 91.89% | 10.154081 | 97.304% +/- 0.045% | 870.78 | 328.52 |
| q4_0 | 1152.00 | 28.1% | 0.002759 | 99.73% | 0.000141 | 0.005219 | 0.007950 | 0.019862 | 0.076663 | 93.01% | 10.045764 | 97.793% +/- 0.041% | 914.36 | 314.20 |
| q5_0-turbo3_tcq | 1120.00 | 27.3% | 0.003600 | 99.65% | 0.000099 | 0.007198 | 0.010485 | 0.024752 | 0.102109 | 90.67% | 9.033820 | 97.295% +/- 0.045% | 869.81 | 328.94 |
| q5_0-turbo3 | 1104.00 | 27.0% | 0.005209 | 99.49% | 0.000150 | 0.010602 | 0.015245 | 0.036580 | 0.134359 | 87.79% | 11.506174 | 96.750% +/- 0.049% | 894.86 | 320.42 |
| q4_1-turbo3_tcq | 1056.00 | 25.8% | 0.004226 | 99.59% | 0.000107 | 0.008465 | 0.012513 | 0.029618 | 0.121854 | 88.90% | 8.723818 | 97.117% +/- 0.046% | 871.93 | 328.15 |
| turbo4 | 1056.00 | 25.8% | 0.002988 | 99.71% | 0.000130 | 0.005881 | 0.008839 | 0.021868 | 0.076363 | 93.03% | 9.168183 | 97.652% +/- 0.042% | 744.85 | 379.35 |
| q4_0-turbo3_tcq | 992.00 | 24.2% | 0.004466 | 99.56% | 0.000123 | 0.009097 | 0.013470 | 0.032288 | 0.108662 | 90.08% | 9.383696 | 97.067% +/- 0.047% | 871.33 | 328.57 |
| q4_0-turbo3 | 976.00 | 23.8% | 0.006007 | 99.41% | 0.000136 | 0.012352 | 0.018044 | 0.042428 | 0.161644 | 85.43% | 11.092446 | 96.508% +/- 0.051% | 897.34 | 319.56 |
| q4_0-turbo2_tcq | 864.00 | 21.1% | 0.013595 | 98.66% | 0.000204 | 0.028687 | 0.041321 | 0.095393 | 0.367825 | 69.51% | 14.222522 | 94.742% +/- 0.062% | 881.14 | 325.07 |
| turbo3_tcq | 832.00 | 20.3% | 0.006038 | 99.41% | 0.000127 | 0.012714 | 0.019058 | 0.046219 | 0.172480 | 84.51% | 9.415985 | 96.569% +/- 0.050% | 845.55 | 337.45 |
| turbo3 | 800.00 | 19.5% | 0.009102 | 99.10% | 0.000164 | 0.019502 | 0.029311 | 0.068266 | 0.236472 | 79.27% | 10.847077 | 95.746% +/- 0.056% | 894.11 | 320.59 |
| turbo3_tcq-turbo2_tcq | 704.00 | 17.2% | 0.014461 | 98.57% | 0.000165 | 0.031014 | 0.045330 | 0.104559 | 0.374854 | 69.02% | 10.150832 | 94.578% +/- 0.063% | 847.59 | 336.76 |
| turbo3-turbo2 | 672.00 | 16.4% | 0.022168 | 97.82% | 0.000271 | 0.046698 | 0.068008 | 0.160327 | 0.602649 | 54.96% | 18.985191 | 93.434% +/- 0.068% | 884.98 | 323.44 |
| turbo2_tcq | 576.00 | 14.1% | 0.020739 | 97.96% | 0.000230 | 0.045497 | 0.068026 | 0.161256 | 0.538190 | 58.62% | 15.582352 | 93.669% +/- 0.067% | 861.17 | 331.75 |
| turbo2 | 544.00 | 13.3% | 0.034380 | 96.63% | 0.000340 | 0.075876 | 0.113734 | 0.265535 | 0.895385 | 41.01% | 19.482079 | 91.865% +/- 0.076% | 901.01 | 318.44 |

IQ4_XS 128k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bf16 | 8192.00 | 100.0% | 0.000000 | 100.00% | 0.000000 | 0.000015 | 0.000023 | 0.000037 | 0.000051 | 100.00% | 0.000067 | 99.995% +/- 0.002% | 702.50 | 400.97 |
| q8_0 | 4352.00 | 53.1% | 0.000482 | 99.95% | 0.000007 | 0.000983 | 0.001508 | 0.004061 | 0.014951 | 98.52% | 0.478175 | 98.950% +/- 0.028% | 708.31 | 397.81 |
| q8_0-q5_1 | 3712.00 | 45.3% | 0.000651 | 99.93% | 0.000012 | 0.001335 | 0.002010 | 0.005161 | 0.018918 | 98.13% | 1.270212 | 98.779% +/- 0.030% | 694.87 | 405.20 |
| q8_0-q5_0 | 3584.00 | 43.8% | 0.000703 | 99.93% | 0.000013 | 0.001433 | 0.002141 | 0.005523 | 0.020360 | 97.99% | 1.166986 | 98.757% +/- 0.031% | 702.31 | 400.84 |
| q8_0-q4_1 | 3456.00 | 42.2% | 0.001149 | 99.89% | 0.000012 | 0.002453 | 0.003582 | 0.008568 | 0.029970 | 97.05% | 0.964733 | 98.407% +/- 0.035% | 637.52 | 440.27 |
| q8_0-q4_0 | 3328.00 | 40.6% | 0.001295 | 99.87% | 0.000016 | 0.002765 | 0.003998 | 0.009587 | 0.035741 | 96.49% | 1.614931 | 98.304% +/- 0.036% | 706.17 | 398.89 |
| q5_1 | 3072.00 | 37.5% | 0.000827 | 99.92% | 0.000008 | 0.001792 | 0.002687 | 0.006764 | 0.023291 | 97.70% | 0.496846 | 98.603% +/- 0.032% | 702.81 | 400.71 |
| q8_0-turbo3_tcq | 3008.00 | 36.7% | 0.003167 | 99.68% | 0.000029 | 0.006791 | 0.009860 | 0.022935 | 0.081350 | 92.19% | 2.329764 | 97.407% +/- 0.044% | 672.90 | 417.21 |
| q5_0 | 2816.00 | 34.4% | 0.000926 | 99.91% | 0.000007 | 0.002037 | 0.003067 | 0.007468 | 0.027410 | 97.30% | 0.427949 | 98.543% +/- 0.033% | 704.01 | 400.14 |
| q5_1-q4_1 | 2816.00 | 34.4% | 0.001335 | 99.87% | 0.000013 | 0.002884 | 0.004221 | 0.010047 | 0.035062 | 96.56% | 0.850439 | 98.334% +/- 0.035% | 703.75 | 400.05 |
| q5_0-q4_1 | 2688.00 | 32.8% | 0.001387 | 99.86% | 0.000013 | 0.003017 | 0.004437 | 0.010411 | 0.035706 | 96.50% | 0.971476 | 98.243% +/- 0.036% | 702.90 | 400.58 |
| q5_1-q4_0 | 2688.00 | 32.8% | 0.001485 | 99.85% | 0.000014 | 0.003200 | 0.004746 | 0.011342 | 0.040530 | 96.03% | 0.976738 | 98.200% +/- 0.037% | 702.58 | 400.94 |
| q4_1 | 2560.00 | 31.3% | 0.001933 | 99.81% | 0.000013 | 0.004318 | 0.006595 | 0.016046 | 0.050918 | 95.04% | 0.435837 | 97.961% +/- 0.039% | 709.38 | 397.11 |
| q5_0-q4_0 | 2560.00 | 31.3% | 0.001529 | 99.85% | 0.000016 | 0.003316 | 0.004868 | 0.011640 | 0.039033 | 96.18% | 1.116606 | 98.211% +/- 0.037% | 704.22 | 399.88 |
| q5_1-turbo3_tcq | 2368.00 | 28.9% | 0.003360 | 99.66% | 0.000029 | 0.007229 | 0.010496 | 0.025104 | 0.089474 | 91.45% | 2.237170 | 97.375% +/- 0.044% | 670.63 | 418.53 |
| q4_0 | 2304.00 | 28.1% | 0.002259 | 99.77% | 0.000017 | 0.005058 | 0.007697 | 0.018505 | 0.058301 | 94.34% | 1.074671 | 97.793% +/- 0.041% | 710.51 | 396.57 |
| q5_0-turbo3_tcq | 2240.00 | 27.3% | 0.003391 | 99.66% | 0.000030 | 0.007321 | 0.010567 | 0.024422 | 0.090901 | 91.32% | 2.252987 | 97.384% +/- 0.044% | 670.54 | 418.68 |
| q5_0-turbo3 | 2208.00 | 27.0% | 0.004728 | 99.53% | 0.000035 | 0.010375 | 0.014767 | 0.034198 | 0.121340 | 88.58% | 1.809964 | 96.732% +/- 0.049% | 693.49 | 405.71 |
| q4_1-turbo3_tcq | 2112.00 | 25.8% | 0.003981 | 99.60% | 0.000034 | 0.008612 | 0.012701 | 0.030111 | 0.112812 | 89.34% | 2.193686 | 97.182% +/- 0.046% | 672.84 | 417.30 |
| turbo4 | 2112.00 | 25.8% | 0.002605 | 99.74% | 0.000024 | 0.005701 | 0.008653 | 0.020367 | 0.071902 | 93.07% | 1.179263 | 97.639% +/- 0.042% | 519.76 | 531.96 |
| q4_0-turbo3_tcq | 1984.00 | 24.2% | 0.004131 | 99.59% | 0.000032 | 0.009062 | 0.013401 | 0.031435 | 0.112297 | 89.38% | 2.139016 | 97.078% +/- 0.047% | 671.63 | 418.13 |
| q4_0-turbo3 | 1952.00 | 23.8% | 0.005488 | 99.45% | 0.000040 | 0.012073 | 0.017644 | 0.040047 | 0.145158 | 86.49% | 1.545795 | 96.549% +/- 0.050% | 695.78 | 404.37 |
| q4_0-turbo2_tcq | 1728.00 | 21.1% | 0.013329 | 98.68% | 0.000090 | 0.029805 | 0.042793 | 0.096811 | 0.306630 | 73.60% | 6.779726 | 94.713% +/- 0.062% | 678.10 | 414.30 |
| turbo3_tcq | 1664.00 | 20.3% | 0.005708 | 99.43% | 0.000045 | 0.012748 | 0.019441 | 0.045320 | 0.150144 | 86.06% | 2.079510 | 96.591% +/- 0.050% | 647.62 | 432.43 |
| turbo3 | 1600.00 | 19.5% | 0.008334 | 99.17% | 0.000057 | 0.019023 | 0.028596 | 0.066157 | 0.207468 | 81.27% | 2.454834 | 95.861% +/- 0.055% | 691.41 | 406.63 |
| turbo3_tcq-turbo2_tcq | 1408.00 | 17.2% | 0.014344 | 98.58% | 0.000086 | 0.032243 | 0.046737 | 0.105269 | 0.343951 | 70.90% | 4.010866 | 94.530% +/- 0.063% | 648.06 | 432.26 |
| turbo3-turbo2 | 1344.00 | 16.4% | 0.020468 | 97.97% | 0.000154 | 0.045514 | 0.065976 | 0.147745 | 0.474415 | 62.23% | 11.938387 | 93.540% +/- 0.068% | 686.57 | 409.17 |
| turbo2_tcq | 1152.00 | 14.1% | 0.019857 | 98.03% | 0.000122 | 0.045491 | 0.068398 | 0.158766 | 0.453761 | 63.53% | 4.370085 | 93.807% +/- 0.067% | 656.84 | 426.67 |
| turbo2 | 1088.00 | 13.3% | 0.032631 | 96.79% | 0.000203 | 0.073833 | 0.111765 | 0.261607 | 0.838113 | 43.25% | 4.735642 | 92.033% +/- 0.075% | 698.21 | 402.96 |

12. q6_0 Follow-Up Benchmarks

These runs used pre-release v0.3.0 of BeeLlama, which added the q6_0 cache mode. New modes tested: symmetric q6_0; asymmetric q6_0 K paired with q5_1, q5_0, q4_1, q4_0, turbo4, and turbo3_tcq V; plus q8_0-q6_0, q8_0-turbo4, and q5_0-turbo4. The KL tables also include re-runs of the existing q8_0, q5_0, q4_0, and turbo3_tcq rows as a control group, to verify stability after the build change. All control results matched within noise.
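One way to read "matched within noise" is to check that each re-run's mean KLD differs from the original by no more than the two reported +/- values combined. The sketch below applies that reading to the IQ4_XS 128k control rows from §11.2 and §12.2. The acceptance rule and the function names are assumptions made for illustration, not necessarily the criterion used for these benchmarks.

```python
# Mean KLD and its +/- for the IQ4_XS 128k control rows:
# (original run from §11.2, re-run from §12.2).
CONTROLS = {
    "q8_0":       ((0.000482, 0.000007), (0.000478, 0.000006)),
    "q5_0":       ((0.000926, 0.000007), (0.000928, 0.000009)),
    "q4_0":       ((0.002259, 0.000017), (0.002264, 0.000015)),
    "turbo3_tcq": ((0.005708, 0.000045), (0.005708, 0.000045)),
}


def within_noise(original, rerun):
    """Assumed rule: the two means differ by no more than the summed +/- values."""
    (mean_a, err_a), (mean_b, err_b) = original, rerun
    return abs(mean_a - mean_b) <= err_a + err_b


def check_controls(controls):
    """Return {cache_name: bool} for every control pair."""
    return {name: within_noise(a, b) for name, (a, b) in controls.items()}


if __name__ == "__main__":
    for name, ok in check_controls(CONTROLS).items():
        print(f"{name:<11} {'stable' if ok else 'drifted'}")
```

Under this rule all four control pairs count as stable. A stricter rule, such as comparing the tail columns as well, would need its own noise estimate.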

12.1 Perplexity

Q5_K_S 64k

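| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q6_0 | 1664.00 | 40.6% | 5.4778 | 100.04% | 0.03465 | 97.890% +/- 0.040% | 852.96 | 336.74 |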

IQ4_XS 64k

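| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q6_0 | 1664.00 | 40.6% | 5.5171 | 100.00% | 0.03500 | 98.878% +/- 0.029% | 922.50 | 311.45 |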

IQ4_XS 128k

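| Cache | KV cache (MiB) | Size vs bf16 | Median PPL | Precision vs bf16 | PPL +/- | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q6_0 | 3328.00 | 40.6% | 5.2729 | 99.99% | 0.03272 | 98.855% +/- 0.029% | 720.53 | 390.94 |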

12.2 KL Divergence

Same format as §11.2. Sorted by KV cache size descending. Includes re-run control rows for staple symmetric pairs (q8_0, q5_0, q4_0, turbo3_tcq) alongside the new entries.

Q5_K_S 64k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q8_0 | 2176.00 | 53.1% | 0.002358 | 99.80% | 0.000139 | 0.004179 | 0.006550 | 0.018569 | 0.078548 | 94.62% | 13.849484 | 97.935% +/- 0.039% | 849.29 | 338.22 |
| q8_0-q6_0 | 1920.00 | 46.9% | 0.002499 | 99.79% | 0.000184 | 0.004295 | 0.006708 | 0.019381 | 0.081616 | 94.33% | 17.816196 | 97.942% +/- 0.039% | 848.78 | 338.40 |
| q6_0 | 1664.00 | 40.6% | 0.002614 | 99.78% | 0.000180 | 0.004426 | 0.006949 | 0.020078 | 0.090800 | 93.47% | 14.112586 | 97.890% +/- 0.040% | 845.96 | 339.52 |
| q8_0-turbo4 | 1616.00 | 39.5% | 0.003561 | 99.68% | 0.000215 | 0.006518 | 0.009834 | 0.026426 | 0.103041 | 92.33% | 23.102724 | 97.460% +/- 0.043% | 838.90 | 342.38 |
| q6_0-q5_1 | 1600.00 | 39.1% | 0.002781 | 99.76% | 0.000228 | 0.004682 | 0.007348 | 0.020998 | 0.090447 | 93.50% | 23.770491 | 97.913% +/- 0.039% | 846.24 | 339.41 |
| q6_0-q5_0 | 1536.00 | 37.5% | 0.002820 | 99.76% | 0.000209 | 0.004748 | 0.007457 | 0.021883 | 0.092682 | 93.29% | 23.186867 | 97.788% +/- 0.041% | 846.86 | 339.16 |
| q6_0-q4_1 | 1472.00 | 35.9% | 0.003312 | 99.71% | 0.000232 | 0.005755 | 0.008847 | 0.024387 | 0.104582 | 92.19% | 23.244659 | 97.605% +/- 0.042% | 848.42 | 338.54 |
| q5_0 | 1408.00 | 34.4% | 0.002826 | 99.76% | 0.000139 | 0.005238 | 0.008182 | 0.022704 | 0.094147 | 93.16% | 14.572817 | 97.717% +/- 0.041% | 846.21 | 339.45 |
| q6_0-q4_0 | 1408.00 | 34.4% | 0.003288 | 99.71% | 0.000129 | 0.006096 | 0.009294 | 0.025456 | 0.111566 | 91.55% | 10.711100 | 97.524% +/- 0.043% | 848.24 | 338.61 |
| q6_0-turbo4 | 1360.00 | 33.2% | 0.003748 | 99.66% | 0.000224 | 0.006642 | 0.009997 | 0.026902 | 0.107377 | 91.93% | 16.445103 | 97.465% +/- 0.043% | 837.77 | 342.84 |
| q6_0-turbo3_tcq | 1248.00 | 30.5% | 0.005379 | 99.50% | 0.000247 | 0.009906 | 0.014556 | 0.037285 | 0.154680 | 87.68% | 19.739548 | 96.922% +/- 0.048% | 819.23 | 350.60 |
| q5_0-turbo4 | 1232.00 | 30.1% | 0.003812 | 99.66% | 0.000176 | 0.007068 | 0.010735 | 0.028203 | 0.112249 | 91.49% | 17.032024 | 97.371% +/- 0.044% | 837.52 | 342.95 |
| q4_0 | 1152.00 | 28.1% | 0.004717 | 99.57% | 0.000257 | 0.008445 | 0.012921 | 0.034304 | 0.141288 | 88.87% | 17.415379 | 97.117% +/- 0.046% | 850.09 | 337.90 |
| turbo3_tcq | 832.00 | 20.3% | 0.007978 | 99.24% | 0.000267 | 0.015663 | 0.023628 | 0.058286 | 0.227104 | 81.56% | 20.517471 | 96.265% +/- 0.052% | 795.23 | 361.21 |

IQ4_XS 64k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q8_0 | 2176.00 | 53.1% | 0.000685 | 99.94% | 0.000111 | 0.000938 | 0.001436 | 0.004074 | 0.016983 | 98.73% | 10.878341 | 98.975% +/- 0.028% | 909.48 | 315.96 |
| q8_0-q6_0 | 1920.00 | 46.9% | 0.000659 | 99.94% | 0.000093 | 0.001032 | 0.001578 | 0.004419 | 0.018670 | 98.56% | 10.625672 | 98.906% +/- 0.029% | 908.70 | 316.18 |
| q6_0 | 1664.00 | 40.6% | 0.000766 | 99.93% | 0.000109 | 0.001179 | 0.001791 | 0.004762 | 0.020407 | 98.39% | 11.368995 | 98.878% +/- 0.029% | 906.47 | 316.96 |
| q8_0-turbo4 | 1616.00 | 39.5% | 0.001845 | 99.83% | 0.000108 | 0.003311 | 0.004787 | 0.011551 | 0.046124 | 95.89% | 8.488309 | 98.147% +/- 0.037% | 898.63 | 319.72 |
| q6_0-q5_1 | 1600.00 | 39.1% | 0.000882 | 99.92% | 0.000099 | 0.001431 | 0.002169 | 0.005746 | 0.021968 | 98.23% | 10.728148 | 98.772% +/- 0.030% | 906.67 | 316.89 |
| q6_0-q5_0 | 1536.00 | 37.5% | 0.000933 | 99.92% | 0.000103 | 0.001519 | 0.002269 | 0.006044 | 0.023588 | 98.08% | 10.475766 | 98.666% +/- 0.032% | 906.68 | 316.89 |
| q6_0-q4_1 | 1472.00 | 35.9% | 0.001488 | 99.87% | 0.000115 | 0.002593 | 0.003830 | 0.009763 | 0.037581 | 96.71% | 10.889835 | 98.378% +/- 0.035% | 908.06 | 316.40 |
| q5_0 | 1408.00 | 34.4% | 0.001284 | 99.88% | 0.000131 | 0.002039 | 0.003125 | 0.008273 | 0.032387 | 97.22% | 12.836069 | 98.526% +/- 0.033% | 906.42 | 317.02 |
| q6_0-q4_0 | 1408.00 | 34.4% | 0.001555 | 99.85% | 0.000113 | 0.002893 | 0.004200 | 0.010392 | 0.039601 | 96.52% | 11.934310 | 98.279% +/- 0.036% | 909.10 | 316.04 |
| q6_0-turbo4 | 1360.00 | 33.2% | 0.001933 | 99.82% | 0.000118 | 0.003445 | 0.005001 | 0.012021 | 0.044722 | 96.02% | 8.842097 | 98.076% +/- 0.038% | 896.11 | 320.62 |
| q6_0-turbo3_tcq | 1248.00 | 30.5% | 0.003412 | 99.67% | 0.000121 | 0.006702 | 0.009642 | 0.022886 | 0.089874 | 91.78% | 9.695003 | 97.394% +/- 0.044% | 874.99 | 328.36 |
| q5_0-turbo4 | 1232.00 | 30.1% | 0.002122 | 99.80% | 0.000131 | 0.003913 | 0.005769 | 0.013995 | 0.052315 | 95.30% | 11.286289 | 97.977% +/- 0.039% | 895.90 | 320.70 |
| q4_0 | 1152.00 | 28.1% | 0.002748 | 99.74% | 0.000141 | 0.005244 | 0.008011 | 0.019910 | 0.069373 | 93.69% | 12.890539 | 97.777% +/- 0.041% | 911.14 | 315.38 |
| turbo3_tcq | 832.00 | 20.3% | 0.006038 | 99.41% | 0.000127 | 0.012714 | 0.019058 | 0.046219 | 0.172480 | 84.51% | 9.415985 | 96.569% +/- 0.050% | 847.47 | 339.08 |

IQ4_XS 128k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q8_0 | 4352.00 | 53.1% | 0.000478 | 99.95% | 0.000006 | 0.000978 | 0.001493 | 0.003968 | 0.015277 | 98.49% | 0.479234 | 98.959% +/- 0.028% | 705.23 | 399.42 |
| q8_0-q6_0 | 3840.00 | 46.9% | 0.000530 | 99.95% | 0.000006 | 0.001093 | 0.001652 | 0.004335 | 0.017408 | 98.28% | 0.468998 | 98.900% +/- 0.029% | 703.96 | 400.14 |
| q6_0 | 3328.00 | 40.6% | 0.000589 | 99.94% | 0.000008 | 0.001212 | 0.001852 | 0.004706 | 0.019175 | 98.11% | 0.573200 | 98.855% +/- 0.029% | 701.36 | 401.62 |
| q8_0-turbo4 | 3232.00 | 39.5% | 0.001554 | 99.84% | 0.000014 | 0.003335 | 0.004810 | 0.011473 | 0.041006 | 95.99% | 0.633252 | 98.171% +/- 0.037% | 694.04 | 405.86 |
| q6_0-q5_1 | 3200.00 | 39.1% | 0.000703 | 99.93% | 0.000011 | 0.001452 | 0.002180 | 0.005544 | 0.021821 | 97.85% | 1.145905 | 98.740% +/- 0.031% | 701.05 | 401.80 |
| q6_0-q5_0 | 3072.00 | 37.5% | 0.000752 | 99.92% | 0.000013 | 0.001552 | 0.002311 | 0.005872 | 0.022227 | 97.81% | 1.083445 | 98.700% +/- 0.031% | 701.69 | 401.43 |
| q6_0-q4_1 | 2944.00 | 35.9% | 0.001191 | 99.88% | 0.000009 | 0.002583 | 0.003791 | 0.008875 | 0.031032 | 96.95% | 0.431630 | 98.412% +/- 0.035% | 703.37 | 400.48 |
| q5_0 | 2816.00 | 34.4% | 0.000928 | 99.91% | 0.000009 | 0.002009 | 0.003051 | 0.007595 | 0.026940 | 97.35% | 0.495043 | 98.533% +/- 0.033% | 700.96 | 401.85 |
| q6_0-q4_0 | 2816.00 | 34.4% | 0.001317 | 99.87% | 0.000010 | 0.002875 | 0.004183 | 0.009784 | 0.031863 | 96.87% | 0.504899 | 98.275% +/- 0.036% | 704.62 | 399.76 |
| q6_0-turbo4 | 2720.00 | 33.2% | 0.001590 | 99.84% | 0.000016 | 0.003435 | 0.004979 | 0.011859 | 0.039001 | 96.18% | 1.399161 | 98.063% +/- 0.038% | 691.93 | 407.09 |
| q6_0-turbo3_tcq | 2496.00 | 30.5% | 0.003238 | 99.68% | 0.000031 | 0.006922 | 0.009933 | 0.023271 | 0.087341 | 91.64% | 2.346832 | 97.388% +/- 0.044% | 673.13 | 418.46 |
| q5_0-turbo4 | 2464.00 | 30.1% | 0.001809 | 99.82% | 0.000023 | 0.003891 | 0.005742 | 0.013425 | 0.047328 | 95.38% | 2.025323 | 98.004% +/- 0.039% | 691.62 | 407.28 |
| q4_0 | 2304.00 | 28.1% | 0.002264 | 99.77% | 0.000015 | 0.005074 | 0.007743 | 0.018933 | 0.059927 | 94.19% | 0.455715 | 97.803% +/- 0.040% | 707.36 | 398.21 |
| turbo3_tcq | 1664.00 | 20.3% | 0.005708 | 99.43% | 0.000045 | 0.012748 | 0.019441 | 0.045320 | 0.150144 | 86.06% | 2.079510 | 96.591% +/- 0.050% | 648.81 | 434.15 |

13. q6_0 Follow-Up Analysis

The q6_0 follow-up does not change the shape of the KV-cache ladder. It changes two rungs: q8_0 / q6_0 becomes the better high-end preset, and q6_0 / q5_0 gets added as an optional step between normal q5 and q8 K. Everything else in the q6 sweep is mostly useful as a guardrail: it shows where the extra K bits stop helping, where V gets too weak, and why turbo4 still stays out of the normal recommendations.

13.1 The q6 Slot Is q6 K + q5 V

Symmetric q6_0 is a useful measurement row, but it is not the row the ladder wants. On Q5_K_S 64k, it scores 93.47% at the 99.9% tail, only 0.31 points above the same-run q5_0 control at 93.16%. It costs 40.6% of bf16 KV, so the extra size does not buy much on that run.

The better trade is q6_0 / q5_0. It keeps most of the q6 K benefit, drops the footprint to 37.5%, and is the first q6_0 row that makes sense as an actual preset. But on Q5_K_S the gain over q5_0 / q5_0 is almost a tie: 93.29% vs 93.16% at the 99.9% tail, with mean KLD also nearly tied, 0.002820 vs 0.002826. From this data the benefits are not obvious.

The reason to keep it is the lower-weight-precision run. On IQ4_XS, q6_0 / q5_0 opens a clearer gap over q5_0 / q5_0: 98.08% vs 97.22% at 64k, and 97.81% vs 97.35% at 128k. This does not prove a full mechanism for why the gap changes, but it is enough for a recommendation: if the weights are already compressed and you want more KV headroom than q5 gives, q6 K is the next useful spend.

| K / V | Size vs bf16 | Q5_K_S 64k 99.9% | IQ4_XS 64k 99.9% | IQ4_XS 128k 99.9% | Read |
| --- | --- | --- | --- | --- | --- |
| q8_0 / q8_0 | 53.1% | 94.62% | 98.73% | 98.49% | Fidelity/validation tier |
| q8_0 / q6_0 | 46.9% | 94.33% | 98.56% | 98.28% | New high-end recommendation |
| q6_0 / q6_0 | 40.6% | 93.47% | 98.39% | 98.11% | Measurement row, not worth a ladder slot |
| q6_0 / q5_0 | 37.5% | 93.29% | 98.08% | 97.81% | New optional headroom preset |
| q5_0 / q5_0 | 34.4% | 93.16% | 97.22% | 97.35% | Still the normal quality preset |

13.2 The q5 V Range Is Cheap, But Not Free

The easiest trap in this follow-up is the Q5_K_S tail wobble around q5 V. Symmetric q6_0 scores 93.47%, q6_0 / q5_1 scores 93.50%, and q6_0 / q5_0 scores 93.29%. The 0.03-point "gain" from dropping V to q5_1 is not a result to build a recommendation around. Mean KLD moves the other way: 0.002614 for symmetric q6_0, 0.002781 with q5_1 V, and 0.002820 with q5_0 V.

IQ4_XS makes it less ambiguous. At 64k, the 99.9% tail goes 98.39% → 98.23% → 98.08% as V moves from q6_0 to q5_1 to q5_0. At 128k, it goes 98.11% → 97.85% → 97.81%. So the q5 V range is still a controlled loss. It just happens to be small enough that the Q5_K_S tail column can wobble by more than the real difference between adjacent q5 V tiers.

The useful boundary is below q5. q6_0 / q4_1 drops to 92.19% on Q5_K_S, 1.10 points below q6_0 / q5_0. On IQ4_XS, the same step costs 1.37 points at 64k and 0.86 points at 128k. q6_0 / q4_0 is even easier to reject: it has the same 34.4% footprint as q5_0 / q5_0, but loses on all three follow-up configs.

| K / V | Size vs bf16 | Q5_K_S 64k 99.9% | IQ4_XS 64k 99.9% | IQ4_XS 128k 99.9% | Decision |
| --- | --- | --- | --- | --- | --- |
| q6_0 / q5_1 | 39.1% | 93.50% | 98.23% | 97.85% | Too close to q6_0 / q6_0 for the size |
| q6_0 / q5_0 | 37.5% | 93.29% | 98.08% | 97.81% | Keep: best q6 K trade-off |
| q6_0 / q4_1 | 35.9% | 92.19% | 96.71% | 96.95% | Reject: crosses the V cliff |
| q6_0 / q4_0 | 34.4% | 91.55% | 96.52% | 96.87% | Reject: same size as q5_0 / q5_0, worse result |
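The "same size, worse result" style of rejection can be checked mechanically as a dominance test: a row is dominated when some other row is no larger and at least as good on all three 99.9% columns. A minimal sketch over the rows quoted in this section follows. The dominance rule and the function names are illustrative assumptions, not the procedure used to build the ladder.

```python
# (size % of bf16, Q5_K_S 64k, IQ4_XS 64k, IQ4_XS 128k) 99.9% precision,
# copied from the tables in this section.
ROWS = {
    "q6_0/q6_0": (40.6, 93.47, 98.39, 98.11),
    "q6_0/q5_1": (39.1, 93.50, 98.23, 97.85),
    "q6_0/q5_0": (37.5, 93.29, 98.08, 97.81),
    "q6_0/q4_1": (35.9, 92.19, 96.71, 96.95),
    "q6_0/q4_0": (34.4, 91.55, 96.52, 96.87),
    "q5_0/q5_0": (34.4, 93.16, 97.22, 97.35),
}


def dominates(a, b):
    """True if row a is no larger than b, no worse on every tail, and not identical."""
    size_a, *tails_a = a
    size_b, *tails_b = b
    no_worse = size_a <= size_b and all(x >= y for x, y in zip(tails_a, tails_b))
    return no_worse and a != b


def dominated_rows(rows):
    """Map each dominated row name to the list of rows that dominate it."""
    out = {}
    for name_b, b in rows.items():
        winners = [name_a for name_a, a in rows.items() if dominates(a, b)]
        if winners:
            out[name_b] = winners
    return out


if __name__ == "__main__":
    for loser, winners in dominated_rows(ROWS).items():
        print(f"{loser} is dominated by {', '.join(winners)}")
```

On these rows the check flags q6_0 / q4_1 and q6_0 / q4_0 as dominated by q5_0 / q5_0. q6_0 / q5_1 is not dominated, so rejecting it remains a judgment about how little it saves over symmetric q6_0 rather than a mechanical result.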

13.3 turbo4 Still Loses the Tie-Breaker

turbo4 is still not a normal V tier. The confusing row is q6_0 / turbo4: on Q5_K_S, it scores 91.93% at the 99.9% tail, slightly ahead of q6_0 / q4_0 at 91.55%. If that were the only column, it would look like turbo4 V had beaten q4_0 V.

The rest of the row rejects that reading. Mean KLD is worse with turbo4, 0.003748 vs 0.003288, and the IQ4_XS runs move in the same direction at the tail: 96.02% vs 96.52% at 64k, then 96.18% vs 96.87% at 128k. This is the kind of case where the 99.9% tail is still the right warning metric, but mean KLD stops one lucky tail from becoming a recommendation.

The q8 and q5 turbo4 pairs tell the same story. q8_0 / turbo4 gets a tiny Q5_K_S tail edge over the q8_0 / q4_0 row from the older table, 92.33% vs 92.18%, but loses on IQ4_XS at both context lengths: 95.89% vs 96.55% at 64k, and 95.99% vs 96.49% at 128k. q5_0 / turbo4 is smaller than q5_0 / q4_0, but the q4 V row is the better recommendation whenever that extra 1.2 points of bf16 KV fits.

turbo3_tcq V is below the precision floor for these mixed presets. At q6_0 K, q6_0 / turbo3_tcq lands at 87.68% on Q5_K_S, below the same-run symmetric q4_0 control at 88.87%. That does not remove symmetric turbo3_tcq from the extreme-compression ladder, but it does mean turbo V should not be mixed into ordinary q5/q6/q8 precision rows.

13.4 Updated Preset Ladder

The preset ladder only needs two edits. At the top, q8_0 / q6_0 replaces q8_0 / q5_1. The Q5_K_S gain is small, 94.33% vs 94.21% at the 99.9% tail, but the size cost is also small: 46.9% vs 45.3% of bf16 KV. At this tier, paying 1.6 points of KV size for the stronger V side is cleaner than keeping q5_1 as the main fidelity preset.

In the middle, q6_0 / q5_0 is added as an optional headroom row. It is not a new default. If q5_0 / q5_0 already gives enough quality, stay there. Use q6_0 / q5_0 when you want a stronger K side and cannot justify the jump to q8_0 K.

| K / V | % of bf16 size | 99.9% precision | What it is for |
| --- | --- | --- | --- |
| bf16 / bf16 | 100.0 | 100.00% | Preserving full quality |
| q8_0 / q8_0 | 53.1 | 94.62% | Validation and blame-isolation mode |
| q8_0 / q6_0 | 46.9 | 94.33% | Recommended high-end preset |
| q8_0 / q5_1 | 45.3 | 94.21% | Fallback if q6_0 V is unavailable |
| q8_0 / q5_0 | 43.8 | 93.69% | If the high-end rows miss the fit by a narrow margin |
| q6_0 / q5_0 | 37.5 | 93.29% | Optional headroom tier between q5 and q8 K |
| q5_0 / q5_0 | 34.4 | 93.16% | Normal quality preset |
| q5_0 / q4_1 | 32.8 | 92.65% | Best default if VRAM-constrained |
| q5_0 / q4_0 | 31.3 | 91.39% | If q5_0 / q4_1 misses the fit by a narrow margin |
| q4_0 / q4_0 | 28.1 | 88.87% | Memory saving with visible precision loss |
| q4_0 / turbo3_tcq | 24.2 | 84.93% | Smaller than q4, cleaner than symmetric turbo3_tcq |
| turbo3_tcq / turbo3_tcq | 20.3 | 81.56% | Viable extreme-compression mode |
| turbo2_tcq / turbo2_tcq | 14.1 | 54.38% | Last resort: not for code, JSON, math, or tool calls |

The rejected q6 rows are not close calls. q6_0 / q5_1 saves too little compared with symmetric q6 and is larger than the useful q6_0 / q5_0 row. q6_0 / q4_1 crosses the V cliff. q6_0 / q4_0 has the same size as q5_0 / q5_0 and loses to it. q6_0 / turbo4 has one attractive Q5_K_S tail ordering, but mean KLD and both IQ4_XS runs reject it. q6_0 / turbo3_tcq belongs below the precision floor, not in the normal ladder.

The preset rule is simple: use q8_0 / q6_0 when you want the high-end cache tier, use q6_0 / q5_0 only when q5_0 / q5_0 feels too tight, and do not chase sub-q5 V just because K is stronger. The q6 follow-up adds one useful middle rung, not a new philosophy for the whole ladder.


14. bf16-K Follow-Up Benchmarks

This follow-up keeps K at bf16 and walks V through the same cache tiers used elsewhere in the article. The point is narrow: these rows isolate how much damage comes from V quantization when the K side is left at the reference format.

Same format as §11.2. KL precision and 99.9% precision are measured against the bf16 / bf16 baseline for each weight/context configuration. Rows are sorted by total KV cache size descending.

Q5_K_S 64k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bf16 | 4096.00 | 100.0% | 0.000375 | 100.00% | 0.000058 | 0.000568 | 0.001693 | 0.005234 | 0.023258 | 100.00% | 7.374046 | 99.647% +/- 0.016% | 839.21 | 342.35 |
| bf16-q8_0 | 3136.00 | 76.6% | 0.002475 | 99.79% | 0.000171 | 0.004209 | 0.006629 | 0.019275 | 0.079827 | 94.50% | 14.729765 | 97.991% +/- 0.039% | 850.62 | 337.75 |
| bf16-q6_0 | 2880.00 | 70.3% | 0.002393 | 99.80% | 0.000101 | 0.004302 | 0.006788 | 0.019416 | 0.078853 | 94.59% | 6.564770 | 97.961% +/- 0.039% | 848.99 | 338.40 |
| bf16-q5_1 | 2816.00 | 68.8% | 0.002440 | 99.79% | 0.000090 | 0.004567 | 0.007171 | 0.020056 | 0.096805 | 92.91% | 9.260141 | 97.903% +/- 0.040% | 848.20 | 338.72 |
| bf16-q5_0 | 2752.00 | 67.2% | 0.002491 | 99.79% | 0.000128 | 0.004599 | 0.007110 | 0.020664 | 0.089753 | 93.57% | 13.079553 | 97.829% +/- 0.040% | 848.53 | 338.58 |
| bf16-q4_1 | 2688.00 | 65.6% | 0.003221 | 99.72% | 0.000208 | 0.005661 | 0.008682 | 0.023404 | 0.105923 | 92.07% | 20.044716 | 97.637% +/- 0.042% | 853.30 | 336.69 |
| bf16-q4_0 | 2624.00 | 64.1% | 0.003339 | 99.70% | 0.000187 | 0.005952 | 0.009136 | 0.025430 | 0.110166 | 91.68% | 19.452881 | 97.594% +/- 0.042% | 854.17 | 336.35 |
| bf16-turbo4 | 2576.00 | 62.9% | 0.003549 | 99.68% | 0.000163 | 0.006532 | 0.009849 | 0.026539 | 0.101054 | 92.52% | 11.934117 | 97.473% +/- 0.043% | 843.17 | 340.74 |
| bf16-turbo3_tcq | 2464.00 | 60.2% | 0.005377 | 99.50% | 0.000281 | 0.009815 | 0.014509 | 0.036872 | 0.146175 | 88.43% | 23.307222 | 96.962% +/- 0.047% | 823.83 | 348.74 |

IQ4_XS 64k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bf16 | 4096.00 | 100.0% | 0.000097 | 100.00% | 0.000020 | 0.000186 | 0.000398 | 0.001062 | 0.004152 | 100.00% | 2.345056 | 99.776% +/- 0.013% | 910.62 | 315.50 |
| bf16-q8_0 | 3136.00 | 76.6% | 0.000587 | 99.95% | 0.000092 | 0.000924 | 0.001408 | 0.004018 | 0.016148 | 98.81% | 11.615892 | 98.952% +/- 0.028% | 909.80 | 315.78 |
| bf16-q6_0 | 2880.00 | 70.3% | 0.000636 | 99.95% | 0.000079 | 0.001029 | 0.001556 | 0.004148 | 0.017005 | 98.72% | 8.440266 | 98.875% +/- 0.029% | 908.22 | 316.33 |
| bf16-q5_1 | 2816.00 | 68.8% | 0.000757 | 99.93% | 0.000074 | 0.001283 | 0.001921 | 0.005045 | 0.020206 | 98.41% | 8.462649 | 98.798% +/- 0.030% | 907.39 | 316.62 |
| bf16-q5_0 | 2752.00 | 67.2% | 0.000873 | 99.92% | 0.000101 | 0.001384 | 0.002088 | 0.005688 | 0.022394 | 98.19% | 10.605267 | 98.764% +/- 0.031% | 908.46 | 316.25 |
| bf16-q4_1 | 2688.00 | 65.6% | 0.001357 | 99.87% | 0.000095 | 0.002455 | 0.003607 | 0.008977 | 0.034741 | 96.99% | 10.624268 | 98.392% +/- 0.035% | 913.53 | 314.50 |
| bf16-q4_0 | 2624.00 | 64.1% | 0.001459 | 99.86% | 0.000073 | 0.002756 | 0.004021 | 0.010121 | 0.039676 | 96.51% | 6.619313 | 98.367% +/- 0.035% | 914.07 | 314.31 |
| bf16-turbo4 | 2576.00 | 62.9% | 0.001770 | 99.83% | 0.000097 | 0.003286 | 0.004741 | 0.011679 | 0.043127 | 96.18% | 8.782248 | 98.144% +/- 0.037% | 901.27 | 318.77 |
| bf16-turbo3_tcq | 2464.00 | 60.2% | 0.003256 | 99.68% | 0.000092 | 0.006574 | 0.009463 | 0.022278 | 0.084448 | 92.28% | 8.791702 | 97.398% +/- 0.044% | 879.05 | 326.83 |

IQ4_XS 128k

| Cache | KV cache (MiB) | Size vs bf16 | Mean KLD | Precision vs bf16 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | 99.9% precision vs bf16 | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bf16 | 8192.00 | 100.0% | 0.000000 | 100.00% | 0.000000 | 0.000015 | 0.000023 | 0.000037 | 0.000051 | 100.00% | 0.000067 | 99.995% +/- 0.002% | 705.78 | 399.10 |
| bf16-q8_0 | 6272.00 | 76.6% | 0.000479 | 99.95% | 0.000011 | 0.000971 | 0.001473 | 0.003985 | 0.014957 | 98.52% | 1.255583 | 98.953% +/- 0.028% | 705.73 | 399.13 |
| bf16-q6_0 | 5760.00 | 70.3% | 0.000528 | 99.95% | 0.000011 | 0.001080 | 0.001640 | 0.004261 | 0.016915 | 98.33% | 1.246464 | 98.920% +/- 0.029% | 703.90 | 400.17 |
| bf16-q5_1 | 5632.00 | 68.8% | 0.000643 | 99.94% | 0.000007 | 0.001318 | 0.001981 | 0.005192 | 0.019604 | 98.06% | 0.459653 | 98.803% +/- 0.030% | 703.75 | 400.25 |
| bf16-q5_0 | 5504.00 | 67.2% | 0.000680 | 99.93% | 0.000007 | 0.001410 | 0.002094 | 0.005333 | 0.020995 | 97.93% | 0.448673 | 98.758% +/- 0.031% | 703.91 | 400.16 |
| bf16-q4_1 | 5376.00 | 65.6% | 0.001145 | 99.89% | 0.000011 | 0.002439 | 0.003558 | 0.008389 | 0.030617 | 96.99% | 0.856124 | 98.425% +/- 0.034% | 709.30 | 397.13 |
| bf16-q4_0 | 5248.00 | 64.1% | 0.001295 | 99.87% | 0.000016 | 0.002773 | 0.004027 | 0.009434 | 0.034897 | 96.58% | 1.481670 | 98.347% +/- 0.035% | 710.30 | 396.57 |
| bf16-turbo4 | 5152.00 | 62.9% | 0.001533 | 99.85% | 0.000013 | 0.003304 | 0.004779 | 0.011221 | 0.041510 | 95.94% | 0.505207 | 98.119% +/- 0.038% | 697.20 | 404.02 |
| bf16-turbo3_tcq | 4928.00 | 60.2% | 0.003191 | 99.68% | 0.000027 | 0.006856 | 0.009858 | 0.023102 | 0.086196 | 91.75% | 1.071732 | 97.420% +/- 0.044% | 678.44 | 415.19 |

15. bf16-K Follow-Up Analysis

15.1 What The bf16-K Rows Test

The useful part of this follow-up is not a new preset ladder, but a V-side isolation check. These pairs keep K at the reference format and make V do all the compression work, so they answer a cleaner question: once attention scores have the strongest K side available in this benchmark, where does V start to become the limiting error?

That also makes these rows awkward as recommendations. A bf16 K side is expensive: at 64k it already takes 2048 MiB before V is counted. Even the smallest row in this set, bf16 / turbo3_tcq, still uses 60.2% of the full bf16 cache. So the entire point is to see which V tiers still matter after K has stopped being the bottleneck.
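The size arithmetic behind these rows is additive: K and V each take half of a symmetric cache, so a mixed row is the K half of one format plus the V half of another. A minimal sketch using the 64k symmetric sizes from the tables above follows; the helper names are made up for illustration.

```python
# Total K+V cache size in MiB for symmetric rows at 64k, from the tables above.
SYMMETRIC_MIB_64K = {
    "bf16": 4096.0,
    "q8_0": 2176.0,
    "q6_0": 1664.0,
    "q5_0": 1408.0,
    "q4_0": 1152.0,
    "turbo3_tcq": 832.0,
}


def mixed_cache_mib(k_type, v_type, table=SYMMETRIC_MIB_64K):
    """K half of one symmetric row plus V half of another."""
    return table[k_type] / 2 + table[v_type] / 2


def share_of_bf16(k_type, v_type, table=SYMMETRIC_MIB_64K):
    """Mixed cache size as a percentage of the symmetric bf16 cache."""
    return 100.0 * mixed_cache_mib(k_type, v_type, table) / table["bf16"]


if __name__ == "__main__":
    for k, v in [("bf16", "q8_0"), ("bf16", "turbo3_tcq"), ("q8_0", "q6_0")]:
        print(f"{k} / {v}: {mixed_cache_mib(k, v):.2f} MiB, "
              f"{share_of_bf16(k, v):.1f}% of bf16")
```

The results reproduce the table sizes: 3136 MiB for bf16 / q8_0, 2464 MiB for bf16 / turbo3_tcq, and 1920 MiB for q8_0 / q6_0. With K pinned at bf16, the K half alone is 2048 MiB, so no choice of V can bring the row below 50% of the full cache.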

15.2 q8 V Does Not Earn Full K

bf16 / q8_0 is the cleanest non-recommendation in the table. On Q5_K_S 64k, it scores 94.50% at the 99.9% tail, which is the same broad quality band as symmetric q8_0 / q8_0 at 94.62% and q8_0 / q6_0 at 94.33%. The size is the problem: bf16 / q8_0 uses 76.6% of the full cache, while q8_0 / q8_0 uses 53.1% and q8_0 / q6_0 uses 46.9%.

| K / V | Size vs bf16 | Q5_K_S 64k 99.9% | Read |
| --- | --- | --- | --- |
| bf16 / q8_0 | 76.6% | 94.50% | Diagnostic row, not a preset |
| q8_0 / q8_0 | 53.1% | 94.62% | Same tail band at much lower size |
| q8_0 / q6_0 | 46.9% | 94.33% | Recommended high-end preset from §13 |

The IQ4_XS rows do not rescue the trade. bf16 / q8_0 scores 98.81% at 64k and 98.52% at 128k, but it still costs 76.6% of the full cache. If q8 K is already enough to reach the high-end tail band, keeping K at 16 bits mostly buys comfort for the benchmarker, not a useful user-facing preset.

15.3 q6 V Is The High-End Ceiling

The q6 row shows where the V side stops needing more precision. On Q5_K_S, bf16 / q6_0 scores 94.59%, effectively tied with bf16 / q8_0 at 94.50%. On IQ4_XS, dropping V from q8 to q6 costs 0.09 points at 64k and 0.19 points at 128k. That is small enough to treat q6 V as the ceiling for the practical high-end row.

| K / V | Size vs bf16 | Q5_K_S 64k | IQ4_XS 64k | IQ4_XS 128k |
| --- | --- | --- | --- | --- |
| bf16 / q8_0 | 76.6% | 94.50% | 98.81% | 98.52% |
| bf16 / q6_0 | 70.3% | 94.59% | 98.72% | 98.33% |
| q8_0 / q6_0 | 46.9% | 94.33% | 98.56% | 98.28% |

The comparison that matters is bf16 / q6_0 against q8_0 / q6_0. The full-K row is 70.3% of bf16 KV, while the q8-K row is 46.9%. On Q5_K_S, that extra 23.4 points of cache size buys 0.26 points of tail precision. On IQ4_XS, it buys 0.16 points at 64k and 0.05 at 128k. That is a validation margin, not a preset margin.
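The same comparison can be written as a small calculation: subtract the cache shares, subtract the tail precisions, and set the two side by side. A sketch using the two q6 V rows above follows; the helper name is hypothetical.

```python
# (size % of bf16, Q5_K_S 64k, IQ4_XS 64k, IQ4_XS 128k) 99.9% precision.
FULL_K = (70.3, 94.59, 98.72, 98.33)   # bf16 / q6_0
Q8_K = (46.9, 94.33, 98.56, 98.28)     # q8_0 / q6_0
CONFIGS = ("Q5_K_S 64k", "IQ4_XS 64k", "IQ4_XS 128k")


def upgrade_cost(bigger, smaller):
    """Return (extra size points, {config: tail points gained}) for bigger vs smaller."""
    extra_size = round(bigger[0] - smaller[0], 2)
    gains = {
        cfg: round(b - s, 2)
        for cfg, b, s in zip(CONFIGS, bigger[1:], smaller[1:])
    }
    return extra_size, gains


if __name__ == "__main__":
    extra, gains = upgrade_cost(FULL_K, Q8_K)
    for cfg, gain in gains.items():
        print(f"{cfg}: +{gain:.2f} tail points for +{extra:.1f} size points")
```

Every configuration pays the same 23.4 size points, and the best return is 0.26 tail points on Q5_K_S.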

15.4 q5 V Is The Practical Middle

The q5 rows are where V quantization becomes visible without falling out of the normal quality range. On Q5_K_S, bf16 / q5_0 scores 93.57%, one point below bf16 / q6_0. On IQ4_XS, the same drop is smaller: 0.53 points at 64k (98.72% → 98.19%) and 0.40 points at 128k (98.33% → 97.93%). That repeats the earlier pattern from §9 and §13: lower-weight-precision runs are less sensitive to the same cache quants.

| K / V | Size vs bf16 | Q5_K_S 64k | IQ4_XS 64k | IQ4_XS 128k |
| --- | --- | --- | --- | --- |
| bf16 / q6_0 | 70.3% | 94.59% | 98.72% | 98.33% |
| bf16 / q5_0 | 67.2% | 93.57% | 98.19% | 97.93% |
| q6_0 / q5_0 | 37.5% | 93.29% | 98.08% | 97.81% |
| q5_0 / q5_0 | 34.4% | 93.16% | 97.22% | 97.35% |

The full-K q5 rows are still too large to recommend directly. bf16 / q5_0 uses 67.2% of the full cache, while q6_0 / q5_0 uses 37.5% and lands in the same practical band. The q5 V tier is useful, but with q6 or q5 K, not 16-bit K.

15.5 q4 V Is The Drop Full K Cannot Hide

q4 V shows the limit of the full-K diagnostic. On Q5_K_S, bf16 / q4_0 scores 91.68% and bf16 / q4_1 scores 92.07%. Those rows are larger than every normal q8/q6/q5 preset in the ladder, but they still sit below the q5 V range. Once V falls to q4, reference-format K can soften the loss but cannot make the row high end.

The IQ4_XS rows are less harsh but keep the same boundary. At 128k, bf16 / q5_0 is 97.93%, bf16 / q4_1 is 96.99%, and bf16 / q4_0 is 96.58%. That is not a collapse, but it is a clear step down at the same q5-to-q4 V boundary that showed up in the earlier asymmetric tables.

15.6 turbo V Stays Below The Normal Ladder

turbo4 looks better with bf16 K than it looked as a symmetric preset, but that is not enough to make it a normal recommendation. On Q5_K_S, bf16 / turbo4 scores 92.52%, above bf16 / q4_0 and bf16 / q4_1. The price is still 62.9% of the full cache, and the earlier sections already showed why turbo4 loses once other factors are included.

bf16 / turbo3_tcq is clearer. It scores 88.43% on Q5_K_S, 92.28% on IQ4_XS 64k, and 91.75% on IQ4_XS 128k. Full K helps compared with fully compressed turbo rows, but it does not turn turbo V into a q5/q6-quality tier. Turbo modes still belong in the extreme-compression part of the ladder, not in the ordinary precision presets.

15.7 What This Changes

This follow-up does not add any bf16 / V presets and stays exclusively diagnostic. Holding K at bf16 shows the V-side ceiling, but it also makes the rows too large to compete with the q8/q6/q5 ladder.

| Observation from bf16-K rows | Preset implication |
| --- | --- |
| bf16 / q8_0 is q8-like in quality but 76.6% of bf16 KV | Do not use it as a preset; q8_0 / q8_0 or q8_0 / q6_0 is cleaner |
| bf16 / q6_0 is basically tied with bf16 / q8_0 | Keep q8_0 / q6_0 as the high-end recommendation |
| bf16 / q5_0 lands close to the q6/q5 rows but at much larger size | Keep q6_0 / q5_0 as optional headroom and q5_0 / q5_0 as the normal quality preset |
| q4 and turbo V remain visibly lower even with reference-format K | Keep them below the normal ladder as memory-saving or extreme-compression options |

V has a ceiling around q6, q5 is the practical middle, and below q5 the K side cannot fully bail the row out. That strengthens the §13 ladder: spend the high-end budget on q8_0 / q6_0, not on bf16 / q8_0 or bf16 / q6_0.


16. bf16 vs f16 Follow-Up Benchmarks

This follow-up runs a narrow Q5_K_S 32k Wikitext path using an f32 KL baseline. The point is to answer the smaller question left open by the earlier bf16-baseline tables: which 16-bit KV format, f16 or bf16, stays closer to f32?

The PPL rows compare symmetric f32, f16, and bf16 cache. The KL rows compare those same symmetric rows plus two diagnostic f32-K rows, f32 / f16 and f32 / bf16, which isolate the V side while keeping K at the reference format. Sizes are for total K+V cache at 32k context.

16.1 Perplexity

Q5_K_S 32k

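| Cache | KV cache (MiB) | Size vs f32 | Median PPL | Precision vs f32 | PPL +/- | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f32 / f32 | 4096.00 | 100.0% | 6.2211 | 100.00% | 0.03986 | 99.997% +/- 0.001% | 690.79 | 415.90 |
| bf16 / bf16 | 2048.00 | 50.0% | 6.2147 | 100.10% | 0.03975 | 97.736% +/- 0.039% | 762.98 | 376.55 |
| f16 / f16 | 2048.00 | 50.0% | 6.2384 | 99.72% | 0.03999 | 97.710% +/- 0.039% | 754.04 | 381.02 |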

16.2 KL Divergence

Same idea as §11.2, but measured against an f32 baseline instead of a bf16 baseline. Since the reference row has mean KLD of zero, precision vs f32 is 100 * exp(-meanKLD). Rows are sorted by total KV cache size descending.

Q5_K_S 32k

| Cache | KV cache (MiB) | Size vs f32 | Mean KLD | Precision vs f32 | KLD +/- | 90% KLD | 95% KLD | 99% KLD | 99.9% KLD | Maximum KLD | Same top p | Tok/s | Elapsed (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f32 / f32 | 4096.00 | 100.0% | 0.000000 | 100.00% | 0.000000 | 0.000016 | 0.000023 | 0.000037 | 0.000051 | 0.000070 | 99.997% +/- 0.001% | 697.71 | 411.77 |
| f32 / bf16 | 3072.00 | 75.0% | 0.012111 | 98.80% | 0.000953 | 0.004531 | 0.007392 | 0.026860 | 1.544926 | 29.954960 | 97.827% +/- 0.038% | 726.38 | 395.52 |
| f32 / f16 | 3072.00 | 75.0% | 0.015022 | 98.51% | 0.001110 | 0.004961 | 0.008199 | 0.030432 | 2.224951 | 30.625841 | 97.735% +/- 0.039% | 737.03 | 389.81 |
| bf16 / bf16 | 2048.00 | 50.0% | 0.013964 | 98.61% | 0.001065 | 0.004516 | 0.007476 | 0.027545 | 1.798395 | 32.009716 | 97.736% +/- 0.039% | 755.24 | 380.41 |
| f16 / f16 | 2048.00 | 50.0% | 0.014504 | 98.56% | 0.001086 | 0.004949 | 0.008185 | 0.029429 | 2.314729 | 29.698303 | 97.710% +/- 0.039% | 766.66 | 374.74 |
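The precision column can be reproduced from the mean KLD column. The sketch below implements the 100 * exp(-meanKLD) rule stated above, plus an optional baseline offset. The offset form is an inference from how the bf16-baseline tables line up (checked here against the Q5_K_S 64k bf16 / q8_0 row in §14), not a definition given in this section, and the function name is hypothetical.

```python
import math


def precision_pct(kld, baseline_kld=0.0):
    """100 * exp(-(kld - baseline_kld)), rounded to two decimals like the tables.

    With the f32 reference row the baseline is zero, which gives the
    100 * exp(-meanKLD) rule from the text. A non-zero baseline is an
    assumed extension for the bf16-baseline tables.
    """
    return round(100.0 * math.exp(-(kld - baseline_kld)), 2)


# Mean KLD values from the Q5_K_S 32k table (f32 baseline).
F32_BASELINE_ROWS = {
    "f32 / bf16": 0.012111,
    "f32 / f16": 0.015022,
    "bf16 / bf16": 0.013964,
    "f16 / f16": 0.014504,
}

if __name__ == "__main__":
    for name, mean_kld in F32_BASELINE_ROWS.items():
        print(f"{name:<12} {precision_pct(mean_kld):.2f}%")
    # Assumed offset form, checked against the §14 Q5_K_S 64k bf16 / q8_0 row
    # (99.9% KLD 0.079827, bf16 baseline 99.9% KLD 0.023258).
    print(f"bf16 / q8_0 tail {precision_pct(0.079827, 0.023258):.2f}%")
```

The four f32-baseline rows come out at 98.80%, 98.51%, 98.61%, and 98.56%, matching the table, and the offset form gives 94.50% for the §14 row.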

17. bf16 vs f16 Follow-Up Analysis

17.1 Why The f32 Baseline Matters

The earlier tables used bf16 as the KL reference because it was the practical uncompressed cache format. That was fine for judging q8, q6, q5, q4, and turbo rows, but it made the f16 vs bf16 question awkward: if bf16 is the baseline, then bf16 automatically gets the cleanest comparison. The f32 follow-up removes that tilt: both 16-bit formats are measured as deviations from the same f32 logit file.

17.2 bf16 Beats f16 At The Same Size

The practical symmetric comparison is direct. bf16 / bf16 and f16 / f16 both use 2048 MiB at 32k context, exactly half of the f32 / f32 cache. On PPL, bf16 lands at 6.2147 while f16 lands at 6.2384. On KL, bf16 has lower mean KLD, 0.013964 vs 0.014504, and a much better 99.9% tail, 1.798395 vs 2.314729.

The V-side isolation rows point the same way. With K held at f32, f32 / bf16 gets mean KLD 0.012111 and 99.9% KLD 1.544926. f32 / f16 gets mean KLD 0.015022 and 99.9% KLD 2.224951. That rules out a softer reading where symmetric bf16 merely got lucky as a whole-cache row: on this benchmark, bf16 V is also cleaner than f16 V on its own.

17.3 Precision Wins, Speed Is Mostly A Wash

There is no useful VRAM trade-off between f16 and bf16. Both are 16-bit cache formats, both use the same 2048 MiB at 32k, and both cut the f32 cache in half. If the hardware supports bf16 properly, the precision result is the deciding factor.

The elapsed times do not give f16 a strong counterargument. In the PPL run, bf16 / bf16 finished in 376.55 seconds and f16 / f16 in 381.02 seconds. In the KL run, bf16 / bf16 took 380.41 seconds and f16 / f16 took 374.74 seconds. The mixed rows show the same small wobble: f32 / bf16 took 395.52 seconds and f32 / f16 took 389.81 seconds. That is not enough to justify taking the worse KLD row as a speed preset.
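To put the wobble in relative terms, the elapsed times can be turned into percentage gaps. A minimal sketch with the six timings quoted above follows; the helper name is made up for illustration.

```python
# Elapsed seconds (bf16 variant, f16 variant) from the 32k tables above.
RUNS = {
    "PPL, symmetric": (376.55, 381.02),
    "KL, symmetric": (380.41, 374.74),
    "KL, f32 K": (395.52, 389.81),
}


def f16_time_delta_pct(bf16_seconds, f16_seconds):
    """Percent by which f16 is slower (+) or faster (-) than bf16."""
    return round(100.0 * (f16_seconds - bf16_seconds) / bf16_seconds, 2)


if __name__ == "__main__":
    for run, (bf16_s, f16_s) in RUNS.items():
        print(f"{run:<15} f16 vs bf16: {f16_time_delta_pct(bf16_s, f16_s):+.2f}%")
```

The sign flips between the PPL run and the KL runs, and every gap stays under 2%, which is why the timings do not settle the choice either way.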

17.4 What This Changes

| Observation from f32-baseline rows | KV cache implication |
| --- | --- |
| bf16 / bf16 beats f16 / f16 on PPL, mean KLD, and 99.9% KLD at the same cache size | Use bf16 as the normal 16-bit KV format when the backend supports it |
| f32 / bf16 also beats f32 / f16 | The bf16 edge is not only a whole-cache artifact, it shows up when V is isolated too |
| f16 has no VRAM advantage over bf16 and no stable elapsed-time win in this run | Do not prefer f16 for KV cache unless a specific backend handles bf16 poorly |

This resolves the practical f16 vs bf16 KV-cache question for this benchmark: bf16 is the better 16-bit default. It has the same memory footprint as f16, no clear speed penalty in these runs, and lower divergence from the f32 baseline.
