KV Cache Quantization Benchmarks for Long Context

Tests on Qwen 3.6 27B show why TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM.

The Setup
PPL Hides the Tail, KLD Exposes It
Rotation Changed the Playing Field
What TCQ Adds
Asymmetric KV: Where the Bits Should Go
TurboQuant's Narrowed Niche
The q8 Fidelity Tier
The Preset Ladder
Weight Quant Changes the Cost of Cache Quant
What Else Was Tried
Benchmark Data
1. Perplexity
2. KL Divergence
q6_0 Follow-Up Benchmarks
1. Perplexity
2. KL Divergence
q6_0 Follow-Up Analysis
bf16-K Follow-Up Benchmarks
bf16-K Follow-Up Analysis
bf16 vs f16 Follow-Up Benchmarks
1. Perplexity
2. KL Divergence
bf16 vs f16 Follow-Up Analysis

Disclaimer: these are the results of a few basic KV cache benchmarks, not the whole truth. Longer context should make the differences more pronounced, but that still does not guarantee a 1:1 match with real usage. Maybe TurboQuant magic is better at getting tool calls right at extra-long context in agentic coding, or maybe it is the other way around. I tried some other tests too, but in terms of meaningful results PPL and KLD are all I've got.

1. The Setup

This benchmark started with a narrow question: does TurboQuant-style KV cache compression have any defensible niche in BeeLlama's (my llama.cpp fork) recommendation set, or is q4/q5 strictly better where it fits?

Hardware: one RTX 3090 (24 GB VRAM), Ryzen 7 5700X3D, 32 GB RAM. Model: Qwen 3.6 27B. Q5_K_S weights plus a 64k bf16 KV cache fit on the card; 128k does not, because the weights alone fill too much VRAM. IQ4_XS weights are smaller, so IQ4_XS plus a 128k bf16 KV cache fits. Each context length is capped at the maximum where bf16 KV still runs, because bf16 is the reference every quantized mode is measured against. IQ4_XS at 64k is also included to compare directly against Q5_K_S at the same 64k context, isolating the effect of weight quant on cache quality (§9).

The tok/s column in the benchmark data tables is prefill throughput (batch size 2048, ubatch 256), not generation speed. GPU power was lowered during the runs. These numbers are useful only for relative comparison between cache modes on the same hardware; the absolute values do not represent real-world generation performance.

The tests ran on BeeLlama.cpp v0.1.2, which keeps the normal llama.cpp flow but adds DFlash plus TurboQuant/TCQ KV-cache compression. The TurboQuant implementation being evaluated comes from TheTom's llama-cpp-turboquant, while the TCQ variants come from buun's llama.cpp fork and the accompanying TCQ paper and codebooks.

2. PPL Hides the Tail, KLD Exposes It

The PPL numbers look flat across the board. Even turbo modes barely move the average:

Cache	Size vs bf16	Q5_K_S 64k	IQ4_XS 64k	IQ4_XS 128k
bf16	100%	5.4800	5.5169	5.2724
q8_0	53.1%	5.4774	5.5157	5.2716
q5_1	37.5%	5.4777	5.5181	5.2723
q5_0	34.4%	5.4802	5.5175	5.2738
q4_1	31.3%	5.4808	5.5237	5.2772
q4_0	28.1%	5.4877	5.5251	5.2803
turbo4	25.8%	5.4841	5.5277	5.2822
turbo3_tcq	20.3%	5.5054	5.5426	5.2985
turbo3	19.5%	5.5149	5.5561	5.3084
turbo2_tcq	14.1%	5.5705	5.6085	5.3513
turbo2	13.3%	5.6403	5.6823	5.4287

Full PPL results for all cache modes and configurations are in §11.1.

q5_0 uses 34.4% of bf16 KV, a saving of about 65.6%; q4_0 uses 28.1%. Through q4_0, the entire PPL range is roughly 0.01 in every column. Even turbo3_tcq at 20.3% only adds about 0.03 over bf16. turbo2 at 13.3% shows a visible hit, but even that is under 0.17 PPL absolute. If PPL were the whole story, the recommendation would be simple: compress aggressively and move on.

PPL averages over every token equally, so one position that destroys a JSON key or hallucinates a closing brace gets diluted by thousands of unremarkable tokens. The metric that picks up that tail damage is KL divergence against the bf16 baseline, specifically the 99.9% KLD: the 99.9th-percentile value, which marks where the worst 0.1% of positions begin. The 99.9% precision column converts it to a percentage: 100 * exp(-(quantKLD - bf16KLD)).
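A minimal sketch of that conversion, using 99.9% KLD values from the Q5_K_S 64k table. The function name `kld_precision` is a label chosen for this example, not something taken from llama.cpp.

```python
import math

def kld_precision(quant_kld: float, bf16_kld: float) -> float:
    """Convert a KLD value into a percentage relative to the bf16 baseline."""
    return 100.0 * math.exp(-(quant_kld - bf16_kld))

# 99.9% KLD values copied from the Q5_K_S 64k table.
BF16_TAIL = 0.023258
for name, tail in [("q8_0", 0.078709), ("q4_0", 0.130419), ("turbo2", 0.903576)]:
    print(f"{name}: {kld_precision(tail, BF16_TAIL):.2f}%")
```

Because the bf16 row's own KLD is subtracted first, the baseline always maps to exactly 100%, and any extra divergence shrinks the number exponentially.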

The Q5_K_S 64k KL table shows what PPL hides. Mean precision stays above 99% for most modes, but the 99.9% column tells a different story:

Cache	Size vs bf16	Mean KLD	99% KLD	99.9% KLD	99.9% prec.	Tok/s
bf16	100.0%	0.000375	0.005234	0.023258	100.00%	850.81
q8_0	53.1%	0.002328	0.019669	0.078709	94.61%	851.11
q8_0-q5_1	45.3%	0.002529	0.020346	0.082880	94.21%	828.63
q8_0-q5_0	43.8%	0.002656	0.021039	0.088486	93.69%	847.33
q8_0-q4_1	42.2%	0.003080	0.023390	0.099080	92.70%	786.54
q8_0-q4_0	40.6%	0.003316	0.024892	0.104680	92.18%	849.37
q5_1	37.5%	0.002911	0.022604	0.098354	92.77%	841.65
q8_0-turbo3_tcq	36.7%	0.005090	0.037056	0.149387	88.15%	817.57
q5_0	34.4%	0.003206	0.022759	0.099073	92.70%	849.79
q5_1-q4_1	34.4%	0.003380	0.025886	0.095011	93.08%	846.27
q5_0-q4_1	32.8%	0.003471	0.025829	0.099618	92.65%	847.59
q5_1-q4_0	32.8%	0.003626	0.025668	0.108649	91.82%	846.91
q4_1	31.3%	0.004476	0.031166	0.141813	88.82%	854.33
q5_0-q4_0	31.3%	0.003581	0.027423	0.113332	91.39%	847.64
q5_1-turbo3_tcq	28.9%	0.005594	0.038175	0.144591	88.57%	816.05
q4_0	28.1%	0.004711	0.033663	0.130419	89.84%	855.08
q5_0-turbo3_tcq	27.3%	0.005471	0.038214	0.158514	87.35%	815.80
q5_0-turbo3	27.0%	0.007097	0.048761	0.192428	84.44%	837.90
q4_1-turbo3_tcq	25.8%	0.006184	0.042997	0.174831	85.94%	816.95
turbo4	25.8%	0.004760	0.035205	0.138370	89.13%	705.32
q4_0-turbo3_tcq	24.2%	0.006269	0.045421	0.186572	84.93%	821.89
q4_0-turbo3	23.8%	0.008235	0.056527	0.222154	81.96%	839.29
q4_0-turbo2_tcq	21.1%	0.015168	0.105461	0.395244	68.94%	826.07
turbo3_tcq	20.3%	0.007978	0.058286	0.227104	81.56%	795.20
turbo3	19.5%	0.011181	0.082015	0.296060	76.12%	836.75
turbo3_tcq-turbo2_tcq	17.2%	0.016386	0.115072	0.437043	66.11%	796.16
turbo3-turbo2	16.4%	0.023985	0.168258	0.605087	55.89%	831.88
turbo2_tcq	14.1%	0.023073	0.170350	0.632401	54.38%	807.25
turbo2	13.3%	0.036230	0.276438	0.903576	41.47%	842.29

Full KL tables for all three configurations are in §11.2.

The mean KLD column still looks reasonable: most modes stay below 0.01. But the 99.9% column diverges sharply. q5_0 at 34.4% of bf16 KV has a 99.9% KLD of 0.099, about 31× its mean. q4_0 at 28.1% jumps to 0.130, which looks small as a number but is a 32% increase over q5_0 in the tail, the kind of drift that can break tool calls. Below q4_0, turbo modes fall off a cliff: turbo3_tcq at 20.3% of bf16 KV reaches 0.227 in the tail, and turbo2 at 13.3% hits 0.904, roughly one full nat of divergence at the worst one-in-a-thousand positions.

The size column makes the comparison direct. q4_0 saves 71.9% of bf16 KV and keeps 89.84% of 99.9% precision. turbo3_tcq saves 79.7% and keeps 81.56%. turbo2 saves 86.7% and collapses to 41.47%. The PPL table suggested turbo3_tcq was fine; the tail says it is not, unless you genuinely need the memory and can accept the risk. turbo2 is only viable for tasks where exact structure does not matter.

3. Rotation Changed the Playing Field

The q4/q5/q8 modes in these results are not naive scalar quantization. llama.cpp applies a random rotation to KV cache vectors before quantizing them — the same trick TurboQuant uses. The rotation spreads outlier energy across coordinates, so the scalar quantizer faces a more uniform distribution and makes fewer catastrophic rounding decisions on extreme values. The difference is what happens after rotation: q4/q5 quantize each value independently to 4–5 bits with a scalar codebook, while TurboQuant quantizes to 2–3 bits with a scalar codebook and optionally adds a QJL residual for inner-product estimation. TCQ (§4) constrains index sequences through a trellis instead.
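To make the outlier-smoothing effect concrete, here is a toy sketch in pure Python. It uses random sign flips followed by an orthonormal fast Walsh-Hadamard transform as a stand-in for the rotation. That choice is an assumption for illustration and is not taken from the llama.cpp or TurboQuant sources; the test vector, seed, and function names are hypothetical.

```python
import math
import random

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def rotate(values, signs):
    """Flip signs, then apply the Hadamard transform: an orthogonal transform."""
    return fwht([v * s for v, s in zip(values, signs)])

def peak_to_rms(values):
    """Largest magnitude divided by RMS; high values mean a few outliers dominate."""
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return max(abs(v) for v in values) / rms

rng = random.Random(0)
vec = [rng.gauss(0.0, 0.1) for _ in range(128)]
vec[5] = 10.0  # one extreme outlier coordinate
signs = [rng.choice((-1.0, 1.0)) for _ in vec]
rotated = rotate(vec, signs)

print(f"peak/RMS before: {peak_to_rms(vec):.2f}")
print(f"peak/RMS after:  {peak_to_rms(rotated):.2f}")
```

The transform preserves the vector's total energy but spreads the single large coordinate across all 128 positions, so a fixed-step scalar quantizer no longer has to stretch its range to cover one extreme value.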

TurboQuant is also slower. turbo4 runs at ~705 tok/s (prefill) versus ~850 tok/s for q4_0 on the Q5_K_S config, a 17% throughput penalty. turbo3_tcq runs at ~794 and ~844 tok/s on the two 64k configs and ~647 tok/s at 128k, slower than q4_0 in every case. The rotation and QJL stages add compute that scalar quantization avoids.

The rotation overlap means TurboQuant is not competing against a helpless baseline. It is competing against q4/q5 modes that already benefit from the same outlier smoothing. At 2–3 bits, the scalar codebook is starved enough that rotation alone cannot save you, and TurboQuant's extra structure matters. At 4 bits, turbo4 has no quality advantage over q4_0, saves almost no memory (25.8% vs 28.1% of bf16 KV), and runs slower. The value in TurboQuant is at the low bit rates where q4/q5 cannot reach at all.

4. What TCQ Adds

TCQ changes the low-bit side of the table. buun's paper ("Closing the Gap: Trellis-Coded Quantization for KV Cache at 2–3 Bits") describes TCQ as the first application of trellis-coded quantization to LLM KV cache compression. Instead of choosing each scalar index independently, TCQ constrains index sequences through a finite-state trellis, which gives a much larger effective codebook at the same bit rate. The encoder uses Viterbi to find a globally optimal assignment, while the decode path stays parallel: each value can be decoded in O(1), which is why the method can still fit GPU flash attention kernels.
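The trellis idea can be illustrated with a toy encoder. The sketch below is not buun's implementation: it assumes a bit-shift trellis (the state is simply the last few emitted bits), 1 bit per value, and a made-up 8-entry codebook. It only shows how Viterbi picks a globally optimal index sequence while each value still decodes from a fixed-size window of bits. All names and constants are hypothetical.

```python
STATE_BITS = 3  # the trellis state is the last 3 emitted bits (8 states)
# Made-up reconstruction levels, one per state; a real codebook would be trained.
CODEBOOK = [-1.6, 0.2, -0.7, 1.1, -1.1, 0.7, -0.2, 1.6]

def viterbi_encode(values, codebook=CODEBOOK, state_bits=STATE_BITS):
    """Return (bits, total squared error) for the best path through the trellis."""
    n_states = 1 << state_bits
    mask = n_states - 1
    inf = float("inf")
    cost = [inf] * n_states
    cost[0] = 0.0  # encoder and decoder both start in state 0
    history = []
    for x in values:
        new_cost = [inf] * n_states
        back = [0] * n_states
        for prev in range(n_states):
            if cost[prev] == inf:
                continue
            for bit in (0, 1):
                nxt = ((prev << 1) | bit) & mask  # shift the new bit into the state
                c = cost[prev] + (x - codebook[nxt]) ** 2
                if c < new_cost[nxt]:
                    new_cost[nxt], back[nxt] = c, prev
        history.append(back)
        cost = new_cost
    state = min(range(n_states), key=cost.__getitem__)
    total = cost[state]
    bits = []
    for back in reversed(history):
        bits.append(state & 1)  # the bit emitted at this step is the state's low bit
        state = back[state]
    bits.reverse()
    return bits, total

def decode_at(bits, t, codebook=CODEBOOK, state_bits=STATE_BITS):
    """Decode value t from a fixed-size bit window, independent of other positions."""
    state = 0
    for b in bits[max(0, t - state_bits + 1): t + 1]:
        state = (state << 1) | b
    return codebook[state]

values = [0.9, -1.4, 0.3, 1.5, -0.6, 0.1, -1.0, 0.8]
bits, err = viterbi_encode(values)
decoded = [decode_at(bits, t) for t in range(len(values))]
print(bits, round(err, 4))
```

Each value costs one stored bit, yet it can reconstruct to any of the eight codebook levels because the level is selected by the last three bits. That is the "larger effective codebook at the same bit rate" in miniature, and `decode_at` touches only a constant-size window, which is what keeps decoding parallel.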

The reported TCQ results claim 10–44% KL-divergence reduction over scalar quantization at 2–3 bits per value, with context-adaptive norm scaling and FWHT rotation plus random sign flips. That matches what this BeeLlama bench sees directionally: turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than plain turbo2, especially in the 99.9% tail.
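For a rough sense of scale, the relative KLD reduction from TCQ can be read off the Q5_K_S 64k table in §2. The sketch below only redoes that arithmetic. Note that the TCQ variants are slightly larger (20.3% vs 19.5% and 14.1% vs 13.3% of bf16 KV), so this is not a strictly equal-size comparison. The helper name is hypothetical.

```python
def reduction_pct(plain: float, tcq: float) -> float:
    """Relative KLD reduction of the TCQ variant versus the plain turbo mode."""
    return 100.0 * (plain - tcq) / plain

# ((plain mean KLD, plain 99.9% KLD), (tcq mean KLD, tcq 99.9% KLD)), Q5_K_S 64k.
rows = {
    "turbo3": ((0.011181, 0.296060), (0.007978, 0.227104)),
    "turbo2": ((0.036230, 0.903576), (0.023073, 0.632401)),
}
for name, (plain, tcq) in rows.items():
    mean_cut = reduction_pct(plain[0], tcq[0])
    tail_cut = reduction_pct(plain[1], tcq[1])
    print(f"{name} -> {name}_tcq: mean KLD -{mean_cut:.1f}%, 99.9% KLD -{tail_cut:.1f}%")
```

On these rows the reductions come out at roughly 29% and 36% for mean KLD and 23% and 30% for the 99.9% KLD, all inside the reported 10–44% range.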

There is no turbo4_tcq in this table, and that makes sense. TCQ is most valuable where independent scalar codes are starved for bits. At 2–3 bits, the larger effective codebook can close a visible gap. At 4 bits, the ordinary scalar codebook already has enough resolution, and the extra trellis decode overhead is not worth paying for marginal quality gains on top of a mode that is already uncompetitive with q4_0.

5. Asymmetric KV: Where the Bits Should Go

The asymmetric rows are the most useful part of the final report. They confirm the K-first intuition, but with a limit.

At 31.3% of bf16 KV, q5_0-q4_0 has the same size as symmetric q4_1, yet it beats q4_1 across all three 99.9% precision tables: 91.39% vs 88.82% in Q5_K_S 64k, 96.06% vs 93.73% in IQ4_XS 64k, and 96.18% vs 95.04% in IQ4_XS 128k. Spending the same budget of bits on a stronger K and weaker V outperforms splitting them evenly.
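The size column for asymmetric pairs follows from simple arithmetic, sketched below. The bits-per-value figures for the q* types come from the standard ggml block layouts (for example, q4_0 stores 32 values in 18 bytes, or 4.5 bits each). The turbo figures are back-derived from the size column in this article rather than read from source code, and the sketch assumes K and V hold the same number of elements. Function names are hypothetical.

```python
BF16_BITS = 16.0

# Bits per cached value. q* entries follow ggml block layouts; turbo entries are
# inferred from the "Size vs bf16" column (an assumption, not read from source code).
BITS_PER_VALUE = {
    "q8_0": 8.5, "q5_1": 6.0, "q5_0": 5.5, "q4_1": 5.0, "q4_0": 4.5,
    "turbo4": 4.125, "turbo3_tcq": 3.25, "turbo3": 3.125,
    "turbo2_tcq": 2.25, "turbo2": 2.125,
}

def kv_size_pct(k_type: str, v_type: str) -> float:
    """KV cache size as a percentage of bf16, assuming K and V have equal element counts."""
    avg_bits = (BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type]) / 2
    return 100.0 * avg_bits / BF16_BITS

def kv_size_mib(k_type: str, v_type: str, bf16_mib: float) -> float:
    """Scale a measured bf16 cache size (MiB) to the given K/V pair."""
    return bf16_mib * kv_size_pct(k_type, v_type) / 100.0

for k, v in [("q5_0", "q4_0"), ("q4_1", "q4_1"), ("q5_0", "q4_1"), ("q8_0", "q5_1")]:
    print(f"{k}-{v}: {kv_size_pct(k, v):.2f}% of bf16, "
          f"{kv_size_mib(k, v, 4096.0):.0f} MiB at 64k")
```

q5_0-q4_0 and symmetric q4_1 both average 5 bits per value, which is why they share the 31.3% row: the asymmetric pair moves half a bit from V to K without changing the total.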

q5_0-q4_0 is the clean trade at that size: same footprint as q4_1, but better tail behavior from spending bits asymmetrically toward K. One step up, q5_0-q4_1 costs 32.8% of bf16 KV, only 1.5 points more than q5_0-q4_0, and improves the 99.9% tail to 92.65%, 96.21%, and 96.50% across the three KL tables. Not a huge jump, but it is cheap.

The direct comparison between q5_0-q4_1 and q5_1-q4_0 answers the next-bit question. Both are the same size. q5_0-q4_1 is better in Q5_K_S and IQ4_XS 128k, while IQ4_XS 64k is effectively tied in 99.9% precision. After K reaches q5_0, the next useful bit appears to go to V, not to q5_1 K.

q5_1 remains useful as an extra-conservative option, but it is not where the marginal value is.

6. TurboQuant's Narrowed Niche

The bench is harsh on turbo modes, but it does not invalidate them. I'd say it just clearly defines their niche, albeit one that is much narrower than I expected.

turbo4 is the easiest mode to drop from recommendations. At IQ4_XS 128k, it saves only 192 MiB over q4_0, 2112 MiB vs 2304 MiB, while throughput drops from 710.51 tok/s to 519.76 tok/s and 99.9% precision falls from 94.34% to 93.07%. That is not a good exchange.

Plain turbo3 and turbo2 are also hard to recommend when TCQ variants are available. turbo3_tcq beats turbo3 clearly in the tail, and turbo2_tcq is much better than turbo2. Hardware support matters here: if a backend does not support the TCQ variants, users have to choose between paying more VRAM for q4/q5 or accepting the quality loss from the plain turbo modes. But that should be a documented fallback, not the main path.

The valuable turbo result is turbo3_tcq. It lands at 20.3% of bf16 KV, with 99.9% precision of 81.56%, 84.51%, and 86.06% across the three KL tables. That is not q4 quality, and it should not be sold as such. But it is a real precision/VRAM/speed compromise for users who need the context to fit.

turbo2_tcq is the last resort. It keeps broad PPL better than the tail suggests, but the 99.9% precision values are 54.38%, 58.62%, and 63.53%. That is a mode for rough summarization, long-context reading where exactness is not the job, or setups that otherwise cannot run. It is not a tool-call, JSON, code-edit, or math mode.

There is also a small middle rung: q4_0-turbo3_tcq. It uses 24.2% of bf16 KV and sits between q4_0 and symmetric turbo3_tcq in the 99.9% tail. It too is not a precision preset, but it's useful for users who need something smaller than q4 and cleaner than full turbo3_tcq.

7. The q8 Fidelity Tier

Symmetric q8_0 at 53.1% of bf16 KV sits at the top of the size range, and its numbers show it. Mean precision is 99.80% on Q5_K_S, 99.95% on IQ4_XS, and PPL precision actually exceeds bf16 (100.05% on Q5_K_S, 100.02% on IQ4_XS): the quantization noise is small enough that median perplexity changes by less than run-to-run variance. The 99.9% tail tells a slightly different story: 94.61% on Q5_K_S, 98.69% on IQ4_XS 64k, 98.52% on 128k.

Those are the highest tail numbers in the bench, but they are not perfect, and the Q5_K_S gap from 100% is visible. Combined with VRAM costs, this makes full q8_0-q8_0 a validation and blame-isolation mode, not a practical default.

The asymmetric pairs with q8_0 K walk the V down one tier at a time, and each step has a cost that mirrors the V-tier structure the rest of the bench already found. On Q5_K_S 64k:

Dropping V from q8_0 to q5_1: 94.61% to 94.21%, a 0.40-point loss for a cache that is 7.8 points smaller as a share of bf16 KV. At 45.3% of bf16 KV, q8_0-q5_1 still clears 94% tail precision.
Dropping V from q5_1 to q5_0: 94.21% to 93.69%, 0.52 points for only 1.5 points less size. Stepping inside the q5 V range costs roughly half a point per tier, the same "nearly free but not zero" pattern from §5.
Dropping V from q5_0 to q4_1: 93.69% to 92.70%, 0.99 points. This is the q5-to-q4 V cliff: the full-point penalty that shows up everywhere the V crosses below q5.
Dropping V from q4_1 to q4_0: 92.70% to 92.18%, 0.52 points. Same half-point step as the q5 internal tiers.
Dropping V from q4_0 to turbo3_tcq: 92.18% to 88.15%, a 4.03-point collapse. Even q8_0 K cannot rescue turbo3_tcq V: it falls below symmetric q4_0 (89.84%) despite costing more memory.
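The walkdown above can be recomputed from the Q5_K_S 64k table. This sketch just tabulates, for each step, the tail precision lost and the cache size saved, both in percentage points; the variable and function names are hypothetical.

```python
# (V type paired with q8_0 K, size as % of bf16, 99.9% precision), Q5_K_S 64k table.
walkdown = [
    ("q8_0", 53.1, 94.61),
    ("q5_1", 45.3, 94.21),
    ("q5_0", 43.8, 93.69),
    ("q4_1", 42.2, 92.70),
    ("q4_0", 40.6, 92.18),
    ("turbo3_tcq", 36.7, 88.15),
]

def step_costs(rows):
    """Yield (from, to, precision points lost, size points saved) per V step."""
    for (v_hi, size_hi, prec_hi), (v_lo, size_lo, prec_lo) in zip(rows, rows[1:]):
        yield v_hi, v_lo, round(prec_hi - prec_lo, 2), round(size_hi - size_lo, 1)

for v_hi, v_lo, lost, saved in step_costs(walkdown):
    print(f"V {v_hi} -> {v_lo}: -{lost:.2f} precision points for {saved:.1f} size points")
```

Laying the steps side by side makes the uneven exchange rate visible: the first step buys the most size for the least precision, and the last step is the only one that costs several points at once.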

IQ4_XS shifts the absolute numbers but preserves the step structure. V steps inside q5 cost 0.09–0.14 points, the q5-to-q4 cliff costs 0.94–1.20 points, and the q4_0-to-turbo3_tcq drop is 4.30 points on both 64k and 128k. The same V-tier boundaries that the symmetric and q5-asymmetric modes already showed also govern the q8-tier walkdown.

That walkdown makes the fidelity tier's practical boundaries clear. At q8_0 K, V values inside q5 cost about half a point or less per tier and stay above 93.6% tail precision; crossing below q5 V costs a full point. The two useful presets that balance VRAM and precision well are q8_0-q5_1 and q8_0-q5_0.

q8_0-q5_1 at 45.3% of bf16 KV is the best precision-per-size trade at the high end. It gives up only 0.40 points of 99.9% tail precision versus symmetric q8_0 (94.21% vs 94.61%) while shrinking the cache by 7.8 percentage points. On IQ4_XS, the loss is just 0.41 points on 64k (98.28% vs 98.69%) and 0.39 on 128k. Mean precision stays above 99.78% on both weight quants. That is the fidelity preset: the smallest q8-tier config that stays above 94% tail precision, and the one that earns the bold row in §8.

q8_0-q5_0 at 43.8% saves another 1.5% of bf16 KV but costs 0.52 more tail points (93.69% vs 94.21%). It exists as a fallback if q8_0-q5_1 does not fit by a narrow margin, not as a peer recommendation. Where dropping V from q8_0 to q5_1 costs 0.40 points for a 7.8% size reduction, dropping from q5_1 to q5_0 costs 0.52 points for only 1.5%.

Below q5 V, the returns collapse. q8_0-q4_1 at 42.2% loses 0.99 points from q8_0-q5_0 for 1.6 points less size. q8_0-q4_0 at 40.6% loses another 0.52 points. Neither beats symmetric q5_0 (34.4%, 92.70%) in the tail despite being 6–8 points larger as a share of bf16 KV, because the q5-to-q4 V cliff erases the q8_0 K advantage. The comparison with symmetric q5_1 makes this concrete: q8_0-q4_1 is 4.7 points larger yet nearly tied with q5_1 at 92.70% vs 92.77%.

The q8_0 K tier buys back nearly all of the q4_1 V penalty, confirming that K precision dominates the tail, but the net result is no better than a fully symmetric pair that costs less. The same pattern holds on IQ4_XS (q8_0-q4_1 at 96.99% is 0.47 points below q5_1 at 97.46% on 64k, 0.65 on 128k). q8_0-q4_1 is more interesting technically than practically, as it clearly shows the limits of K-vs-V trade-offs. (Its Q5_K_S speed of 786.54 tok/s is anomalous; other asymmetric pairs at similar sizes and the IQ4_XS re-run both show normal throughput.)

8. The Preset Ladder

The old recommendation set treated TurboQuant too broadly. The new one should separate ordinary precision modes, fidelity modes, and long-context survival modes.

K / V	% of bf16 size	99.9% precision	What it is for
bf16 / bf16	100	100.00%	Preserving full quality
q8_0 / q8_0	53.1	94.61%	Compression with minimal losses
q8_0 / q5_1	45.3	94.21%	Best precision/size at high end
q8_0 / q5_0	43.8	93.69%	If q8_0 / q5_1 doesn't fit just a bit
q5_0 / q5_0	34.4	92.70%	Good precision/size, relatively safe
q5_0 / q4_1	32.8	92.65%	Best default if VRAM-constrained
q5_0 / q4_0	31.3	91.39%	If q5_0 / q4_1 doesn't fit just a bit
q4_0 / q4_0	28.1	89.84%	Memory saving with precision loss
q4_0 / turbo3_tcq	24.2	84.93%	Smaller than q4, cleaner than turbo3_tcq
turbo3_tcq / turbo3_tcq	20.3	81.56%	Viable as extreme compression
turbo2_tcq / turbo2_tcq	14.1	54.38%	Last resort: not for code and tool calls
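One property worth checking on a ladder like this is that no rung is dominated: each step down in size should also cost some tail precision, so every row is a real trade. A minimal sketch of that check on the rows above (Q5_K_S 64k numbers; the function name is hypothetical):

```python
# (K / V, size as % of bf16, 99.9% precision) copied from the ladder above.
ladder = [
    ("bf16 / bf16", 100.0, 100.00),
    ("q8_0 / q8_0", 53.1, 94.61),
    ("q8_0 / q5_1", 45.3, 94.21),
    ("q8_0 / q5_0", 43.8, 93.69),
    ("q5_0 / q5_0", 34.4, 92.70),
    ("q5_0 / q4_1", 32.8, 92.65),
    ("q5_0 / q4_0", 31.3, 91.39),
    ("q4_0 / q4_0", 28.1, 89.84),
    ("q4_0 / turbo3_tcq", 24.2, 84.93),
    ("turbo3_tcq / turbo3_tcq", 20.3, 81.56),
    ("turbo2_tcq / turbo2_tcq", 14.1, 54.38),
]

def dominated(rows):
    """Names of rows that some other row matches or beats on both size and precision."""
    losers = []
    for name, size, prec in rows:
        for other, o_size, o_prec in rows:
            if other == name:
                continue
            no_worse = o_size <= size and o_prec >= prec
            strictly_better = o_size < size or o_prec > prec
            if no_worse and strictly_better:
                losers.append(name)
                break
    return losers

print(dominated(ladder))  # [] : every rung trades size for precision
# Adding symmetric q4_1 (31.3%, 88.82%) shows what a dominated mode looks like.
print(dominated(ladder + [("q4_1 / q4_1", 31.3, 88.82)]))
```

On the Q5_K_S numbers the ladder itself comes back clean, while symmetric q4_1 is flagged because q5_0 / q4_0 has the same size and a better tail. This only checks one table; the recommendations in the text also weigh the IQ4_XS results and throughput.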

The modes I would not present as first-class recommendations are q4_1/q4_1, q5_1/q4_1, q5_1/q4_0, q8_0/q4_0, q8_0/turbo3_tcq, q5_0/turbo3_tcq, q5_1/turbo3_tcq, q4_1/turbo3_tcq, q4_0/turbo2_tcq, plain turbo3, plain turbo2, and turbo4 unless its speed/implementation profile changes.

That is not a rejection of TurboQuant, but a better boundary. If q5 or q4 fits, use q5 or q4. If the job is strict and tool-heavy, prefer q5_0. If the job is memory-bound but still ordinary, use q5_0/q4_0 or q5_0/q4_1. If the job cannot fit without deeper KV compression, that is where turbo3_tcq earns its place. If even that does not fit, turbo2_tcq exists, but the user should know what they are giving up.

The benchmark does not show "Q beats TQ" or "TQ beats Q", but that the line between them has moved. Rotation made ordinary q4/q5 quite decent, TCQ makes very low-bit turbo modes much more viable than plain turbo2/turbo3, and the user wins because there's a somewhat clear preset ladder with obvious trade-offs.

9. Weight Quant Changes the Cost of Cache Quant

The bench ran two weight quantizations: Q5_K_S and IQ4_XS. Same model, same context lengths where they overlap, same cache modes. The comparison is a controlled variable that most KV cache benchmarks do not isolate. They should: weight precision changes how much cache quantization costs in the tail.

At 64k context, most symmetric cache modes land 3–5 points higher in 99.9% precision on IQ4_XS than on Q5_K_S (turbo3_tcq is just under 3 points, and turbo2 has already collapsed on both):

Cache	Q5_K_S 99.9% prec.	IQ4_XS 99.9% prec.	Gap
q8_0	94.61%	98.69%	+4.08
q5_0	92.70%	97.32%	+4.62
q4_0	89.84%	93.01%	+3.17
turbo4	89.13%	93.03%	+3.90
turbo3_tcq	81.56%	84.51%	+2.95
turbo3	76.12%	79.27%	+3.15
turbo2_tcq	54.38%	58.62%	+4.24
turbo2	41.47%	41.01%	-0.46

IQ4_XS is consistently less damaged by the same cache quant. The gap exists across almost every mode and only collapses at turbo2, where both weight quants are already on the floor.

The raw KLD numbers make the gap sharper. Q5_K_S q8_0 has a 99.9% KLD of 0.078709; IQ4_XS q8_0 has 0.017372. Same cache mode, 4.5× the tail damage on the higher-precision weights. At q4_0 the ratio drops to 1.7× (0.130419 vs 0.076663): the cache quantization noise starts to dominate over the weight quantization the deeper you compress.

The reason is in what those KV values carry. Q5_K_S produces richer KV distributions: structurally important tokens, sharp probability differences, outlier activations with real signal. IQ4_XS already injects weight noise into the attention path, so its KV distributions are smoother and carry less fine detail.

When you quantize those KV values, you lose more from Q5_K_S because there is more to lose. The absolute output quality of Q5_K_S with bf16 KV is still better, but KL divergence is measured against each model's own bf16 baseline, so a Q5_K_S model at q4_0 cache has moved further from its own potential than an IQ4_XS model at q4_0 has from its (lower) potential.

The practical resolution: the same cache preset is tail-safer on a lower-weight-precision model than on a higher-weight-precision model. If you are running Q5_K_S, you should lean harder toward q5_0 over q4_0 than you would on IQ4_XS.

10. What Else Was Tried

Wikitext PPL and KL divergence are the only benchmarks that made it into this article, but several other tests were tried and abandoned because they could not distinguish between cache quantization modes.

Multiple-choice benchmarks (ARC-Challenge, ARC-Easy, HellaSwag, MMLU) all run at short context, well under 10K tokens. The KV cache is tiny at those lengths, and every cache mode scores within noise of every other mode. They measure whether the model knows things, not whether quantized cache degrades retrieval or reasoning. The spread across cache types was indistinguishable from run-to-run variance.

Perplexity-based HellaSwag and Winogrande have the same problem: they do not exercise the KV cache enough to show a difference. They confirm that the model still speaks English after cache quantization, which was never in doubt.

Synthetic passkey and needle-in-a-haystack retrieval were tried at 32K context. Every cache type, including turbo2, scored 100%. The model can regurgitate a hidden string just fine even with aggressively quantized cache, because retrieval of a single attended token is a different failure mode than slowly diverging output distributions over thousands of tokens. A 30-question passkey test does not have the statistical power to catch what KL divergence catches.

JSON schema generation (structured output constrained by a schema) also hit 100% for all cache types. Cache quantization does not break template compliance at the scale tested.

Native passkey via llama-passkey was excluded because it requires the model's native context length of 262144 tokens, which OOMs on 24 GB VRAM and has no --ctx-size override.

AIME (30 mathematics competition problems) is the only discarded benchmark that could show a difference. The problem is cost: on an RTX 3090, one pass through the 30 problems for a single cache configuration would take roughly 3–4 days of non-stop inference. I committed to it anyway and ran bf16, q8_0, and q4_0 before realizing that a single pass is nearly useless: 30 questions is wildly noisy for a binary measure of correctness, with q8_0 and q4_0 getting the same result. You need dozens of runs to average out the variance, and by the time that finishes, AGI will have already been invented and the whole topic of local inference optimization will be moot.
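The noise problem is easy to quantify with a back-of-the-envelope binomial estimate. The sketch below assumes each problem is an independent pass/fail trial with the same success probability, which is a simplification, and the 50% solve rate is a placeholder rather than a measured result. Function names are hypothetical.

```python
import math

def accuracy_std_error(p: float, n: int) -> float:
    """Standard error of a measured accuracy over n independent pass/fail questions."""
    return math.sqrt(p * (1.0 - p) / n)

def questions_needed(p: float, delta: float, z: float = 1.96) -> int:
    """Questions per configuration so that a gap `delta` between two runs
    exceeds z standard errors of the difference."""
    # Var(difference of two independent accuracies) = 2 * p * (1 - p) / n
    return math.ceil(2.0 * p * (1.0 - p) * (z / delta) ** 2)

p = 0.5  # placeholder solve rate, not a measured value
print(f"one 30-question pass: +/- {100 * accuracy_std_error(p, 30):.1f} points (1 sigma)")
print(f"questions to resolve a 5-point gap: {questions_needed(p, 0.05)}")
```

Under these assumptions a single 30-question pass carries about 9 points of one-sigma noise, and resolving a 5-point gap between two cache modes would take roughly 770 questions per mode, or about 26 full passes each.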

11. Benchmark Data

11.1 Perplexity

Q5_K_S 64k

Cache	KV cache (MiB)	Size vs bf16	Median PPL	Precision vs bf16	PPL +/-	Same top p	Tok/s	Elapsed (s)
bf16	4096.00	100.0%	5.4800	100.00%	0.03465	99.647% +/- 0.016%	851.75	326.30
q8_0	2176.00	53.1%	5.4774	100.05%	0.03465	97.942% +/- 0.039%	851.57	331.27
q5_1	1536.00	37.5%	5.4777	100.04%	0.03464	97.787% +/- 0.041%	848.27	332.64
q5_0	1408.00	34.4%	5.4802	100.00%	0.03466	97.707% +/- 0.041%	848.36	332.45
q4_1	1280.00	31.3%	5.4808	99.99%	0.03467	97.259% +/- 0.045%	853.49	330.43
q4_0	1152.00	28.1%	5.4877	99.86%	0.03473	97.179% +/- 0.046%	853.50	330.76
turbo4	1056.00	25.8%	5.4841	99.93%	0.03468	97.037% +/- 0.047%	705.06	395.16
turbo3_tcq	832.00	20.3%	5.5054	99.54%	0.03480	96.265% +/- 0.052%	794.21	353.31
turbo3	800.00	19.5%	5.5149	99.37%	0.03493	95.517% +/- 0.057%	802.71	344.83
turbo2_tcq	576.00	14.1%	5.5705	98.38%	0.03566	93.456% +/- 0.068%	805.25	348.78
turbo2	544.00	13.3%	5.6403	97.16%	0.03581	91.646% +/- 0.076%	840.34	335.07

IQ4_XS 64k

Cache	KV cache (MiB)	Size vs bf16	Median PPL	Precision vs bf16	PPL +/-	Same top p	Tok/s	Elapsed (s)
bf16	4096.00	100.0%	5.5169	100.00%	0.03497	99.776% +/- 0.013%	909.83	336.43
q8_0	2176.00	53.1%	5.5157	100.02%	0.03499	98.950% +/- 0.028%	910.43	309.49
q5_1	1536.00	37.5%	5.5181	99.98%	0.03501	98.618% +/- 0.032%	906.38	310.47
q5_0	1408.00	34.4%	5.5175	99.99%	0.03500	98.553% +/- 0.033%	906.88	310.27
q4_1	1280.00	31.3%	5.5237	99.88%	0.03505	97.880% +/- 0.040%	911.35	308.85
q4_0	1152.00	28.1%	5.5251	99.85%	0.03505	97.793% +/- 0.041%	912.54	308.61
turbo4	1056.00	25.8%	5.5277	99.80%	0.03508	97.652% +/- 0.042%	746.00	372.98
turbo3_tcq	832.00	20.3%	5.5426	99.54%	0.03513	96.569% +/- 0.050%	844.33	331.64
turbo3	800.00	19.5%	5.5561	99.29%	0.03533	95.746% +/- 0.056%	882.76	318.29
turbo2_tcq	576.00	14.1%	5.6085	98.37%	0.03599	93.669% +/- 0.067%	858.66	326.24
turbo2	544.00	13.3%	5.6823	97.09%	0.03621	91.865% +/- 0.076%	898.66	312.75

IQ4_XS 128k

Cache	KV cache (MiB)	Size vs bf16	Median PPL	Precision vs bf16	PPL +/-	Same top p	Tok/s	Elapsed (s)
bf16	8192.00	100.0%	5.2724	100.00%	0.03269	99.995% +/- 0.002%	703.61	389.94
q8_0	4352.00	53.1%	5.2716	100.02%	0.03271	98.950% +/- 0.028%	707.02	387.84
q5_1	3072.00	37.5%	5.2723	100.00%	0.03271	98.603% +/- 0.032%	702.53	390.33
q5_0	2816.00	34.4%	5.2738	99.97%	0.03272	98.543% +/- 0.033%	703.18	390.05
q4_1	2560.00	31.3%	5.2772	99.91%	0.03276	97.961% +/- 0.039%	708.53	387.23
q4_0	2304.00	28.1%	5.2803	99.85%	0.03276	97.793% +/- 0.041%	709.40	386.58
turbo4	2112.00	25.8%	5.2822	99.81%	0.03281	97.639% +/- 0.042%	520.46	520.82
turbo3_tcq	1664.00	20.3%	5.2985	99.51%	0.03281	96.591% +/- 0.050%	647.40	421.95
turbo3	1600.00	19.5%	5.3084	99.32%	0.03301	95.861% +/- 0.055%	689.73	396.86
turbo2_tcq	1152.00	14.1%	5.3513	98.53%	0.03363	93.807% +/- 0.067%	655.44	416.90
turbo2	1088.00	13.3%	5.4287	97.12%	0.03386	92.033% +/- 0.075%	696.53	393.24

11.2 KL Divergence

KL precision uses 100 * exp(-(quantKLD - bf16KLD)). The 99.9% precision column applies the same formula to 99.9% KLD. Symmetric and asymmetric cache configurations are combined in each table; asymmetric cache names are K-V.

Q5_K_S 64k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
bf16	4096.00	100.0%	0.000375	100.00%	0.000058	0.000568	0.001693	0.005234	0.023258	100.00%	7.374046	99.647% +/- 0.016%	850.81	337.59
q8_0	2176.00	53.1%	0.002328	99.80%	0.000125	0.004233	0.006656	0.019669	0.078709	94.61%	14.355996	97.942% +/- 0.039%	851.11	337.66
q8_0-q5_1	1856.00	45.3%	0.002529	99.78%	0.000143	0.004557	0.007108	0.020346	0.082880	94.21%	15.367683	97.884% +/- 0.040%	828.63	346.71
q8_0-q5_0	1792.00	43.8%	0.002656	99.77%	0.000168	0.004673	0.007348	0.021039	0.088486	93.69%	17.987650	97.826% +/- 0.040%	847.33	338.90
q8_0-q4_1	1728.00	42.2%	0.003080	99.73%	0.000115	0.005645	0.008587	0.023390	0.099080	92.70%	8.073231	97.655% +/- 0.042%	786.54	364.58
q8_0-q4_0	1664.00	40.6%	0.003316	99.71%	0.000165	0.005976	0.009075	0.024892	0.104680	92.18%	13.481506	97.532% +/- 0.043%	849.37	338.13
q5_1	1536.00	37.5%	0.002911	99.75%	0.000167	0.005045	0.007916	0.022604	0.098354	92.77%	13.397068	97.787% +/- 0.041%	841.65	341.63
q8_0-turbo3_tcq	1504.00	36.7%	0.005090	99.53%	0.000188	0.009736	0.014401	0.037056	0.149387	88.15%	20.128752	96.899% +/- 0.048%	817.57	350.23
q5_0	1408.00	34.4%	0.003206	99.72%	0.000286	0.005232	0.008194	0.022759	0.099073	92.70%	22.619892	97.707% +/- 0.041%	849.79	338.00
q5_1-q4_1	1408.00	34.4%	0.003380	99.70%	0.000195	0.006140	0.009479	0.025886	0.095011	93.08%	21.394011	97.529% +/- 0.043%	846.27	339.25
q5_0-q4_1	1344.00	32.8%	0.003471	99.69%	0.000206	0.006310	0.009582	0.025829	0.099618	92.65%	21.863117	97.539% +/- 0.043%	847.59	339.65
q5_1-q4_0	1344.00	32.8%	0.003626	99.68%	0.000212	0.006441	0.009773	0.025668	0.108649	91.82%	15.809726	97.515% +/- 0.043%	846.91	339.23
q4_1	1280.00	31.3%	0.004476	99.59%	0.000267	0.007716	0.011901	0.031166	0.141813	88.82%	18.150869	97.259% +/- 0.045%	854.33	336.49
q5_0-q4_0	1280.00	31.3%	0.003581	99.68%	0.000174	0.006600	0.010058	0.027423	0.113332	91.39%	14.938599	97.437% +/- 0.044%	847.64	338.79
q5_1-turbo3_tcq	1184.00	28.9%	0.005594	99.48%	0.000291	0.010264	0.015324	0.038175	0.144591	88.57%	24.684429	96.878% +/- 0.048%	816.05	350.73
q4_0	1152.00	28.1%	0.004711	99.57%	0.000301	0.008439	0.012949	0.033663	0.130419	89.84%	21.636135	97.179% +/- 0.046%	855.08	336.11
q5_0-turbo3_tcq	1120.00	27.3%	0.005471	99.49%	0.000265	0.010259	0.015229	0.038214	0.158514	87.35%	22.268801	96.865% +/- 0.048%	815.80	350.94
q5_0-turbo3	1104.00	27.0%	0.007097	99.33%	0.000259	0.013747	0.020259	0.048761	0.192428	84.44%	18.094296	96.331% +/- 0.052%	837.90	342.47
q4_1-turbo3_tcq	1056.00	25.8%	0.006184	99.42%	0.000292	0.011652	0.017320	0.042997	0.174831	85.94%	25.079035	96.663% +/- 0.050%	816.95	350.43
turbo4	1056.00	25.8%	0.004760	99.55%	0.000201	0.009046	0.013692	0.035205	0.138370	89.13%	13.967494	97.037% +/- 0.047%	705.32	401.18
q4_0-turbo3_tcq	992.00	24.2%	0.006269	99.41%	0.000270	0.012220	0.018173	0.045421	0.186572	84.93%	23.157375	96.622% +/- 0.050%	821.89	349.67
q4_0-turbo3	976.00	23.8%	0.008235	99.22%	0.000336	0.015576	0.022828	0.056527	0.222154	81.96%	24.353268	96.075% +/- 0.054%	839.29	341.78
q4_0-turbo2_tcq	864.00	21.1%	0.015168	98.53%	0.000288	0.031826	0.045569	0.105461	0.395244	68.94%	20.743238	94.591% +/- 0.062%	826.07	347.04
turbo3_tcq	832.00	20.3%	0.007978	99.24%	0.000267	0.015663	0.023628	0.058286	0.227104	81.56%	20.517471	96.265% +/- 0.052%	795.20	359.09
turbo3	800.00	19.5%	0.011181	98.93%	0.000304	0.022805	0.034209	0.082015	0.296060	76.12%	22.977211	95.517% +/- 0.057%	836.75	342.73
turbo3_tcq-turbo2_tcq	704.00	17.2%	0.016386	98.41%	0.000283	0.034186	0.049133	0.115072	0.437043	66.11%	18.275532	94.379% +/- 0.064%	796.16	358.86
turbo3-turbo2	672.00	16.4%	0.023985	97.67%	0.000403	0.050100	0.072850	0.168258	0.605087	55.89%	20.812553	93.154% +/- 0.070%	831.88	344.85
turbo2_tcq	576.00	14.1%	0.023073	97.76%	0.000420	0.048777	0.071865	0.170350	0.632401	54.38%	24.771320	93.456% +/- 0.068%	807.25	354.12
turbo2	544.00	13.3%	0.036230	96.48%	0.000465	0.078942	0.117545	0.276438	0.903576	41.47%	26.508263	91.646% +/- 0.076%	842.29	340.66

IQ4_XS 64k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
bf16	4096.00	100.0%	0.000097	100.00%	0.000020	0.000186	0.000398	0.001062	0.004152	100.00%	2.345056	99.776% +/- 0.013%	909.80	315.80
q8_0	2176.00	53.1%	0.000577	99.95%	0.000073	0.000933	0.001428	0.004050	0.017372	98.69%	8.130807	98.950% +/- 0.028%	912.71	314.76
q8_0-q5_1	1856.00	45.3%	0.000836	99.93%	0.000110	0.001291	0.001929	0.005272	0.021544	98.28%	11.861989	98.814% +/- 0.030%	895.23	320.91
q8_0-q5_0	1792.00	43.8%	0.000881	99.92%	0.000107	0.001392	0.002092	0.005574	0.022435	98.19%	10.867614	98.714% +/- 0.031%	906.00	316.81
q8_0-q4_1	1728.00	42.2%	0.001317	99.88%	0.000122	0.002436	0.003572	0.008974	0.034706	96.99%	6.675264	98.357% +/- 0.035%	818.78	346.30
q8_0-q4_0	1664.00	40.6%	0.001606	99.85%	0.000118	0.002793	0.004084	0.009969	0.039299	96.55%	8.993986	98.309% +/- 0.036%	908.09	316.08
q5_1	1536.00	37.5%	0.001019	99.91%	0.000075	0.001787	0.002724	0.007262	0.029854	97.46%	6.707493	98.618% +/- 0.032%	907.45	316.44
q8_0-turbo3_tcq	1504.00	36.7%	0.003336	99.68%	0.000119	0.006580	0.009451	0.022374	0.084818	92.25%	11.130499	97.411% +/- 0.044%	871.88	328.10
q5_0	1408.00	34.4%	0.001135	99.90%	0.000088	0.002028	0.003113	0.008113	0.031348	97.32%	8.001913	98.553% +/- 0.033%	908.72	315.93
q5_1-q4_1	1408.00	34.4%	0.001683	99.84%	0.000140	0.002928	0.004329	0.010956	0.038976	96.58%	11.203689	98.302% +/- 0.036%	906.39	316.62
q5_0-q4_1	1344.00	32.8%	0.001529	99.86%	0.000037	0.003073	0.004593	0.011795	0.042828	96.21%	2.328933	98.227% +/- 0.036%	905.93	316.98
q5_1-q4_0	1344.00	32.8%	0.001813	99.83%	0.000163	0.003236	0.004782	0.011742	0.042893	96.20%	18.077213	98.160% +/- 0.037%	905.51	317.08
q4_1	1280.00	31.3%	0.002316	99.78%	0.000104	0.004441	0.006776	0.016734	0.068858	93.73%	8.874204	97.880% +/- 0.040%	913.32	314.50
q5_0-q4_0	1280.00	31.3%	0.001936	99.82%	0.000147	0.003368	0.005013	0.012419	0.044393	96.06%	14.364779	98.125% +/- 0.037%	906.57	316.61
q5_1-turbo3_tcq	1184.00	28.9%	0.003560	99.65%	0.000130	0.007025	0.010176	0.024077	0.088706	91.89%	10.154081	97.304% +/- 0.045%	870.78	328.52
q4_0	1152.00	28.1%	0.002759	99.73%	0.000141	0.005219	0.007950	0.019862	0.076663	93.01%	10.045764	97.793% +/- 0.041%	914.36	314.20
q5_0-turbo3_tcq	1120.00	27.3%	0.003600	99.65%	0.000099	0.007198	0.010485	0.024752	0.102109	90.67%	9.033820	97.295% +/- 0.045%	869.81	328.94
q5_0-turbo3	1104.00	27.0%	0.005209	99.49%	0.000150	0.010602	0.015245	0.036580	0.134359	87.79%	11.506174	96.750% +/- 0.049%	894.86	320.42
q4_1-turbo3_tcq	1056.00	25.8%	0.004226	99.59%	0.000107	0.008465	0.012513	0.029618	0.121854	88.90%	8.723818	97.117% +/- 0.046%	871.93	328.15
turbo4	1056.00	25.8%	0.002988	99.71%	0.000130	0.005881	0.008839	0.021868	0.076363	93.03%	9.168183	97.652% +/- 0.042%	744.85	379.35
q4_0-turbo3_tcq	992.00	24.2%	0.004466	99.56%	0.000123	0.009097	0.013470	0.032288	0.108662	90.08%	9.383696	97.067% +/- 0.047%	871.33	328.57
q4_0-turbo3	976.00	23.8%	0.006007	99.41%	0.000136	0.012352	0.018044	0.042428	0.161644	85.43%	11.092446	96.508% +/- 0.051%	897.34	319.56
q4_0-turbo2_tcq	864.00	21.1%	0.013595	98.66%	0.000204	0.028687	0.041321	0.095393	0.367825	69.51%	14.222522	94.742% +/- 0.062%	881.14	325.07
turbo3_tcq	832.00	20.3%	0.006038	99.41%	0.000127	0.012714	0.019058	0.046219	0.172480	84.51%	9.415985	96.569% +/- 0.050%	845.55	337.45
turbo3	800.00	19.5%	0.009102	99.10%	0.000164	0.019502	0.029311	0.068266	0.236472	79.27%	10.847077	95.746% +/- 0.056%	894.11	320.59
turbo3_tcq-turbo2_tcq	704.00	17.2%	0.014461	98.57%	0.000165	0.031014	0.045330	0.104559	0.374854	69.02%	10.150832	94.578% +/- 0.063%	847.59	336.76
turbo3-turbo2	672.00	16.4%	0.022168	97.82%	0.000271	0.046698	0.068008	0.160327	0.602649	54.96%	18.985191	93.434% +/- 0.068%	884.98	323.44
turbo2_tcq	576.00	14.1%	0.020739	97.96%	0.000230	0.045497	0.068026	0.161256	0.538190	58.62%	15.582352	93.669% +/- 0.067%	861.17	331.75
turbo2	544.00	13.3%	0.034380	96.63%	0.000340	0.075876	0.113734	0.265535	0.895385	41.01%	19.482079	91.865% +/- 0.076%	901.01	318.44

IQ4_XS 128k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
bf16	8192.00	100.0%	0.000000	100.00%	0.000000	0.000015	0.000023	0.000037	0.000051	100.00%	0.000067	99.995% +/- 0.002%	702.50	400.97
q8_0	4352.00	53.1%	0.000482	99.95%	0.000007	0.000983	0.001508	0.004061	0.014951	98.52%	0.478175	98.950% +/- 0.028%	708.31	397.81
q8_0-q5_1	3712.00	45.3%	0.000651	99.93%	0.000012	0.001335	0.002010	0.005161	0.018918	98.13%	1.270212	98.779% +/- 0.030%	694.87	405.20
q8_0-q5_0	3584.00	43.8%	0.000703	99.93%	0.000013	0.001433	0.002141	0.005523	0.020360	97.99%	1.166986	98.757% +/- 0.031%	702.31	400.84
q8_0-q4_1	3456.00	42.2%	0.001149	99.89%	0.000012	0.002453	0.003582	0.008568	0.029970	97.05%	0.964733	98.407% +/- 0.035%	637.52	440.27
q8_0-q4_0	3328.00	40.6%	0.001295	99.87%	0.000016	0.002765	0.003998	0.009587	0.035741	96.49%	1.614931	98.304% +/- 0.036%	706.17	398.89
q5_1	3072.00	37.5%	0.000827	99.92%	0.000008	0.001792	0.002687	0.006764	0.023291	97.70%	0.496846	98.603% +/- 0.032%	702.81	400.71
q8_0-turbo3_tcq	3008.00	36.7%	0.003167	99.68%	0.000029	0.006791	0.009860	0.022935	0.081350	92.19%	2.329764	97.407% +/- 0.044%	672.90	417.21
q5_0	2816.00	34.4%	0.000926	99.91%	0.000007	0.002037	0.003067	0.007468	0.027410	97.30%	0.427949	98.543% +/- 0.033%	704.01	400.14
q5_1-q4_1	2816.00	34.4%	0.001335	99.87%	0.000013	0.002884	0.004221	0.010047	0.035062	96.56%	0.850439	98.334% +/- 0.035%	703.75	400.05
q5_0-q4_1	2688.00	32.8%	0.001387	99.86%	0.000013	0.003017	0.004437	0.010411	0.035706	96.50%	0.971476	98.243% +/- 0.036%	702.90	400.58
q5_1-q4_0	2688.00	32.8%	0.001485	99.85%	0.000014	0.003200	0.004746	0.011342	0.040530	96.03%	0.976738	98.200% +/- 0.037%	702.58	400.94
q4_1	2560.00	31.3%	0.001933	99.81%	0.000013	0.004318	0.006595	0.016046	0.050918	95.04%	0.435837	97.961% +/- 0.039%	709.38	397.11
q5_0-q4_0	2560.00	31.3%	0.001529	99.85%	0.000016	0.003316	0.004868	0.011640	0.039033	96.18%	1.116606	98.211% +/- 0.037%	704.22	399.88
q5_1-turbo3_tcq	2368.00	28.9%	0.003360	99.66%	0.000029	0.007229	0.010496	0.025104	0.089474	91.45%	2.237170	97.375% +/- 0.044%	670.63	418.53
q4_0	2304.00	28.1%	0.002259	99.77%	0.000017	0.005058	0.007697	0.018505	0.058301	94.34%	1.074671	97.793% +/- 0.041%	710.51	396.57
q5_0-turbo3_tcq	2240.00	27.3%	0.003391	99.66%	0.000030	0.007321	0.010567	0.024422	0.090901	91.32%	2.252987	97.384% +/- 0.044%	670.54	418.68
q5_0-turbo3	2208.00	27.0%	0.004728	99.53%	0.000035	0.010375	0.014767	0.034198	0.121340	88.58%	1.809964	96.732% +/- 0.049%	693.49	405.71
q4_1-turbo3_tcq	2112.00	25.8%	0.003981	99.60%	0.000034	0.008612	0.012701	0.030111	0.112812	89.34%	2.193686	97.182% +/- 0.046%	672.84	417.30
turbo4	2112.00	25.8%	0.002605	99.74%	0.000024	0.005701	0.008653	0.020367	0.071902	93.07%	1.179263	97.639% +/- 0.042%	519.76	531.96
q4_0-turbo3_tcq	1984.00	24.2%	0.004131	99.59%	0.000032	0.009062	0.013401	0.031435	0.112297	89.38%	2.139016	97.078% +/- 0.047%	671.63	418.13
q4_0-turbo3	1952.00	23.8%	0.005488	99.45%	0.000040	0.012073	0.017644	0.040047	0.145158	86.49%	1.545795	96.549% +/- 0.050%	695.78	404.37
q4_0-turbo2_tcq	1728.00	21.1%	0.013329	98.68%	0.000090	0.029805	0.042793	0.096811	0.306630	73.60%	6.779726	94.713% +/- 0.062%	678.10	414.30
turbo3_tcq	1664.00	20.3%	0.005708	99.43%	0.000045	0.012748	0.019441	0.045320	0.150144	86.06%	2.079510	96.591% +/- 0.050%	647.62	432.43
turbo3	1600.00	19.5%	0.008334	99.17%	0.000057	0.019023	0.028596	0.066157	0.207468	81.27%	2.454834	95.861% +/- 0.055%	691.41	406.63
turbo3_tcq-turbo2_tcq	1408.00	17.2%	0.014344	98.58%	0.000086	0.032243	0.046737	0.105269	0.343951	70.90%	4.010866	94.530% +/- 0.063%	648.06	432.26
turbo3-turbo2	1344.00	16.4%	0.020468	97.97%	0.000154	0.045514	0.065976	0.147745	0.474415	62.23%	11.938387	93.540% +/- 0.068%	686.57	409.17
turbo2_tcq	1152.00	14.1%	0.019857	98.03%	0.000122	0.045491	0.068398	0.158766	0.453761	63.53%	4.370085	93.807% +/- 0.067%	656.84	426.67
turbo2	1088.00	13.3%	0.032631	96.79%	0.000203	0.073833	0.111765	0.261607	0.838113	43.25%	4.735642	92.033% +/- 0.075%	698.21	402.96

12. q6_0 Follow-Up Benchmarks

These runs used pre-release v0.3.0 of BeeLlama, which added the q6_0 cache mode. New modes tested: symmetric q6_0; asymmetric q6_0 K paired with q5_1, q5_0, q4_1, q4_0, turbo4, and turbo3_tcq V; plus q8_0-q6_0, q8_0-turbo4, and q5_0-turbo4. The KL tables also include re-runs of the existing q8_0, q5_0, q4_0, and turbo3_tcq rows as a control group to verify stability after the build change. All control results matched within noise.

12.1 Perplexity

Q5_K_S 64k

Cache	KV cache (MiB)	Size vs bf16	Median PPL	Precision vs bf16	PPL +/-	Same top p	Tok/s	Elapsed (s)
q6_0	1664.00	40.6%	5.4778	100.04%	0.03465	97.890% +/- 0.040%	852.96	336.74

IQ4_XS 64k

Cache	KV cache (MiB)	Size vs bf16	Median PPL	Precision vs bf16	PPL +/-	Same top p	Tok/s	Elapsed (s)
q6_0	1664.00	40.6%	5.5171	100.00%	0.03500	98.878% +/- 0.029%	922.50	311.45

IQ4_XS 128k

Cache	KV cache (MiB)	Size vs bf16	Median PPL	Precision vs bf16	PPL +/-	Same top p	Tok/s	Elapsed (s)
q6_0	3328.00	40.6%	5.2729	99.99%	0.03272	98.855% +/- 0.029%	720.53	390.94

12.2 KL Divergence

Same format as §11.2. Sorted by KV cache size descending. Includes re-run control rows for staple symmetric pairs (q8_0, q5_0, q4_0, turbo3_tcq) alongside the new entries.

Q5_K_S 64k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
q8_0	2176.00	53.1%	0.002358	99.80%	0.000139	0.004179	0.006550	0.018569	0.078548	94.62%	13.849484	97.935% +/- 0.039%	849.29	338.22
q8_0-q6_0	1920.00	46.9%	0.002499	99.79%	0.000184	0.004295	0.006708	0.019381	0.081616	94.33%	17.816196	97.942% +/- 0.039%	848.78	338.40
q6_0	1664.00	40.6%	0.002614	99.78%	0.000180	0.004426	0.006949	0.020078	0.090800	93.47%	14.112586	97.890% +/- 0.040%	845.96	339.52
q8_0-turbo4	1616.00	39.5%	0.003561	99.68%	0.000215	0.006518	0.009834	0.026426	0.103041	92.33%	23.102724	97.460% +/- 0.043%	838.90	342.38
q6_0-q5_1	1600.00	39.1%	0.002781	99.76%	0.000228	0.004682	0.007348	0.020998	0.090447	93.50%	23.770491	97.913% +/- 0.039%	846.24	339.41
q6_0-q5_0	1536.00	37.5%	0.002820	99.76%	0.000209	0.004748	0.007457	0.021883	0.092682	93.29%	23.186867	97.788% +/- 0.041%	846.86	339.16
q6_0-q4_1	1472.00	35.9%	0.003312	99.71%	0.000232	0.005755	0.008847	0.024387	0.104582	92.19%	23.244659	97.605% +/- 0.042%	848.42	338.54
q5_0	1408.00	34.4%	0.002826	99.76%	0.000139	0.005238	0.008182	0.022704	0.094147	93.16%	14.572817	97.717% +/- 0.041%	846.21	339.45
q6_0-q4_0	1408.00	34.4%	0.003288	99.71%	0.000129	0.006096	0.009294	0.025456	0.111566	91.55%	10.711100	97.524% +/- 0.043%	848.24	338.61
q6_0-turbo4	1360.00	33.2%	0.003748	99.66%	0.000224	0.006642	0.009997	0.026902	0.107377	91.93%	16.445103	97.465% +/- 0.043%	837.77	342.84
q6_0-turbo3_tcq	1248.00	30.5%	0.005379	99.50%	0.000247	0.009906	0.014556	0.037285	0.154680	87.68%	19.739548	96.922% +/- 0.048%	819.23	350.60
q5_0-turbo4	1232.00	30.1%	0.003812	99.66%	0.000176	0.007068	0.010735	0.028203	0.112249	91.49%	17.032024	97.371% +/- 0.044%	837.52	342.95
q4_0	1152.00	28.1%	0.004717	99.57%	0.000257	0.008445	0.012921	0.034304	0.141288	88.87%	17.415379	97.117% +/- 0.046%	850.09	337.90
turbo3_tcq	832.00	20.3%	0.007978	99.24%	0.000267	0.015663	0.023628	0.058286	0.227104	81.56%	20.517471	96.265% +/- 0.052%	795.23	361.21

IQ4_XS 64k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
q8_0	2176.00	53.1%	0.000685	99.94%	0.000111	0.000938	0.001436	0.004074	0.016983	98.73%	10.878341	98.975% +/- 0.028%	909.48	315.96
q8_0-q6_0	1920.00	46.9%	0.000659	99.94%	0.000093	0.001032	0.001578	0.004419	0.018670	98.56%	10.625672	98.906% +/- 0.029%	908.70	316.18
q6_0	1664.00	40.6%	0.000766	99.93%	0.000109	0.001179	0.001791	0.004762	0.020407	98.39%	11.368995	98.878% +/- 0.029%	906.47	316.96
q8_0-turbo4	1616.00	39.5%	0.001845	99.83%	0.000108	0.003311	0.004787	0.011551	0.046124	95.89%	8.488309	98.147% +/- 0.037%	898.63	319.72
q6_0-q5_1	1600.00	39.1%	0.000882	99.92%	0.000099	0.001431	0.002169	0.005746	0.021968	98.23%	10.728148	98.772% +/- 0.030%	906.67	316.89
q6_0-q5_0	1536.00	37.5%	0.000933	99.92%	0.000103	0.001519	0.002269	0.006044	0.023588	98.08%	10.475766	98.666% +/- 0.032%	906.68	316.89
q6_0-q4_1	1472.00	35.9%	0.001488	99.87%	0.000115	0.002593	0.003830	0.009763	0.037581	96.71%	10.889835	98.378% +/- 0.035%	908.06	316.40
q5_0	1408.00	34.4%	0.001284	99.88%	0.000131	0.002039	0.003125	0.008273	0.032387	97.22%	12.836069	98.526% +/- 0.033%	906.42	317.02
q6_0-q4_0	1408.00	34.4%	0.001555	99.85%	0.000113	0.002893	0.004200	0.010392	0.039601	96.52%	11.934310	98.279% +/- 0.036%	909.10	316.04
q6_0-turbo4	1360.00	33.2%	0.001933	99.82%	0.000118	0.003445	0.005001	0.012021	0.044722	96.02%	8.842097	98.076% +/- 0.038%	896.11	320.62
q6_0-turbo3_tcq	1248.00	30.5%	0.003412	99.67%	0.000121	0.006702	0.009642	0.022886	0.089874	91.78%	9.695003	97.394% +/- 0.044%	874.99	328.36
q5_0-turbo4	1232.00	30.1%	0.002122	99.80%	0.000131	0.003913	0.005769	0.013995	0.052315	95.30%	11.286289	97.977% +/- 0.039%	895.90	320.70
q4_0	1152.00	28.1%	0.002748	99.74%	0.000141	0.005244	0.008011	0.019910	0.069373	93.69%	12.890539	97.777% +/- 0.041%	911.14	315.38
turbo3_tcq	832.00	20.3%	0.006038	99.41%	0.000127	0.012714	0.019058	0.046219	0.172480	84.51%	9.415985	96.569% +/- 0.050%	847.47	339.08

IQ4_XS 128k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
q8_0	4352.00	53.1%	0.000478	99.95%	0.000006	0.000978	0.001493	0.003968	0.015277	98.49%	0.479234	98.959% +/- 0.028%	705.23	399.42
q8_0-q6_0	3840.00	46.9%	0.000530	99.95%	0.000006	0.001093	0.001652	0.004335	0.017408	98.28%	0.468998	98.900% +/- 0.029%	703.96	400.14
q6_0	3328.00	40.6%	0.000589	99.94%	0.000008	0.001212	0.001852	0.004706	0.019175	98.11%	0.573200	98.855% +/- 0.029%	701.36	401.62
q8_0-turbo4	3232.00	39.5%	0.001554	99.84%	0.000014	0.003335	0.004810	0.011473	0.041006	95.99%	0.633252	98.171% +/- 0.037%	694.04	405.86
q6_0-q5_1	3200.00	39.1%	0.000703	99.93%	0.000011	0.001452	0.002180	0.005544	0.021821	97.85%	1.145905	98.740% +/- 0.031%	701.05	401.80
q6_0-q5_0	3072.00	37.5%	0.000752	99.92%	0.000013	0.001552	0.002311	0.005872	0.022227	97.81%	1.083445	98.700% +/- 0.031%	701.69	401.43
q6_0-q4_1	2944.00	35.9%	0.001191	99.88%	0.000009	0.002583	0.003791	0.008875	0.031032	96.95%	0.431630	98.412% +/- 0.035%	703.37	400.48
q5_0	2816.00	34.4%	0.000928	99.91%	0.000009	0.002009	0.003051	0.007595	0.026940	97.35%	0.495043	98.533% +/- 0.033%	700.96	401.85
q6_0-q4_0	2816.00	34.4%	0.001317	99.87%	0.000010	0.002875	0.004183	0.009784	0.031863	96.87%	0.504899	98.275% +/- 0.036%	704.62	399.76
q6_0-turbo4	2720.00	33.2%	0.001590	99.84%	0.000016	0.003435	0.004979	0.011859	0.039001	96.18%	1.399161	98.063% +/- 0.038%	691.93	407.09
q6_0-turbo3_tcq	2496.00	30.5%	0.003238	99.68%	0.000031	0.006922	0.009933	0.023271	0.087341	91.64%	2.346832	97.388% +/- 0.044%	673.13	418.46
q5_0-turbo4	2464.00	30.1%	0.001809	99.82%	0.000023	0.003891	0.005742	0.013425	0.047328	95.38%	2.025323	98.004% +/- 0.039%	691.62	407.28
q4_0	2304.00	28.1%	0.002264	99.77%	0.000015	0.005074	0.007743	0.018933	0.059927	94.19%	0.455715	97.803% +/- 0.040%	707.36	398.21
turbo3_tcq	1664.00	20.3%	0.005708	99.43%	0.000045	0.012748	0.019441	0.045320	0.150144	86.06%	2.079510	96.591% +/- 0.050%	648.81	434.15

13. q6_0 Follow-Up Analysis

The q6_0 follow-up does not change the shape of the KV-cache ladder. It changes two rungs: q8_0 / q6_0 becomes the better high-end preset, and q6_0 / q5_0 gets added as an optional step between normal q5 and q8 K. Everything else in the q6 sweep is mostly useful as a guardrail: it shows where the extra K bits stop helping, where V gets too weak, and why turbo4 still stays out of the normal recommendations.

13.1 The q6 Slot Is q6 K + q5 V

Symmetric q6_0 is a useful measurement row, but it is not the row the ladder wants. On Q5_K_S 64k, it scores 93.47% at the 99.9% tail, only 0.31 points above the same-run q5_0 control at 93.16%. It costs 40.6% of bf16 KV, so the extra size does not buy much on that run.

The better trade is q6_0 / q5_0. It keeps most of the q6 K benefit, drops the footprint to 37.5%, and is the first q6_0 row that makes sense as an actual preset. But on Q5_K_S the gain over q5_0 / q5_0 is almost a tie: 93.29% vs 93.16% at the 99.9% tail, with mean KLD also nearly tied, 0.002820 vs 0.002826. On this run alone, the benefit is not obvious.

The reason to keep it is the lower-weight-precision run. On IQ4_XS, q6_0 / q5_0 opens a clearer gap over q5_0 / q5_0: 98.08% vs 97.22% at 64k, and 97.81% vs 97.35% at 128k. This does not prove a full mechanism for why the gap changes, but it is enough for a recommendation: if the weights are already compressed and you want more KV headroom than q5 gives, q6 K is the next useful spend.

K / V	Size vs bf16	Q5_K_S 64k 99.9%	IQ4_XS 64k 99.9%	IQ4_XS 128k 99.9%	Read
`q8_0 / q8_0`	53.1%	94.62%	98.73%	98.49%	Fidelity/validation tier
`q8_0 / q6_0`	46.9%	94.33%	98.56%	98.28%	New high-end recommendation
`q6_0 / q6_0`	40.6%	93.47%	98.39%	98.11%	Measurement row, not worth a ladder slot
`q6_0 / q5_0`	37.5%	93.29%	98.08%	97.81%	New optional headroom preset
`q5_0 / q5_0`	34.4%	93.16%	97.22%	97.35%	Still the normal quality preset

13.2 The q5 V Range Is Cheap, But Not Free

The easiest trap in this follow-up is the Q5_K_S tail wobble around q5 V. Symmetric q6_0 scores 93.47%, q6_0 / q5_1 scores 93.50%, and q6_0 / q5_0 scores 93.29%. The 0.03-point "gain" from dropping V to q5_1 is not a result to build a recommendation around. Mean KLD moves the other way: 0.002614 for symmetric q6_0, 0.002781 with q5_1 V, and 0.002820 with q5_0 V.

IQ4_XS makes it less ambiguous. At 64k, the 99.9% tail goes 98.39% → 98.23% → 98.08% as V moves from q6_0 to q5_1 to q5_0. At 128k, it goes 98.11% → 97.85% → 97.81%. So the q5 V range is still a controlled loss. It just happens to be small enough that one Q5 tail column can wobble by more than the real difference between adjacent q5 V tiers.
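The wobble argument above can be checked mechanically. The sketch below takes the 99.9% precision values for q6_0 K as V steps from q6_0 to q5_1 to q5_0 (copied from the §12.2 tables) and reports the cost of each step. The helper names `step_costs` and `is_controlled_loss` are illustrative only, not part of any tool used for the benchmark.

```python
# 99.9% precision vs bf16 with K fixed at q6_0 and V stepping q6_0 -> q5_1 -> q5_0.
# Values are copied from the §12.2 tables.
TAIL_BY_CONFIG = {
    "Q5_K_S 64k": [93.47, 93.50, 93.29],
    "IQ4_XS 64k": [98.39, 98.23, 98.08],
    "IQ4_XS 128k": [98.11, 97.85, 97.81],
}


def step_costs(tails):
    """Points of tail precision lost at each step down the V ladder."""
    return [round(hi - lo, 2) for hi, lo in zip(tails, tails[1:])]


def is_controlled_loss(tails):
    """True when every step down the ladder loses precision (no inversion)."""
    return all(cost > 0 for cost in step_costs(tails))


if __name__ == "__main__":
    for config, tails in TAIL_BY_CONFIG.items():
        print(config, step_costs(tails), is_controlled_loss(tails))
```

Only the Q5_K_S column contains an inversion (a negative step cost), which is the 0.03-point "gain" the text warns against reading as a result.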

The useful boundary is below q5. q6_0 / q4_1 drops to 92.19% on Q5_K_S, 1.10 points below q6_0 / q5_0. On IQ4_XS, the same step costs 1.37 points at 64k and 0.86 points at 128k. q6_0 / q4_0 is even easier to reject: it has the same 34.4% footprint as q5_0 / q5_0, but loses on all three follow-up configs.

K / V	Size vs bf16	Q5_K_S 64k 99.9%	IQ4_XS 64k 99.9%	IQ4_XS 128k 99.9%	Decision
`q6_0 / q5_1`	39.1%	93.50%	98.23%	97.85%	Too close to `q6_0 / q6_0` for the size
`q6_0 / q5_0`	37.5%	93.29%	98.08%	97.81%	Keep: best q6 K trade-off
`q6_0 / q4_1`	35.9%	92.19%	96.71%	96.95%	Reject: crosses the V cliff
`q6_0 / q4_0`	34.4%	91.55%	96.52%	96.87%	Reject: same size as `q5_0 / q5_0`, worse result
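The "same size, worse result" rejection is a dominance check: one row is never larger and never less precise than another. A minimal sketch follows, using rows from the tables above. The `Row` type and `dominates` function are hypothetical helpers for illustration, and the rule itself is one reading of how the rows were rejected, not a documented procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    size_pct: float  # KV cache size as % of bf16
    tails: tuple     # 99.9% precision: (Q5_K_S 64k, IQ4_XS 64k, IQ4_XS 128k)


def dominates(a, b):
    """True if `a` is no larger than `b` and no less precise on every config,
    with a strict win on size or on at least one config."""
    pairs = list(zip(a.tails, b.tails))
    no_worse = a.size_pct <= b.size_pct and all(x >= y for x, y in pairs)
    strictly_better = a.size_pct < b.size_pct or any(x > y for x, y in pairs)
    return no_worse and strictly_better


q5_q5 = Row("q5_0 / q5_0", 34.4, (93.16, 97.22, 97.35))
q6_q5 = Row("q6_0 / q5_0", 37.5, (93.29, 98.08, 97.81))
q6_q4_1 = Row("q6_0 / q4_1", 35.9, (92.19, 96.71, 96.95))
q6_q4_0 = Row("q6_0 / q4_0", 34.4, (91.55, 96.52, 96.87))

if __name__ == "__main__":
    for other in (q6_q5, q6_q4_1, q6_q4_0):
        print(f"{q5_q5.name} dominates {other.name}: {dominates(q5_q5, other)}")
```

Under this check q5_0 / q5_0 dominates both sub-q5 V rows, while q6_0 / q5_0 is neither dominated by it nor dominates it, which matches its status as an optional rung rather than a replacement.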

13.3 turbo4 Still Loses the Tie-Breaker

turbo4 is still not a normal V tier. The confusing row is q6_0 / turbo4: on Q5_K_S, it scores 91.93% at the 99.9% tail, slightly ahead of q6_0 / q4_0 at 91.55%. If that were the only column, it would look like turbo4 V had beaten q4_0 V.

The rest of the row rejects that reading. Mean KLD is worse with turbo4, 0.003748 vs 0.003288, and both IQ4_XS runs put turbo4 V behind q4_0 V at the tail: 96.02% vs 96.52% at 64k, then 96.18% vs 96.87% at 128k. This is the kind of case where the 99.9% tail is still the right warning metric, but mean KLD stops one lucky tail from becoming a recommendation.

The q8 and q5 turbo4 pairs tell the same story. q8_0 / turbo4 gets a tiny Q5_K_S tail edge over q8_0 / q4_0 from the older table, 92.33% vs 92.18%, but loses on IQ4_XS. q5_0 / turbo4 is smaller than q5_0 / q4_0, but the q4 V row is the better recommendation whenever that extra 1.2 points of bf16 KV size fits.

turbo3_tcq V is below the precision floor for these mixed presets. At q6_0 K, q6_0 / turbo3_tcq lands at 87.68% on Q5_K_S, below the same-run symmetric q4_0 control at 88.87%. That does not remove symmetric turbo3_tcq from the extreme-compression ladder, but it does mean turbo V should not be mixed into ordinary q5/q6/q8 precision rows.

13.4 Updated Preset Ladder

The preset ladder only needs two edits. At the top, q8_0 / q6_0 replaces q8_0 / q5_1. The Q5_K_S gain is small, 94.33% vs 94.21% at the 99.9% tail, but the size cost is also small: 46.9% vs 45.3% of bf16 KV. At this tier, paying 1.6 points of KV size for the stronger V side is cleaner than keeping q5_1 as the main fidelity preset.

In the middle, q6_0 / q5_0 is added as an optional headroom row. It is not a new default. If q5_0 / q5_0 already gives enough quality, stay there. Use q6_0 / q5_0 when you want a stronger K side and cannot justify the jump to q8_0 K.

K / V	% of bf16 size	99.9% precision (Q5_K_S 64k)	What it is for
`bf16 / bf16`	100.0	100.00%	Preserving full quality
`q8_0 / q8_0`	53.1	94.62%	Validation and blame-isolation mode
`q8_0 / q6_0`	46.9	94.33%	Recommended high-end preset
`q8_0 / q5_1`	45.3	94.21%	Fallback if `q6_0` V is unavailable
`q8_0 / q5_0`	43.8	93.69%	If the high-end rows miss the fit by a narrow margin
`q6_0 / q5_0`	37.5	93.29%	Optional headroom tier between q5 and q8 K
`q5_0 / q5_0`	34.4	93.16%	Normal quality preset
`q5_0 / q4_1`	32.8	92.65%	Best default if VRAM-constrained
`q5_0 / q4_0`	31.3	91.39%	If `q5_0 / q4_1` misses the fit by a narrow margin
`q4_0 / q4_0`	28.1	88.87%	Memory saving with visible precision loss
`q4_0 / turbo3_tcq`	24.2	84.93%	Smaller than q4, cleaner than symmetric `turbo3_tcq`
`turbo3_tcq / turbo3_tcq`	20.3	81.56%	Viable extreme-compression mode
`turbo2_tcq / turbo2_tcq`	14.1	54.38%	Last resort: not for code, JSON, math, or tool calls

The rejected q6 rows are not close calls. q6_0 / q5_1 saves too little compared with symmetric q6 and is larger than the useful q6_0 / q5_0 row. q6_0 / q4_1 crosses the V cliff. q6_0 / q4_0 has the same size as q5_0 / q5_0 and loses to it. q6_0 / turbo4 has one attractive Q5_K_S tail ordering, but mean KLD and both IQ4_XS runs reject it. q6_0 / turbo3_tcq belongs below the precision floor, not in the normal ladder.

The preset rule is simple: use q8_0 / q6_0 when you want the high-end cache tier, use q6_0 / q5_0 only when q5_0 / q5_0 feels too tight, and do not chase sub-q5 V just because K is stronger. The q6 follow-up adds one useful middle rung, not a new philosophy for the whole ladder.

14. bf16-K Follow-Up Benchmarks

This follow-up keeps K at bf16 and walks V through the same cache tiers used elsewhere in the article. The point is narrow: these rows isolate how much damage comes from V quantization when the K side is left at the reference format.

Same format as §11.2. KL precision and 99.9% precision are measured against the bf16 / bf16 baseline for each weight/context configuration. Rows are sorted by total KV cache size descending.

Q5_K_S 64k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
bf16	4096.00	100.0%	0.000375	100.00%	0.000058	0.000568	0.001693	0.005234	0.023258	100.00%	7.374046	99.647% +/- 0.016%	839.21	342.35
bf16-q8_0	3136.00	76.6%	0.002475	99.79%	0.000171	0.004209	0.006629	0.019275	0.079827	94.50%	14.729765	97.991% +/- 0.039%	850.62	337.75
bf16-q6_0	2880.00	70.3%	0.002393	99.80%	0.000101	0.004302	0.006788	0.019416	0.078853	94.59%	6.564770	97.961% +/- 0.039%	848.99	338.40
bf16-q5_1	2816.00	68.8%	0.002440	99.79%	0.000090	0.004567	0.007171	0.020056	0.096805	92.91%	9.260141	97.903% +/- 0.040%	848.20	338.72
bf16-q5_0	2752.00	67.2%	0.002491	99.79%	0.000128	0.004599	0.007110	0.020664	0.089753	93.57%	13.079553	97.829% +/- 0.040%	848.53	338.58
bf16-q4_1	2688.00	65.6%	0.003221	99.72%	0.000208	0.005661	0.008682	0.023404	0.105923	92.07%	20.044716	97.637% +/- 0.042%	853.30	336.69
bf16-q4_0	2624.00	64.1%	0.003339	99.70%	0.000187	0.005952	0.009136	0.025430	0.110166	91.68%	19.452881	97.594% +/- 0.042%	854.17	336.35
bf16-turbo4	2576.00	62.9%	0.003549	99.68%	0.000163	0.006532	0.009849	0.026539	0.101054	92.52%	11.934117	97.473% +/- 0.043%	843.17	340.74
bf16-turbo3_tcq	2464.00	60.2%	0.005377	99.50%	0.000281	0.009815	0.014509	0.036872	0.146175	88.43%	23.307222	96.962% +/- 0.047%	823.83	348.74

IQ4_XS 64k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
bf16	4096.00	100.0%	0.000097	100.00%	0.000020	0.000186	0.000398	0.001062	0.004152	100.00%	2.345056	99.776% +/- 0.013%	910.62	315.50
bf16-q8_0	3136.00	76.6%	0.000587	99.95%	0.000092	0.000924	0.001408	0.004018	0.016148	98.81%	11.615892	98.952% +/- 0.028%	909.80	315.78
bf16-q6_0	2880.00	70.3%	0.000636	99.95%	0.000079	0.001029	0.001556	0.004148	0.017005	98.72%	8.440266	98.875% +/- 0.029%	908.22	316.33
bf16-q5_1	2816.00	68.8%	0.000757	99.93%	0.000074	0.001283	0.001921	0.005045	0.020206	98.41%	8.462649	98.798% +/- 0.030%	907.39	316.62
bf16-q5_0	2752.00	67.2%	0.000873	99.92%	0.000101	0.001384	0.002088	0.005688	0.022394	98.19%	10.605267	98.764% +/- 0.031%	908.46	316.25
bf16-q4_1	2688.00	65.6%	0.001357	99.87%	0.000095	0.002455	0.003607	0.008977	0.034741	96.99%	10.624268	98.392% +/- 0.035%	913.53	314.50
bf16-q4_0	2624.00	64.1%	0.001459	99.86%	0.000073	0.002756	0.004021	0.010121	0.039676	96.51%	6.619313	98.367% +/- 0.035%	914.07	314.31
bf16-turbo4	2576.00	62.9%	0.001770	99.83%	0.000097	0.003286	0.004741	0.011679	0.043127	96.18%	8.782248	98.144% +/- 0.037%	901.27	318.77
bf16-turbo3_tcq	2464.00	60.2%	0.003256	99.68%	0.000092	0.006574	0.009463	0.022278	0.084448	92.28%	8.791702	97.398% +/- 0.044%	879.05	326.83

IQ4_XS 128k

Cache	KV cache (MiB)	Size vs bf16	Mean KLD	Precision vs bf16	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	99.9% precision vs bf16	Maximum KLD	Same top p	Tok/s	Elapsed (s)
bf16	8192.00	100.0%	0.000000	100.00%	0.000000	0.000015	0.000023	0.000037	0.000051	100.00%	0.000067	99.995% +/- 0.002%	705.78	399.10
bf16-q8_0	6272.00	76.6%	0.000479	99.95%	0.000011	0.000971	0.001473	0.003985	0.014957	98.52%	1.255583	98.953% +/- 0.028%	705.73	399.13
bf16-q6_0	5760.00	70.3%	0.000528	99.95%	0.000011	0.001080	0.001640	0.004261	0.016915	98.33%	1.246464	98.920% +/- 0.029%	703.90	400.17
bf16-q5_1	5632.00	68.8%	0.000643	99.94%	0.000007	0.001318	0.001981	0.005192	0.019604	98.06%	0.459653	98.803% +/- 0.030%	703.75	400.25
bf16-q5_0	5504.00	67.2%	0.000680	99.93%	0.000007	0.001410	0.002094	0.005333	0.020995	97.93%	0.448673	98.758% +/- 0.031%	703.91	400.16
bf16-q4_1	5376.00	65.6%	0.001145	99.89%	0.000011	0.002439	0.003558	0.008389	0.030617	96.99%	0.856124	98.425% +/- 0.034%	709.30	397.13
bf16-q4_0	5248.00	64.1%	0.001295	99.87%	0.000016	0.002773	0.004027	0.009434	0.034897	96.58%	1.481670	98.347% +/- 0.035%	710.30	396.57
bf16-turbo4	5152.00	62.9%	0.001533	99.85%	0.000013	0.003304	0.004779	0.011221	0.041510	95.94%	0.505207	98.119% +/- 0.038%	697.20	404.02
bf16-turbo3_tcq	4928.00	60.2%	0.003191	99.68%	0.000027	0.006856	0.009858	0.023102	0.086196	91.75%	1.071732	97.420% +/- 0.044%	678.44	415.19

15. bf16-K Follow-Up Analysis

15.1 What The bf16-K Rows Test

The useful part of this follow-up is not a new preset ladder, but a V-side isolation check. These pairs keep K at the reference format and make V do all the compression work, so they answer a cleaner question: once attention scores have the strongest K side available in this benchmark, where does V start to become the limiting error?

That also makes these rows awkward as recommendations. A bf16 K side is expensive: at 64k it already takes 2048 MiB before V is counted. Even the smallest row in this set, bf16 / turbo3_tcq, still uses 60.2% of the full bf16 cache. The point of these rows is therefore not to save memory but to show which V tiers still matter after K has stopped being the bottleneck.
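The size arithmetic behind these rows can be reproduced from the symmetric rows alone. The sketch below assumes K and V each take half of a symmetric cache, which is consistent with every asymmetric row in the tables; `pair_mib` and `pct_of_bf16` are illustrative helper names.

```python
# Symmetric K+V cache sizes at 64k context, in MiB, copied from the tables.
SYMMETRIC_MIB_64K = {
    "bf16": 4096.0,
    "q8_0": 2176.0,
    "q6_0": 1664.0,
    "q5_0": 1408.0,
    "turbo3_tcq": 832.0,
}


def pair_mib(k_mode, v_mode, table=SYMMETRIC_MIB_64K):
    """Total cache size for a K / V pair.

    Assumption: K and V each occupy half of the matching symmetric row.
    """
    return table[k_mode] / 2 + table[v_mode] / 2


def pct_of_bf16(mib, table=SYMMETRIC_MIB_64K):
    """Size as a percentage of the symmetric bf16 cache, rounded like the tables."""
    return round(100 * mib / table["bf16"], 1)


if __name__ == "__main__":
    for k, v in [("bf16", "turbo3_tcq"), ("q8_0", "q6_0"), ("q6_0", "q5_0")]:
        mib = pair_mib(k, v)
        print(f"{k} / {v}: {mib:.2f} MiB, {pct_of_bf16(mib)}% of bf16")
```

With bf16 K, the 2048 MiB K half is a fixed floor, so even a very small V tier cannot bring the pair below 50% of the full cache.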

15.2 q8 V Does Not Earn Full K

bf16 / q8_0 is the cleanest non-recommendation in the table. On Q5_K_S 64k, it scores 94.50% at the 99.9% tail, which is the same broad quality band as symmetric q8_0 / q8_0 at 94.62% and q8_0 / q6_0 at 94.33%. The size is the problem: bf16 / q8_0 uses 76.6% of the full cache, while q8_0 / q8_0 uses 53.1% and q8_0 / q6_0 uses 46.9%.

K / V	Size vs bf16	Q5_K_S 64k 99.9%	Read
`bf16 / q8_0`	76.6%	94.50%	Diagnostic row, not a preset
`q8_0 / q8_0`	53.1%	94.62%	Same tail band at much lower size
`q8_0 / q6_0`	46.9%	94.33%	Recommended high-end preset from §13

The IQ4_XS rows do not rescue the trade. bf16 / q8_0 scores 98.81% at 64k and 98.52% at 128k, but it still costs 76.6% of the full cache. If q8 K is already enough to reach the high-end tail band, keeping K at 16 bits mostly buys comfort for the benchmarker, not a useful user-facing preset.

15.3 q6 V Is The High-End Ceiling

The q6 row shows where the V side stops needing more precision. On Q5_K_S, bf16 / q6_0 scores 94.59%, effectively tied with bf16 / q8_0 at 94.50%. On IQ4_XS, dropping V from q8 to q6 costs 0.09 points at 64k and 0.19 points at 128k. That is small enough to treat q6 V as the ceiling for the practical high-end row.

K / V	Size vs bf16	Q5_K_S 64k	IQ4_XS 64k	IQ4_XS 128k
`bf16 / q8_0`	76.6%	94.50%	98.81%	98.52%
`bf16 / q6_0`	70.3%	94.59%	98.72%	98.33%
`q8_0 / q6_0`	46.9%	94.33%	98.56%	98.28%

The comparison that matters is bf16 / q6_0 against q8_0 / q6_0. The full-K row is 70.3% of bf16 KV, while the q8-K row is 46.9%. On Q5_K_S, that extra 23.4 points of cache size buys 0.26 points of tail precision. On IQ4_XS, it buys 0.16 points at 64k and 0.05 at 128k. That is a validation margin, not a preset margin.
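To make the "validation margin, not a preset margin" point concrete, the same comparison can be expressed as tail precision gained per point of cache size. This is a small sketch over the numbers quoted above; `trade_off` is a hypothetical helper, and the per-point ratio is an added convenience metric rather than something reported by the benchmark.

```python
def trade_off(larger, smaller):
    """Compare two rows given as (size_pct, tail_precision_pct).

    Returns (extra size in points, precision gained in points,
    precision gained per point of extra size).
    """
    extra_size = larger[0] - smaller[0]
    gain = larger[1] - smaller[1]
    return round(extra_size, 1), round(gain, 2), round(gain / extra_size, 3)


# (size vs bf16 in %, 99.9% precision vs bf16 in %), copied from the table above.
BF16_Q6 = {
    "Q5_K_S 64k": (70.3, 94.59),
    "IQ4_XS 64k": (70.3, 98.72),
    "IQ4_XS 128k": (70.3, 98.33),
}
Q8_Q6 = {
    "Q5_K_S 64k": (46.9, 94.33),
    "IQ4_XS 64k": (46.9, 98.56),
    "IQ4_XS 128k": (46.9, 98.28),
}

if __name__ == "__main__":
    for config in BF16_Q6:
        print(config, trade_off(BF16_Q6[config], Q8_Q6[config]))
```

In every configuration the full-K row returns roughly a hundredth of a precision point or less per point of cache size spent.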

15.4 q5 V Is The Practical Middle

The q5 rows are where V quantization becomes visible without falling out of the normal quality range. On Q5_K_S, bf16 / q5_0 scores 93.57%, about one point below bf16 / q6_0 at 94.59%. On IQ4_XS, the same step costs less: 0.53 points at 64k (98.72% → 98.19%) and 0.40 points at 128k (98.33% → 97.93%). That repeats the earlier pattern from §9 and §13: lower-weight-precision runs are less sensitive to the same cache quants.

K / V	Size vs bf16	Q5_K_S 64k	IQ4_XS 64k	IQ4_XS 128k
`bf16 / q6_0`	70.3%	94.59%	98.72%	98.33%
`bf16 / q5_0`	67.2%	93.57%	98.19%	97.93%
`q6_0 / q5_0`	37.5%	93.29%	98.08%	97.81%
`q5_0 / q5_0`	34.4%	93.16%	97.22%	97.35%

The full-K q5 rows are still too large to recommend directly. bf16 / q5_0 uses 67.2% of the cache, while q6_0 / q5_0 uses 37.5% and lands in the same practical band. The q5 V tier is useful, but with q6 or q5 K, not 16-bit K.

15.5 q4 V Is The Drop Full K Cannot Hide

q4 V shows the limit of the full-K diagnostic. On Q5_K_S, bf16 / q4_0 scores 91.68% and bf16 / q4_1 scores 92.07%. Those rows are larger than every normal q8/q6/q5 preset in the ladder, but they still sit below the q5 V range. Once V falls to q4, reference-format K can soften the loss but cannot make the row high end.

The IQ4_XS rows are less harsh but keep the same boundary. At 128k, bf16 / q5_0 is 97.93%, bf16 / q4_1 is 96.99%, and bf16 / q4_0 is 96.58%. That is not a collapse, but it is a clear step down at the same q5-to-q4 V boundary that showed up in the earlier asymmetric tables.

15.6 turbo V Stays Below The Normal Ladder

turbo4 looks better with bf16 K than it looked as a symmetric preset, but that is not enough to make it a normal recommendation. On Q5_K_S, bf16 / turbo4 scores 92.52%, above bf16 / q4_0 and bf16 / q4_1. The price is still 62.9% of the full cache, and the earlier sections already showed why turbo4 loses once other factors are included.

bf16 / turbo3_tcq is clearer. It scores 88.43% on Q5_K_S, 92.28% on IQ4_XS 64k, and 91.75% on IQ4_XS 128k. Full K helps compared with fully compressed turbo rows, but it does not turn turbo V into a q5/q6-quality tier. Turbo modes still belong in the extreme-compression part of the ladder, not in the ordinary precision presets.

15.7 What This Changes

This follow-up does not add any bf16-K presets; it is purely diagnostic. Holding K at bf16 shows the V-side ceiling, but it also makes the rows too large to compete with the q8/q6/q5 ladder.

Observation from bf16-K rows	Preset implication
`bf16 / q8_0` is q8-like in quality but 76.6% of `bf16` KV	Do not use it as a preset; `q8_0 / q8_0` or `q8_0 / q6_0` is cleaner
`bf16 / q6_0` is basically tied with `bf16 / q8_0`	Keep `q8_0 / q6_0` as the high-end recommendation
`bf16 / q5_0` lands close to the q6/q5 rows but at much larger size	Keep `q6_0 / q5_0` as optional headroom and `q5_0 / q5_0` as the normal quality preset
q4 and turbo V remain visibly lower even with reference-format K	Keep them below the normal ladder as memory-saving or extreme-compression options

V has a ceiling around q6, q5 is the practical middle, and below q5 the K side cannot fully bail the row out. That strengthens the §13 ladder: spend the high-end budget on q8_0 / q6_0, not on bf16 / q8_0 or bf16 / q6_0.

16. bf16 vs f16 Follow-Up Benchmarks

This follow-up runs a narrow Q5_K_S 32k Wikitext path using an f32 KL baseline. The point is to answer the smaller question left open by the earlier bf16-baseline tables: which 16-bit KV format, f16 or bf16, stays closer to f32?

The PPL rows compare symmetric f32, f16, and bf16 cache. The KL rows compare those same symmetric rows plus two diagnostic f32-K rows, f32 / f16 and f32 / bf16, which isolate the V side while keeping K at the reference format. Sizes are for total K+V cache at 32k context.

16.1 Perplexity

Q5_K_S 32k

Cache	KV cache (MiB)	Size vs f32	Median PPL	Precision vs f32	PPL +/-	Same top p	Tok/s	Elapsed (s)
f32 / f32	4096.00	100.0%	6.2211	100.00%	0.03986	99.997% +/- 0.001%	690.79	415.90
bf16 / bf16	2048.00	50.0%	6.2147	100.10%	0.03975	97.736% +/- 0.039%	762.98	376.55
f16 / f16	2048.00	50.0%	6.2384	99.72%	0.03999	97.710% +/- 0.039%	754.04	381.02
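
The "Precision vs f32" column in the PPL table is consistent with the reference perplexity divided by the row's perplexity, times 100. That reading is an assumption checked only against the rows above, and `ppl_precision` is a hypothetical helper name:

```python
F32_MEDIAN_PPL = 6.2211  # f32 / f32 reference row from the table above


def ppl_precision(row_ppl: float, ref_ppl: float = F32_MEDIAN_PPL) -> float:
    """Precision vs the reference in percent, assumed to be 100 * ref / row."""
    return 100.0 * ref_ppl / row_ppl


for name, ppl in [("bf16 / bf16", 6.2147), ("f16 / f16", 6.2384)]:
    print(f"{name}: {ppl_precision(ppl):.2f}%")
```

A value above 100% only means the row's median PPL landed below the f32 reference.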

16.2 KL Divergence

Same idea as §11.2, but measured against an f32 baseline instead of a bf16 baseline. Since the reference row has a mean KLD of zero, precision vs f32 is 100 * exp(-meanKLD). Rows are sorted by total KV cache size, descending.

Q5_K_S 32k

Cache	KV cache (MiB)	Size vs f32	Mean KLD	Precision vs f32	KLD +/-	90% KLD	95% KLD	99% KLD	99.9% KLD	Maximum KLD	Same top p	Tok/s	Elapsed (s)
f32 / f32	4096.00	100.0%	0.000000	100.00%	0.000000	0.000016	0.000023	0.000037	0.000051	0.000070	99.997% +/- 0.001%	697.71	411.77
f32 / bf16	3072.00	75.0%	0.012111	98.80%	0.000953	0.004531	0.007392	0.026860	1.544926	29.954960	97.827% +/- 0.038%	726.38	395.52
f32 / f16	3072.00	75.0%	0.015022	98.51%	0.001110	0.004961	0.008199	0.030432	2.224951	30.625841	97.735% +/- 0.039%	737.03	389.81
bf16 / bf16	2048.00	50.0%	0.013964	98.61%	0.001065	0.004516	0.007476	0.027545	1.798395	32.009716	97.736% +/- 0.039%	755.24	380.41
f16 / f16	2048.00	50.0%	0.014504	98.56%	0.001086	0.004949	0.008185	0.029429	2.314729	29.698303	97.710% +/- 0.039%	766.66	374.74
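
The precision column can be recomputed from the mean KLD column with the 100 * exp(-meanKLD) formula stated above. A short sketch using the table's own values, where `kld_precision` is an illustrative helper name:

```python
import math


def kld_precision(mean_kld: float) -> float:
    """Map a mean KL divergence to the 'Precision vs f32' percentage."""
    return 100.0 * math.exp(-mean_kld)


# Mean KLD values copied from the Q5_K_S 32k table above.
mean_kld_by_cache = {
    "f32 / bf16": 0.012111,
    "f32 / f16": 0.015022,
    "bf16 / bf16": 0.013964,
    "f16 / f16": 0.014504,
}

for cache, mean_kld in mean_kld_by_cache.items():
    print(f"{cache}: {kld_precision(mean_kld):.2f}%")
```

Because the formula uses only the mean, two rows with similar precision can still differ in the 99.9% and maximum KLD columns.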

17. bf16 vs f16 Follow-Up Analysis

17.1 Why The f32 Baseline Matters

The earlier tables used bf16 as the KL reference because it was the practical uncompressed cache format. That was fine for judging q8, q6, q5, q4, and turbo rows, but it made the f16 vs bf16 question awkward: if bf16 is the baseline, then bf16 automatically gets the cleanest comparison. The f32 follow-up removes that tilt: both 16-bit formats are measured as deviations from the same f32 logit file.

17.2 bf16 Beats f16 At The Same Size

The practical symmetric comparison is direct. bf16 / bf16 and f16 / f16 both use 2048 MiB at 32k context, exactly half of the f32 / f32 cache. On PPL, bf16 lands at 6.2147 while f16 lands at 6.2384. On KL, bf16 has lower mean KLD, 0.013964 vs 0.014504, and a much better 99.9% tail, 1.798395 vs 2.314729.

The V-side isolation rows point the same way. With K held at f32, f32 / bf16 gets mean KLD 0.012111 and 99.9% KLD 1.544926. f32 / f16 gets mean KLD 0.015022 and 99.9% KLD 2.224951. That rules out a softer reading where symmetric bf16 merely got lucky as a whole-cache row: on this benchmark, bf16 V is also cleaner than f16 V.

17.3 Precision Wins, Speed Is Mostly A Wash

There is no useful VRAM trade-off between f16 and bf16. Both are 16-bit cache formats, both use the same 2048 MiB at 32k, and both cut the f32 cache in half. If the hardware supports bf16 properly, the precision result is the deciding factor.

The elapsed times do not give f16 a strong counterargument. In the PPL run, bf16 / bf16 finished in 376.55 seconds and f16 / f16 in 381.02 seconds. In the KL run, bf16 / bf16 took 380.41 seconds and f16 / f16 took 374.74 seconds. The mixed rows show the same small wobble: f32 / bf16 took 395.52 seconds and f32 / f16 took 389.81 seconds. That is not enough to justify taking the worse KLD row as a speed preset.
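
To put the wobble in relative terms, the elapsed times can be turned into percentage gaps. A small sketch using the numbers from the tables, where `bf16_time_delta_pct` and the run labels are illustrative names:

```python
def bf16_time_delta_pct(bf16_s: float, f16_s: float) -> float:
    """Percent by which the bf16 row ran longer than the f16 row (negative = bf16 faster)."""
    return 100.0 * (bf16_s - f16_s) / f16_s


# (bf16 elapsed, f16 elapsed) in seconds, copied from the tables above.
runs = {
    "PPL, symmetric": (376.55, 381.02),
    "KL, symmetric": (380.41, 374.74),
    "KL, f32 K": (395.52, 389.81),
}

for label, (bf16_s, f16_s) in runs.items():
    print(f"{label}: {bf16_time_delta_pct(bf16_s, f16_s):+.2f}%")
```

The sign flips between the PPL and KL runs and every gap stays under 2%, which is why these rows read as noise and not as a speed ranking.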

17.4 What This Changes

Observation from f32-baseline rows	KV cache implication
`bf16 / bf16` beats `f16 / f16` on PPL, mean KLD, and 99.9% KLD at the same cache size	Use `bf16` as the normal 16-bit KV format when the backend supports it
`f32 / bf16` also beats `f32 / f16`	The bf16 edge is not only a whole-cache artifact, it shows up when V is isolated too
`f16` has no VRAM advantage over `bf16` and no stable elapsed-time win in this run	Do not prefer `f16` for KV cache unless a specific backend handles `bf16` poorly

This resolves the practical f16 vs bf16 KV-cache question for this benchmark: bf16 is the better 16-bit default. It has the same memory footprint as f16, no clear speed penalty in these runs, and lower divergence from the f32 baseline.
