Home › Projects › BeeLlama.cpp
A performance-focused llama.cpp fork with DFlash for faster generation and longer context.

BeeLlama.cpp (or just Bee) is a performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools and server flow, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, and reasoning-loop protection, with full multimodal support.
Not quite a pegasus, but close enough.
--spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a ring buffer, the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.--spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline, the fringe alternative maps acceptance-rate bands to draft depth.--mmproj is active, the server keeps DFlash available for text generation. The model can be fully offloaded to CPU with no problems to reduce VRAM pressure.turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning from 4x to 7.5x compression, with TCQ types offering good precision for their size. Set independently with --cache-type-k and --cache-type-v.TQ3_1S and TQ4_1S are available through llama-quantize with non-conflicting GGML type IDs 47 and 48. These are GGUF weight formats, not KV-cache types; backend acceleration should be validated before claiming it for a deployment.force-close, with --reasoning-loop-window and --reasoning-loop-max-period tuning available.--spec-draft-temp enables rejection-sampling drafter behavior. Activates when both draft and target temperature exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.--spec-branch-budget adds branch nodes beyond the main draft path with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much work in progress!--spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model.For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.
TurboQuant (WHT-based scalar quantization) and the TQ3_1S / TQ4_1S weight formats originate from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).
DFlash is strongest on structured, repetitive generation: code, tests, boilerplate, JSON-like formats, and other low-entropy continuations. Open-ended prose is much less predictable, so gains are smaller.
Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.
| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 37.2 tok/s | 37.2 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 163.9 tok/s | 181.9 tok/s | 4.40x | 67.7% / 89.2% |
| Task store module | MTP | ~1K tok | 69.3 tok/s | 69.6 tok/s | 1.86x | 92.0% / 73.3% |
| KV report module | Baseline | ~1K tok | 34.6 tok/s | 36.5 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 157.7 tok/s | 162.5 tok/s | 4.56x | 58.8% / 88.9% |
| KV report module | MTP | ~1K tok | 67.3 tok/s | 68.1 tok/s | 1.94x | 89.3% / 73.0% |
| Doubly-linked list | Baseline | ~4K tok | 36.8 tok/s | 36.9 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~4K tok | 130.8 tok/s | 154.1 tok/s | 3.56x | 50.4% / 86.8% |
| Doubly-linked list | MTP | ~4K tok | 66.3 tok/s | 68.0 tok/s | 1.80x | 87.8% / 72.5% |
| Prompt processing | Baseline | ~20K tok | 1229.5 tok/s | 1229.5 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~20K tok | 1214.4 tok/s | 1221.7 tok/s | 0.99x | N/A |
| Prompt processing | MTP | ~20K tok | 1162.6 tok/s | 1164.7 tok/s | 0.95x | N/A |
| Multi-turn coding | Baseline | ~28K tok | 33.3 tok/s | 33.3 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~30K tok | 64.6 tok/s | 65.4 tok/s | 1.94x | 24.9% / 72.9% |
| Multi-turn coding | MTP | ~34K tok | 56.5 tok/s | 56.5 tok/s | 1.70x | 71.9% / 68.3% |
Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.
| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 36.1 tok/s | 36.1 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 177.8 tok/s | 182.0 tok/s | 4.93x | 65.7% / 90.0% |
| KV report module | Baseline | ~1K tok | 35.9 tok/s | 36.0 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 154.3 tok/s | 162.8 tok/s | 4.29x | 55.7% / 88.6% |
| Doubly-linked list | Baseline | ~1.9K tok | 36.0 tok/s | 36.0 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~1.9K tok | 116.6 tok/s | 127.3 tok/s | 3.24x | 44.5% / 84.9% |
| Prompt processing | Baseline | ~24K tok | 1021.3 tok/s | 1021.3 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~24K tok | 954.5 tok/s | 954.9 tok/s | 0.93x | N/A |
| Multi-turn coding | Baseline | ~12K tok | 34.8 tok/s | 34.8 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~12K tok | 60.6 tok/s | 64.1 tok/s | 1.74x | 24.4% / 72.3% |
Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
K and V cache types are set independently with --cache-type-k and --cache-type-v. For the preset rationale and benchmark details, see KV Cache Quantization Benchmarks for Long Context.
| K / V | % of bf16 size | 99.9% precision | What it is for |
|---|---|---|---|
| bf16 / bf16 | 100.0 | 100.00% | Preserving full quality |
| q8_0 / q8_0 | 53.1 | 94.62% | Validation and blame-isolation mode |
| q8_0 / q6_0 | 46.9 | 94.33% | Recommended high-end preset |
| q8_0 / q5_1 | 45.3 | 94.21% | Fallback if q6_0 V is unavailable |
| q8_0 / q5_0 | 43.8 | 93.69% | If the high-end rows miss the fit by a narrow margin |
| q6_0 / q5_0 | 37.5 | 93.29% | Optional headroom tier between q5 and q8 K |
| q5_0 / q5_0 | 34.4 | 93.16% | Normal quality preset |
| q5_0 / q4_1 | 32.8 | 92.65% | Best default if VRAM-constrained |
| q5_0 / q4_0 | 31.3 | 91.39% | If q5_0 / q4_1 misses the fit by a narrow margin |
| q4_0 / q4_0 | 28.1 | 88.87% | Memory saving with visible precision loss |
| q4_0 / turbo3_tcq | 24.2 | 84.93% | Smaller than q4, cleaner than symmetric turbo3_tcq |
| turbo3_tcq / turbo3_tcq | 20.3 | 81.56% | Viable extreme-compression mode |
| turbo2_tcq / turbo2_tcq | 14.1 | 54.38% | Last resort: not for code, JSON, math, or tool calls |
99.9% precision = 100 · exp(−(quantKLD − bf16KLD)) at the 99.9% KL-divergence tail.
| Type | Origin | bpv | Diff vs bf16 | Notes |
|---|---|---|---|---|
| q8_0 | upstream | 8.5 | 1.88× | High-fidelity K or V |
| q6_0 | upstream | 6.5 | 2.46× | Robust type for high-end presets |
| q5_1 | upstream | 6 | 2.67× | Conservative, might be better for V than q5_0 |
| q5_0 | upstream | 5.5 | 2.91× | Strong K type for VRAM constrained configs |
| q4_1 | upstream | 5 | 3.2× | Smaller than q5_0, but weaker in the tail. Prefer q5_0 for K |
| q4_0 | upstream | 4.5 | 3.56× | Default high compression type, decent at its size |
| turbo4 | fork | 4.125 | 3.88× | Barely smaller than q4_0, slower, worse tail |
| turbo3_tcq | fork | 3.25 | 4.92× | Viable compact mode, 82% precision at KLD 99.9%. CUDA-only |
| turbo3 | fork | 3.125 | 5.12× | Weaker than turbo3_tcq. Use only when TCQ is unavailable |
| turbo2_tcq | fork | 2.25 | 7.11× | Last resort, 54% precision at KLD 99.9%. CUDA-only |
| turbo2 | fork | 2.125 | 7.53× | Extreme quality risk. Use only when TCQ is unavailable |
Current release binaries are on the releases page:
| Platform | Backend | Archive |
|---|---|---|
| macOS arm64 | Metal | bin-macos-arm64.tar.gz |
| Ubuntu x64 | CPU | bin-ubuntu-x64.tar.gz |
| Ubuntu arm64 | CPU | bin-ubuntu-arm64.tar.gz |
| Ubuntu x64 | CUDA 12.4 | bin-ubuntu-cuda-12.4-x64.tar.gz |
| Ubuntu x64 | CUDA 13.1 | bin-ubuntu-cuda-13.1-x64.tar.gz |
| Ubuntu x64 | Vulkan | bin-ubuntu-vulkan-x64.tar.gz |
| Ubuntu x64 | ROCm 7.2 | bin-ubuntu-rocm-7.2-x64.tar.gz |
| Ubuntu x64 | SYCL | bin-ubuntu-sycl-x64.tar.gz |
| Windows x64 | CPU | bin-win-cpu-x64.zip |
| Windows x64 | SYCL | bin-win-sycl-x64.zip |
| Windows x64 | CUDA 12.4 | bin-win-cuda-12.4-x64.zip |
| Windows x64 | CUDA 13.1 | bin-win-cuda-13.1-x64.zip |
| Windows x64 | HIP/Radeon | bin-win-hip-radeon-x64.zip |
Windows CUDA archives contain a ggml-cuda.dll backend; download the matching cudart-win-cuda-*-x64.zip runtime archive and extract it into the same folder. Windows SYCL and HIP archives ship as standalone packages with all required runtime DLLs bundled.
Docker images are published to ghcr.io/anbeeld/beellama.cpp:
| Image | Acceleration | Platforms |
|---|---|---|
server, server-cpu | CPU | linux/amd64, linux/arm64 |
server-cuda, server-cuda12 | CUDA 12.4 | linux/amd64 |
server-cuda13 | CUDA 13.1 | linux/amd64 |
server-rocm | ROCm | linux/amd64 |
server-vulkan | Vulkan | linux/amd64 |
server-sycl | SYCL | linux/amd64 |
Building from source with -DGGML_NATIVE=ON may result in a tiny bit better performance, so it might still be a good idea to do that if/when you decide to use this fork long-term.
# Linux (GCC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
-DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Windows (MSVC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON ^
-DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON ^
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --parallel
# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
GGML_CUDA_FA_ALL_QUANTS=ON is required for TurboQuant and TCQ cache types. Add -DCMAKE_CUDA_ARCHITECTURES=86 for RTX 3090, or -DCMAKE_CUDA_ARCHITECTURES=89 for RTX 4090, if cross-compiling or building in CI without a GPU.
Bee inherits llama.cpp backend support, including Metal, HIP, Vulkan, SYCL, BLAS, CANN, MUSA, OpenVINO, OpenCL, and RPC. Use the upstream-style build docs in docs/build.md and backend-specific pages under docs/backend.
llama-cli -m model.gguf
llama-cli -m model.gguf -cnv --chat-template chatml
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p "Request: schedule a call at 8pm; Command:"
llama-server -m model.gguf --port 8080
llama-server -m model.gguf -c 16384 -np 4
llama-server -m model.gguf -md draft.gguf
llama-server -m target.gguf --spec-type dflash \
--spec-draft-model drafter.gguf \
--spec-draft-ngl all \
--flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq
Keep PRs small and scoped. Run the narrowest relevant tests or benchmarks before opening a PR, and include the exact commands. For fork-specific speculative decoding, DFlash, TurboQuant, or reasoning-loop changes, update the corresponding docs when behavior or args change.
Read CONTRIBUTING.md for inherited llama.cpp contribution conventions and this fork's AI usage policy.
llama-server - MITcommon/suffix-tree.*, common/int32-map.h) - Apache-2.0ggml/src/ggml-openvino/openvino/frontend.h) - Apache-2.0ggml/src/ggml-sycl/) - Apache-2.0 WITH LLVM-exceptionSee the licenses/ directory for full license texts.