Anbeeld

Projects Articles Support Contact

Anbeeld's BeeLlama.cpp

GitHub

A performance-focused llama.cpp fork with DFlash for faster generation and longer context.

BeeLlama.cpp logo

BeeLlama.cpp (or just Bee) is a performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools and server flow, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, and reasoning-loop protection, with full multimodal support.

Not quite a pegasus, but close enough.

Plug-and-Play Setups

Support my work!

Fork Features

For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.

TurboQuant (WHT-based scalar quantization) and the TQ3_1S / TQ4_1S weight formats originate from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).

DFlash Speedup

DFlash is strongest on structured, repetitive generation: code, tests, boilerplate, JSON-like formats, and other low-entropy continuations. Open-ended prose is much less predictable, so gains are smaller.

Qwen 3.6 27B

Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.

PromptServerOutputMedianBestSpeedupAcceptance
Task store moduleBaseline~1K tok37.2 tok/s37.2 tok/s1.00xN/A
Task store moduleDFlash~1K tok163.9 tok/s181.9 tok/s4.40x67.7% / 89.2%
Task store moduleMTP~1K tok69.3 tok/s69.6 tok/s1.86x92.0% / 73.3%
KV report moduleBaseline~1K tok34.6 tok/s36.5 tok/s1.00xN/A
KV report moduleDFlash~1K tok157.7 tok/s162.5 tok/s4.56x58.8% / 88.9%
KV report moduleMTP~1K tok67.3 tok/s68.1 tok/s1.94x89.3% / 73.0%
Doubly-linked listBaseline~4K tok36.8 tok/s36.9 tok/s1.00xN/A
Doubly-linked listDFlash~4K tok130.8 tok/s154.1 tok/s3.56x50.4% / 86.8%
Doubly-linked listMTP~4K tok66.3 tok/s68.0 tok/s1.80x87.8% / 72.5%
Prompt processingBaseline~20K tok1229.5 tok/s1229.5 tok/s1.00xN/A
Prompt processingDFlash~20K tok1214.4 tok/s1221.7 tok/s0.99xN/A
Prompt processingMTP~20K tok1162.6 tok/s1164.7 tok/s0.95xN/A
Multi-turn codingBaseline~28K tok33.3 tok/s33.3 tok/s1.00xN/A
Multi-turn codingDFlash~30K tok64.6 tok/s65.4 tok/s1.94x24.9% / 72.9%
Multi-turn codingMTP~34K tok56.5 tok/s56.5 tok/s1.70x71.9% / 68.3%

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens

Gemma 4 31B

Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.

PromptServerOutputMedianBestSpeedupAcceptance
Task store moduleBaseline~1K tok36.1 tok/s36.1 tok/s1.00xN/A
Task store moduleDFlash~1K tok177.8 tok/s182.0 tok/s4.93x65.7% / 90.0%
KV report moduleBaseline~1K tok35.9 tok/s36.0 tok/s1.00xN/A
KV report moduleDFlash~1K tok154.3 tok/s162.8 tok/s4.29x55.7% / 88.6%
Doubly-linked listBaseline~1.9K tok36.0 tok/s36.0 tok/s1.00xN/A
Doubly-linked listDFlash~1.9K tok116.6 tok/s127.3 tok/s3.24x44.5% / 84.9%
Prompt processingBaseline~24K tok1021.3 tok/s1021.3 tok/s1.00xN/A
Prompt processingDFlash~24K tok954.5 tok/s954.9 tok/s0.93xN/A
Multi-turn codingBaseline~12K tok34.8 tok/s34.8 tok/s1.00xN/A
Multi-turn codingDFlash~12K tok60.6 tok/s64.1 tok/s1.74x24.4% / 72.3%

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens

KV Cache Quantization

K and V cache types are set independently with --cache-type-k and --cache-type-v. For the preset rationale and benchmark details, see KV Cache Quantization Benchmarks for Long Context.

Preset Ladder

K / V% of bf16 size99.9% precisionWhat it is for
bf16 / bf16100.0100.00%Preserving full quality
q8_0 / q8_053.194.62%Validation and blame-isolation mode
q8_0 / q6_046.994.33%Recommended high-end preset
q8_0 / q5_145.394.21%Fallback if q6_0 V is unavailable
q8_0 / q5_043.893.69%If the high-end rows miss the fit by a narrow margin
q6_0 / q5_037.593.29%Optional headroom tier between q5 and q8 K
q5_0 / q5_034.493.16%Normal quality preset
q5_0 / q4_132.892.65%Best default if VRAM-constrained
q5_0 / q4_031.391.39%If q5_0 / q4_1 misses the fit by a narrow margin
q4_0 / q4_028.188.87%Memory saving with visible precision loss
q4_0 / turbo3_tcq24.284.93%Smaller than q4, cleaner than symmetric turbo3_tcq
turbo3_tcq / turbo3_tcq20.381.56%Viable extreme-compression mode
turbo2_tcq / turbo2_tcq14.154.38%Last resort: not for code, JSON, math, or tool calls

99.9% precision = 100 · exp(−(quantKLD − bf16KLD)) at the 99.9% KL-divergence tail.

Type Reference

TypeOriginbpvDiff vs bf16Notes
q8_0upstream8.51.88×High-fidelity K or V
q6_0upstream6.52.46×Robust type for high-end presets
q5_1upstream62.67×Conservative, might be better for V than q5_0
q5_0upstream5.52.91×Strong K type for VRAM constrained configs
q4_1upstream53.2×Smaller than q5_0, but weaker in the tail. Prefer q5_0 for K
q4_0upstream4.53.56×Default high compression type, decent at its size
turbo4fork4.1253.88×Barely smaller than q4_0, slower, worse tail
turbo3_tcqfork3.254.92×Viable compact mode, 82% precision at KLD 99.9%. CUDA-only
turbo3fork3.1255.12×Weaker than turbo3_tcq. Use only when TCQ is unavailable
turbo2_tcqfork2.257.11×Last resort, 54% precision at KLD 99.9%. CUDA-only
turbo2fork2.1257.53×Extreme quality risk. Use only when TCQ is unavailable

Installation

Plug-and-Play Setups

Prebuilt

Current release binaries are on the releases page:

PlatformBackendArchive
macOS arm64Metalbin-macos-arm64.tar.gz
Ubuntu x64CPUbin-ubuntu-x64.tar.gz
Ubuntu arm64CPUbin-ubuntu-arm64.tar.gz
Ubuntu x64CUDA 12.4bin-ubuntu-cuda-12.4-x64.tar.gz
Ubuntu x64CUDA 13.1bin-ubuntu-cuda-13.1-x64.tar.gz
Ubuntu x64Vulkanbin-ubuntu-vulkan-x64.tar.gz
Ubuntu x64ROCm 7.2bin-ubuntu-rocm-7.2-x64.tar.gz
Ubuntu x64SYCLbin-ubuntu-sycl-x64.tar.gz
Windows x64CPUbin-win-cpu-x64.zip
Windows x64SYCLbin-win-sycl-x64.zip
Windows x64CUDA 12.4bin-win-cuda-12.4-x64.zip
Windows x64CUDA 13.1bin-win-cuda-13.1-x64.zip
Windows x64HIP/Radeonbin-win-hip-radeon-x64.zip

Windows CUDA archives contain a ggml-cuda.dll backend; download the matching cudart-win-cuda-*-x64.zip runtime archive and extract it into the same folder. Windows SYCL and HIP archives ship as standalone packages with all required runtime DLLs bundled.

Docker images are published to ghcr.io/anbeeld/beellama.cpp:

ImageAccelerationPlatforms
server, server-cpuCPUlinux/amd64, linux/arm64
server-cuda, server-cuda12CUDA 12.4linux/amd64
server-cuda13CUDA 13.1linux/amd64
server-rocmROCmlinux/amd64
server-vulkanVulkanlinux/amd64
server-syclSYCLlinux/amd64

Building from source with -DGGML_NATIVE=ON may result in a tiny bit better performance, so it might still be a good idea to do that if/when you decide to use this fork long-term.

CUDA Build

# Linux (GCC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Windows (MSVC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON ^
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON ^
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --parallel

# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

GGML_CUDA_FA_ALL_QUANTS=ON is required for TurboQuant and TCQ cache types. Add -DCMAKE_CUDA_ARCHITECTURES=86 for RTX 3090, or -DCMAKE_CUDA_ARCHITECTURES=89 for RTX 4090, if cross-compiling or building in CI without a GPU.

Other Backends

Bee inherits llama.cpp backend support, including Metal, HIP, Vulkan, SYCL, BLAS, CANN, MUSA, OpenVINO, OpenCL, and RPC. Use the upstream-style build docs in docs/build.md and backend-specific pages under docs/backend.

Common Commands

Local CLI

llama-cli -m model.gguf
llama-cli -m model.gguf -cnv --chat-template chatml
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p "Request: schedule a call at 8pm; Command:"

OpenAI-Compatible Server

llama-server -m model.gguf --port 8080
llama-server -m model.gguf -c 16384 -np 4
llama-server -m model.gguf -md draft.gguf

DFlash And TurboQuant Together

llama-server -m target.gguf --spec-type dflash \
  --spec-draft-model drafter.gguf \
  --spec-draft-ngl all \
  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq

Documentation

Contributing

Keep PRs small and scoped. Run the narrowest relevant tests or benchmarks before opening a PR, and include the exact commands. For fork-specific speculative decoding, DFlash, TurboQuant, or reasoning-loop changes, update the corresponding docs when behavior or args change.

Read CONTRIBUTING.md for inherited llama.cpp contribution conventions and this fork's AI usage policy.

Dependencies

See the licenses/ directory for full license texts.

Back to projects