
Anbeeld's BeeLlama.cpp


A performance-focused llama.cpp fork with DFlash for faster generation and longer context.

BeeLlama.cpp logo

BeeLlama.cpp (or just Bee) is a performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools and server flow, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, and reasoning-loop protection, with full multimodal support.

Not quite a pegasus, but close enough.

Here's a plug-and-play Qwen 3.6 27B setup with a config that runs it in Q5 with a practically lossless 200k-token KV cache and vision on a single RTX 3090 or 4090.
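
For reference, a launch along those lines might look roughly like the sketch below. The file names, -c/-ngl values, and the --mmproj vision projector flag are assumptions for illustration; the linked config is the authoritative source for the exact flags and values.

# Hypothetical file names and values; use the linked config for the real ones.
llama-server -m qwen3.6-27b-q5_k_s.gguf --mmproj qwen3.6-27b-mmproj.gguf \
  -c 200000 -ngl 99 --flash-attn on \
  --cache-type-k turbo4 --cache-type-v turbo3_tcq \
  --spec-type dflash --spec-draft-model qwen3.6-drafter-q4_k_m.gguf --spec-draft-ngl all \
  --port 8080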

Fork Features

For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.

TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).

DFlash Speedup

Here's your typical "write in Python" best-case ceiling benchmark with Qwen 3.6 27B using a Q4_K_M drafter on a single RTX 3090 24GB. Like any other form of speculative decoding, DFlash is strongest on structured, repetitive generation: code, tests, boilerplate, JSON-like formats, and other low-entropy continuations.

Task | Model | Output | Baseline | Bee DFlash | Speedup | Acceptance
Linked list | Q4_K_M | ~1.2K tok | 39.2 tok/s | 130.1 tok/s | 3.32x | 49.1% / 84.5%
Linked list | Q5_K_S | ~1.2K tok | 36.5 tok/s | 135.8 tok/s | 3.72x | 47.8% / 85.8%
Cache library | Q4_K_M | ~3.6K tok | 37.5 tok/s | 91.5 tok/s | 2.44x | 40.5% / 78.8%
Cache library | Q5_K_S | ~3.6K tok | 35.9 tok/s | 83.7 tok/s | 2.33x | 36.7% / 76.2%

Acceptance: share of proposed draft tokens that were accepted / share of the final generated tokens that came from accepted draft tokens.

This is not a claim about all workloads. DFlash can go much faster on highly predictable code generation than on normal chat. Open-ended prose is much less predictable, so gains are smaller.

On the bright side, adaptive draft-max tracks how much DFlash is helping on the current task and adjusts drafting intensity accordingly, or turns it off entirely if you would otherwise dip below the baseline.

TurboQuant / TCQ cache

Type | bpv (bits per value) | Compression vs f16 | Quality vs f16/q8_0 | Practical verdict
turbo4 | 4.125 | 3.88x | Best scalar TurboQuant quality tier. Available tests show minimal measurable degradation vs f16/q8_0. | Best safe scalar compression target, especially for V cache.
turbo3_tcq | 3.25 | 4.92x | Strongest 3-bit quality. The TCQ docs report 10-44% KL reduction over scalar 2-3 bit quantization and lower PPL than FP16 in one Qwen3.5-27B result: 5.802 vs 5.805. | Best high-compression quality-aware option.
turbo3 | 3.125 | 5.12x | Strong compression with measurable quality cost. Available tests put it below turbo4 but still usable on tolerant models/configs. | Aggressive scalar compression. Validate per model, especially if used on K.
turbo2_tcq | 2.25 | 7.11x | Best 2-bit option. Per the TCQ docs it significantly improves 2-bit quantization and closes much of the gap with 3-bit scalar methods. | Extreme compression with better quality story than scalar 2-bit.
turbo2 | 2.125 | 7.53x | Extreme scalar compression. Highest quality risk among scalar TurboQuant types. | Emergency context/VRAM mode. Prefer as a last resort, V-only.

TurboQuant is never truly lossless, but at the higher-quality tiers it can be practically lossless for most tasks, especially when practicality is dictated by VRAM constraints and getting the most context out of them.
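
To use these at runtime, pass the cache types to llama-server with flash attention enabled. A minimal sketch using the flags shown later in this README (the model path and context size are placeholders):

llama-server -m model.gguf -c 131072 --flash-attn on \
  --cache-type-k turbo4 --cache-type-v turbo3_tcq

K and V types can be set independently, and the binaries must be built with GGML_CUDA_FA_ALL_QUANTS=ON (see the CUDA build section) for the TurboQuant and TCQ types to be available.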

Installation

Quickstart: DFlash on a Single GPU

For a step-by-step walkthrough with Qwen 3.6 on a 24 GB NVIDIA card (RTX 3090/4090, etc.), see docs/quickstart-qwen36-dflash.md. It covers model download, prebuilt binaries, and a tuned launch command.

Prebuilt (Windows)

Download the release archive for your CUDA version (12.4 or 13.1) from the releases page and extract it. The server binary is llama-server.exe. Don't forget to also download the separate archive with the CUDA libraries and extract it into the same folder!

Building from source with -DGGML_NATIVE=ON may yield slightly better performance, so it's worth doing if/when you decide to use this fork long-term.

CUDA Build

# Linux (GCC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Windows (MSVC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON ^
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON ^
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --parallel

# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

GGML_CUDA_FA_ALL_QUANTS=ON is required for TurboQuant and TCQ cache types. Add -DCMAKE_CUDA_ARCHITECTURES=86 for RTX 3090, or -DCMAKE_CUDA_ARCHITECTURES=89 for RTX 4090, if cross-compiling or building in CI without a GPU.
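
For example, the Linux configure step above, pinned to an RTX 3090 (compute capability 8.6), would become:

cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DCMAKE_BUILD_TYPE=Release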

Other Backends

Bee inherits llama.cpp backend support, including Metal, HIP, Vulkan, SYCL, BLAS, CANN, MUSA, OpenVINO, OpenCL, and RPC. Use the upstream-style build docs in docs/build.md and backend-specific pages under docs/backend.

Common Commands

Local CLI

llama-cli -m model.gguf
llama-cli -m model.gguf -cnv --chat-template chatml
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p "Request: schedule a call at 8pm; Command:"

OpenAI-Compatible Server

llama-server -m model.gguf --port 8080
llama-server -m model.gguf -c 16384 -np 4
llama-server -m model.gguf -md draft.gguf
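
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch with curl against the chat completions endpoint, assuming the default host and the port from the first command above:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a linked list in Python."}], "max_tokens": 256}'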

DFlash And TurboQuant Together

llama-server -m target.gguf --spec-type dflash \
  --spec-draft-model drafter.gguf \
  --spec-draft-ngl all \
  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq

Documentation

Contributing

Keep PRs small and scoped. Run the narrowest relevant tests or benchmarks before opening a PR, and include the exact commands. For fork-specific speculative decoding, DFlash, TurboQuant, or reasoning-loop changes, update the corresponding docs when behavior or args change.

Read CONTRIBUTING.md for inherited llama.cpp contribution conventions and this fork's AI usage policy.

Dependencies

See the licenses/ directory for full license texts.
