Qwen3.6-27B EAGLE-3 Drafter
EAGLE-3 speculative-decoding drafter for Qwen3.6-27B: a small (~0.6 B
trainable) draft head that speeds up decoding on the 27 B target without changing its output.
Trained for the abliterated Qwen3.6-27B-PRISM-PRO; also validated on stock
Qwen/Qwen3.6-27B.
Tuned for long-horizon, agentic work, not just shallow chat. The self-distillation corpus deliberately oversamples long-context, multi-turn, and agentic / tool-use traces (terminal, IDE, coding, function-calling, with long and extra-long sequence buckets), so the drafter's acceptance holds up on deep, long-running generation, not only short single-turn prompts.
Two variants are provided:
| variant | draft vocab | size | use |
|---|---|---|---|
| full/ | 248 320 (full) | 3.1 GB | resume-training base; widest compatibility |
| compressed/ | 32 000 (+ d2t map) | 1.1 GB | recommended for serving; fastest |
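The compressed variant predicts over a 32 000-entry draft vocabulary, so its token ids have to be translated back to the target's ids before verification. The following is a minimal sketch of that lookup. It assumes the d2t map stores per-token offsets (target id = draft id + offset), a layout used by common EAGLE-3 checkpoints. The function name and toy values are hypothetical, and the actual tensor layout should be checked against the checkpoint itself.

```python
def draft_to_target(draft_ids, d2t):
    """Map draft-vocab token ids to target-vocab ids.

    Assumption: d2t[i] is an offset such that target_id = i + d2t[i].
    """
    return [i + d2t[i] for i in draft_ids]


# Toy 4-entry draft vocabulary (invented values, not read from the checkpoint).
toy_d2t = [0, 4, 9, 100]
print(draft_to_target([0, 1, 3], toy_d2t))
```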
Performance
Single-stream greedy decode, BF16 target, SGLang 0.5.12. EAGLE-3 speculative decoding is lossless: output is token-identical to non-spec greedy decode (modulo batched-verify floating-point non-associativity, intrinsic to all speculative decoding).
| target | engine · drafter | no-spec tok/s | EAGLE-3 tok/s | speedup |
|---|---|---|---|---|
| Qwen3.6-27B-PRISM-PRO | SGLang · compressed drafter | 93 | 183 | 1.97× |
| Qwen3.6-27B-PRISM-PRO | SGLang · full drafter | 93 | 171 | 1.84× |
| Qwen/Qwen3.6-27B (stock) | SGLang · compressed drafter | 93 | 171 | 1.84× |
| Qwen3.6-27B-PRISM-PRO | vLLM · full drafter | 90 | 130 | 1.44× |
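The speedup column is the ratio of EAGLE-3 throughput to no-spec throughput. A short sketch that recomputes it from the tok/s figures in the table (the dictionary keys are shorthand labels, not official names):

```python
# (no-spec tok/s, EAGLE-3 tok/s), copied from the table above.
results = {
    "PRISM-PRO, SGLang, compressed": (93, 183),
    "PRISM-PRO, SGLang, full": (93, 171),
    "stock, SGLang, compressed": (93, 171),
    "PRISM-PRO, vLLM, full": (90, 130),
}


def speedup(no_spec, spec):
    """Throughput ratio of speculative decode over plain decode."""
    return spec / no_spec


for name, (base, spec) in results.items():
    print(f"{name}: {speedup(base, spec):.2f}x")
```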
Accept length (EAGLE τ, the mean number of tokens emitted per target forward) is 2.4 on
PRISM-PRO and 2.2 on stock Qwen3.6-27B. Tree drafting (--speculative-eagle-topk 4)
raises accept length to ~3.35 but is throughput-neutral on this hybrid
GatedDeltaNet target (the tree-build + recurrent-verify cost cancels the
acceptance gain), so chain drafting is recommended.
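To make the accept-length metric concrete, here is a small sketch that averages tokens emitted per target forward. The trace is invented for illustration. It assumes each verify step emits the accepted draft tokens plus one token from the target, so a 3-step chain emits between 1 and 4 tokens per forward.

```python
def accept_length(emitted_per_forward):
    """Mean number of tokens emitted per target forward pass."""
    if not emitted_per_forward:
        raise ValueError("need at least one verify step")
    return sum(emitted_per_forward) / len(emitted_per_forward)


# Hypothetical trace: tokens emitted in each of five target forwards.
trace = [3, 1, 4, 2, 2]
print(accept_length(trace))
```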
Quick start: SGLang
SGLang's dense Qwen3.6 model class (Qwen3_5ForConditionalGeneration) ships
DFlash/MTP aux-hidden capture but not the EAGLE-3 target hook. A one-shot,
idempotent patch adds it (two methods, qwen3_5.py only; see below).
pip install "sglang>=0.5.12"
# 1. patch SGLang for EAGLE-3 on Qwen3.6 (auto-locates your install, writes a .bak)
python3 patch_sglang_eagle3.py
# 2. serve at the validated fast config (compressed drafter, chain)
MODEL=Qwen/Qwen3.6-27B DRAFTER=./compressed ./serve_sglang.sh
Or launch directly:
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path ./compressed \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--mamba-scheduler-strategy extra_buffer \
--mm-attention-backend sdpa \
--dtype bfloat16 --trust-remote-code
Key flags (all required for a hybrid GatedDeltaNet target):
- --mamba-scheduler-strategy extra_buffer + SGLANG_ENABLE_SPEC_V2=1: the V2 scheduler; SGLang rejects spec decoding + radix cache without it.
- --mm-attention-backend sdpa: the default fa4 vision kernel asserts SM 10.x/11.x and rejects newer Blackwell (sm_103); sdpa is arch-agnostic.
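If the server is started from a Python script instead of a shell, the same configuration can be assembled programmatically. This is only a sketch: build_launch is a hypothetical helper, and it uses nothing beyond the environment variable and flags from the direct launch command above.

```python
import os
import shlex


def build_launch(model_path, drafter_path):
    """Return (env, argv) mirroring the direct launch command above."""
    env = dict(os.environ, SGLANG_ENABLE_SPEC_V2="1")
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", drafter_path,
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
        "--mamba-scheduler-strategy", "extra_buffer",
        "--mm-attention-backend", "sdpa",
        "--dtype", "bfloat16",
        "--trust-remote-code",
    ]
    return env, argv


env, argv = build_launch("Qwen/Qwen3.6-27B", "./compressed")
print(shlex.join(argv))
# To actually start the server:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```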
The patch
patch_sglang_eagle3.py adds set_eagle3_layers_to_capture to
Qwen3_5ForCausalLM (the dense decoder) and overrides it on
Qwen3_5ForConditionalGeneration (the multimodal wrapper) so EAGLE-3 aux-hidden
capture routes through the Qwen3.6 decoder's per-layer _is_layer_to_capture
mechanism. Without it, the inherited Qwen3-VL hook sets a layers_to_capture
list the Qwen3.6 decoder never reads, so capture silently no-ops and the forward
pass crashes. The patch is anchor-based, idempotent (keys on an [EAGLE3-PATCH]
marker), writes a .eagle3-bak backup, AST-validates the result, and is verified
against SGLang 0.5.12 and main. It touches qwen3_5.py only.
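The properties listed above (anchor-based, idempotent via a marker, AST-validated) can be illustrated with a string-level sketch. This is not the contents of patch_sglang_eagle3.py: apply_patch, the toy class, and the inserted method are hypothetical, and the backup step is omitted because the sketch never touches files.

```python
import ast

MARKER = "# [EAGLE3-PATCH]"


def apply_patch(source, anchor, snippet):
    """Insert `snippet` after the first line containing `anchor`.

    Returns `source` unchanged if the marker is already present.
    """
    if MARKER in source:
        return source  # idempotent: already patched
    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if anchor in line:
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            # Indent the marker to match the snippet's first line.
            indent = snippet[: len(snippet) - len(snippet.lstrip())]
            lines.insert(i + 1, f"{indent}{MARKER}\n{snippet}\n")
            break
    else:
        raise ValueError(f"anchor not found: {anchor!r}")
    patched = "".join(lines)
    ast.parse(patched)  # raises SyntaxError if the result is not valid Python
    return patched


original = "class Decoder:\n    layers = 3\n"
snippet = "    def set_layers_to_capture(self, ids):\n        self.capture = list(ids)"
once = apply_patch(original, "class Decoder:", snippet)
twice = apply_patch(once, "class Decoder:", snippet)
print(once == twice)
```

Keying on a marker comment instead of comparing file contents keeps re-runs safe even if unrelated parts of the file change between runs.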
Architecture
Standard EAGLE-3 drafter: 1 decoder layer + an fc layer fusing auxiliary
hidden states from target layers [1, 31, 60] (low/mid/high), sharing the
target's embed_tokens and lm_head. architectures: ["LlamaForCausalLMEagle3"],
hidden_size 5120. It runs against any Qwen3.6-27B target: output correctness
is gated entirely on the target's own argmax (the verify step), so the drafter
is purely a speed lever and can never change what the target would have produced.
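The reason the drafter cannot alter the output is visible in a sketch of greedy chain verification. This is a simplified illustration, not SGLang's implementation. verify_chain is a hypothetical name, and it assumes the target has already produced its greedy token for every position of the draft chain in one forward pass.

```python
def verify_chain(draft_tokens, target_argmax):
    """Greedy chain verification.

    draft_tokens[i] is the drafter's guess for position i.
    target_argmax[i] is the target's greedy token for position i given the
    draft prefix draft_tokens[:i], so it has one more entry than the drafts.
    Returns the tokens actually emitted, all taken from the target.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have len(draft_tokens) + 1 entries")
    emitted = []
    for guess, truth in zip(draft_tokens, target_argmax):
        emitted.append(truth)  # always emit the target's own token
        if guess != truth:
            return emitted  # mismatch: later drafts are discarded
    emitted.append(target_argmax[-1])  # every draft accepted: bonus token
    return emitted


print(verify_chain([5, 7, 9], [5, 7, 2, 4]))
```

Every emitted token comes from target_argmax, so a poor drafter only shortens the accepted run and never changes the text.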
Continue training
To push accept length past ~3.3 (e.g. tree-aware / parallel-draft objectives),
warm-start from full/, a standard LlamaForCausalLMEagle3 EAGLE-3
checkpoint loadable by SpecForge or
NVIDIA TensorRT Model-Optimizer
(which trained this drafter). It was self-distilled on Qwen3.6-27B-PRISM-PRO
completions (REAP + UltraChat + tulu-3 corpus). The current drafter is
chain-trained (parallel_draft_step: 1); tree-aware retraining is the main lever
for further gains.
Files
full/ config.json + model.safetensors - 248K-vocab drafter
compressed/ config.json + model.safetensors - 32K-vocab drafter (+ d2t)
patch_sglang_eagle3.py - one-shot SGLang EAGLE-3 patch
serve_sglang.sh - validated fast-config launcher
License
Apache-2.0. Derived from Qwen/Qwen3.6-27B (Apache-2.0).