Qwen3.6-27B EAGLE-3 Drafter

EAGLE-3 speculative-decoding drafter for Qwen3.6-27B — a small (~0.6 B trainable) draft head that makes the 27 B target decode losslessly faster. Trained for the abliterated Qwen3.6-27B-PRISM-PRO; also validated on stock Qwen/Qwen3.6-27B.

Tuned for long-horizon, agentic work — not just shallow chat. The self-distillation corpus deliberately oversamples long-context, multi-turn, and agentic / tool-use traces (terminal, IDE, coding, function-calling, with long and extra-long sequence buckets) — so the drafter's acceptance holds up on deep, long-running generation, not only short single-turn prompts.

Two variants are provided:

variant	draft vocab	size	use
`full/`	248 320 (full)	3.1 GB	resume-training base; widest compatibility
`compressed/`	32 000 (+ `d2t` map)	1.1 GB	recommended for serving — fastest

Performance

Single-stream greedy decode, BF16 target, SGLang 0.5.12. EAGLE-3 speculative decoding is lossless — output is token-identical to non-spec greedy decode (modulo batched-verify floating-point non-associativity, intrinsic to all speculative decoding).

target	engine · drafter	no-spec tok/s	EAGLE-3 tok/s	speedup
Qwen3.6-27B-PRISM-PRO	SGLang · compressed drafter	93	183	1.97×
Qwen3.6-27B-PRISM-PRO	SGLang · full drafter	93	171	1.84×
`Qwen/Qwen3.6-27B` (stock)	SGLang · compressed drafter	93	171	1.84×
Qwen3.6-27B-PRISM-PRO	vLLM · full drafter	90	130	1.44×

Accept length (EAGLE τ — mean tokens emitted per target forward) is 2.4 on PRISM-PRO and 2.2 on stock Qwen3.6-27B. Tree drafting (--speculative-eagle-topk 4) raises accept length to ~3.35 but is throughput-neutral on this hybrid GatedDeltaNet target (the tree-build + recurrent-verify cost cancels the acceptance gain) — chain is recommended.

Quick start — SGLang

SGLang's dense Qwen3.6 model class (Qwen3_5ForConditionalGeneration) ships DFlash/MTP aux-hidden capture but not the EAGLE-3 target hook. A one-shot, idempotent patch adds it (two methods, qwen3_5.py only — see below).

pip install "sglang>=0.5.12"

# 1. patch SGLang for EAGLE-3 on Qwen3.6 (auto-locates your install, writes a .bak)
python3 patch_sglang_eagle3.py

# 2. serve at the validated fast config (compressed drafter, chain)
MODEL=Qwen/Qwen3.6-27B DRAFTER=./compressed ./serve_sglang.sh

Or launch directly:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path ./compressed \
  --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --mm-attention-backend sdpa \
  --dtype bfloat16 --trust-remote-code

Key flags (all required for a hybrid GatedDeltaNet target):

--mamba-scheduler-strategy extra_buffer + SGLANG_ENABLE_SPEC_V2=1 — the V2 scheduler; SGLang rejects spec decoding + radix cache without it.
--mm-attention-backend sdpa — the default fa4 vision kernel asserts SM 10.x/11.x and rejects newer Blackwell (sm_103); sdpa is arch-agnostic.

The patch

patch_sglang_eagle3.py adds set_eagle3_layers_to_capture to Qwen3_5ForCausalLM (the dense decoder) and overrides it on Qwen3_5ForConditionalGeneration (the multimodal wrapper) so EAGLE-3 aux-hidden capture routes through the Qwen3.6 decoder's per-layer _is_layer_to_capture mechanism. Without it, the inherited Qwen3-VL hook sets a layers_to_capture list the Qwen3.6 decoder never reads — capture silently no-ops and the forward pass crashes. The patch is anchor-based, idempotent (keys on an [EAGLE3-PATCH] marker), writes a .eagle3-bak backup, AST-validates the result, and is verified against SGLang 0.5.12 and main. It touches qwen3_5.py only.

Architecture

Standard EAGLE-3 drafter: 1 decoder layer + an fc layer fusing auxiliary hidden states from target layers [1, 31, 60] (low/mid/high), sharing the target's embed_tokens and lm_head. architectures: ["LlamaForCausalLMEagle3"], hidden_size 5120. It runs against any Qwen3.6-27B target — output correctness is gated entirely on the target's own argmax (the verify step), so the drafter is purely a speed lever and can never change what the target would have produced.

Continue training

To push accept length past ~3.3 (e.g. tree-aware / parallel-draft objectives), warm-start from full/ — a standard LlamaForCausalLMEagle3 EAGLE-3 checkpoint loadable by SpecForge or NVIDIA TensorRT Model-Optimizer (which trained this drafter). It was self-distilled on Qwen3.6-27B-PRISM-PRO completions (REAP + UltraChat + tulu-3 corpus). The current drafter is chain-trained (parallel_draft_step: 1); tree-aware retraining is the main lever for further gains.

Files

full/        config.json + model.safetensors   — 248K-vocab drafter
compressed/  config.json + model.safetensors   — 32K-vocab drafter (+ d2t)
patch_sglang_eagle3.py                          — one-shot SGLang EAGLE-3 patch
serve_sglang.sh                                 — validated fast-config launcher

License

Apache-2.0. Derived from Qwen/Qwen3.6-27B (Apache-2.0).

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for Ex0bit/Qwen3.6-27B-PRISM-EAGLE3

Base model

Qwen/Qwen3.6-27B

Finetuned

(182)

this model