Qwen3.6-27B EAGLE-3 Drafter

EAGLE-3 speculative-decoding drafter for Qwen3.6-27B β€” a small (~0.6 B trainable) draft head that makes the 27 B target decode losslessly faster. Trained for the abliterated Qwen3.6-27B-PRISM-PRO; also validated on stock Qwen/Qwen3.6-27B.

Tuned for long-horizon, agentic work β€” not just shallow chat. The self-distillation corpus deliberately oversamples long-context, multi-turn, and agentic / tool-use traces (terminal, IDE, coding, function-calling, with long and extra-long sequence buckets) β€” so the drafter's acceptance holds up on deep, long-running generation, not only short single-turn prompts.

Two variants are provided:

variant draft vocab size use
full/ 248 320 (full) 3.1 GB resume-training base; widest compatibility
compressed/ 32 000 (+ d2t map) 1.1 GB recommended for serving β€” fastest

Performance

Single-stream greedy decode, BF16 target, SGLang 0.5.12. EAGLE-3 speculative decoding is lossless β€” output is token-identical to non-spec greedy decode (modulo batched-verify floating-point non-associativity, intrinsic to all speculative decoding).

target engine Β· drafter no-spec tok/s EAGLE-3 tok/s speedup
Qwen3.6-27B-PRISM-PRO SGLang Β· compressed drafter 93 183 1.97Γ—
Qwen3.6-27B-PRISM-PRO SGLang Β· full drafter 93 171 1.84Γ—
Qwen/Qwen3.6-27B (stock) SGLang Β· compressed drafter 93 171 1.84Γ—
Qwen3.6-27B-PRISM-PRO vLLM Β· full drafter 90 130 1.44Γ—

Accept length (EAGLE Ο„ β€” mean tokens emitted per target forward) is 2.4 on PRISM-PRO and 2.2 on stock Qwen3.6-27B. Tree drafting (--speculative-eagle-topk 4) raises accept length to ~3.35 but is throughput-neutral on this hybrid GatedDeltaNet target (the tree-build + recurrent-verify cost cancels the acceptance gain) β€” chain is recommended.

Quick start β€” SGLang

SGLang's dense Qwen3.6 model class (Qwen3_5ForConditionalGeneration) ships DFlash/MTP aux-hidden capture but not the EAGLE-3 target hook. A one-shot, idempotent patch adds it (two methods, qwen3_5.py only β€” see below).

pip install "sglang>=0.5.12"

# 1. patch SGLang for EAGLE-3 on Qwen3.6 (auto-locates your install, writes a .bak)
python3 patch_sglang_eagle3.py

# 2. serve at the validated fast config (compressed drafter, chain)
MODEL=Qwen/Qwen3.6-27B DRAFTER=./compressed ./serve_sglang.sh

Or launch directly:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path ./compressed \
  --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --mm-attention-backend sdpa \
  --dtype bfloat16 --trust-remote-code

Key flags (all required for a hybrid GatedDeltaNet target):

  • --mamba-scheduler-strategy extra_buffer + SGLANG_ENABLE_SPEC_V2=1 β€” the V2 scheduler; SGLang rejects spec decoding + radix cache without it.
  • --mm-attention-backend sdpa β€” the default fa4 vision kernel asserts SM 10.x/11.x and rejects newer Blackwell (sm_103); sdpa is arch-agnostic.

The patch

patch_sglang_eagle3.py adds set_eagle3_layers_to_capture to Qwen3_5ForCausalLM (the dense decoder) and overrides it on Qwen3_5ForConditionalGeneration (the multimodal wrapper) so EAGLE-3 aux-hidden capture routes through the Qwen3.6 decoder's per-layer _is_layer_to_capture mechanism. Without it, the inherited Qwen3-VL hook sets a layers_to_capture list the Qwen3.6 decoder never reads β€” capture silently no-ops and the forward pass crashes. The patch is anchor-based, idempotent (keys on an [EAGLE3-PATCH] marker), writes a .eagle3-bak backup, AST-validates the result, and is verified against SGLang 0.5.12 and main. It touches qwen3_5.py only.

Architecture

Standard EAGLE-3 drafter: 1 decoder layer + an fc layer fusing auxiliary hidden states from target layers [1, 31, 60] (low/mid/high), sharing the target's embed_tokens and lm_head. architectures: ["LlamaForCausalLMEagle3"], hidden_size 5120. It runs against any Qwen3.6-27B target β€” output correctness is gated entirely on the target's own argmax (the verify step), so the drafter is purely a speed lever and can never change what the target would have produced.

Continue training

To push accept length past ~3.3 (e.g. tree-aware / parallel-draft objectives), warm-start from full/ β€” a standard LlamaForCausalLMEagle3 EAGLE-3 checkpoint loadable by SpecForge or NVIDIA TensorRT Model-Optimizer (which trained this drafter). It was self-distilled on Qwen3.6-27B-PRISM-PRO completions (REAP + UltraChat + tulu-3 corpus). The current drafter is chain-trained (parallel_draft_step: 1); tree-aware retraining is the main lever for further gains.

Files

full/        config.json + model.safetensors   β€” 248K-vocab drafter
compressed/  config.json + model.safetensors   β€” 32K-vocab drafter (+ d2t)
patch_sglang_eagle3.py                          β€” one-shot SGLang EAGLE-3 patch
serve_sglang.sh                                 β€” validated fast-config launcher

License

Apache-2.0. Derived from Qwen/Qwen3.6-27B (Apache-2.0).

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Ex0bit/Qwen3.6-27B-PRISM-EAGLE3

Base model

Qwen/Qwen3.6-27B
Finetuned
(182)
this model