Speculative Config - MTP Crash related to quantized expert names
#1
by seanthomaswilliams - opened
I'm seeing a crash when enabling MTP speculative decoding with the official GPTQ checkpoint:
- Model: Qwen/Qwen3.5-35B-A3B-GPTQ-Int4
- vLLM version: v0.16.0rc2.dev447
Command:
vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
--quantization moe_wna16 \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Crash snippet:
.../vllm/model_executor/models/qwen3_5_mtp.py", line 286, in <...>
KeyError: 'layers.0.mlp.experts.w2_weight'
- The checkpoint does include MTP weights (785 keys, including mtp.fc.weight, mtp.layers.0.*, etc.).
- The config's dynamic exclude list excludes mtp from quantization.
- However, the MTP drafter loader path (Qwen3_5MoeMTP) still expects unquantized expert names like w2_weight for the referenced base MoE layers.
- With --quantization moe_wna16, expert tensors are instead stored under quantized component names (qweight, qzeros, scales, g_idx).
So this appears to be a naming gap in the MTP weight loader for quantized expert layouts, rather than missing MTP tensors in the checkpoint.
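To illustrate the gap: a loader that only looks up the unquantized name will miss the quantized components entirely. Below is a minimal, hypothetical sketch of the kind of name expansion the drafter loader would need; none of these helper names come from vLLM, and the component suffixes are taken from the list above.

```python
# Hypothetical sketch of expert-weight name expansion for quantized layouts.
# The helper and its naming scheme are illustrative, not vLLM's actual API.

# Components a GPTQ/moe_wna16 layout stores per logical weight (per the report).
QUANT_COMPONENTS = ("qweight", "qzeros", "scales", "g_idx")

def expand_expert_name(logical_name: str, quantized: bool) -> list[str]:
    """Map a logical expert weight name (e.g. 'layers.0.mlp.experts.w2_weight')
    to the checkpoint keys that would actually hold it."""
    if not quantized:
        # Unquantized checkpoints store the tensor under the logical name.
        return [logical_name]
    # Quantized checkpoints split the tensor into several components,
    # e.g. 'layers.0.mlp.experts.w2_qweight', '..._w2_qzeros', etc.
    base = logical_name.removesuffix("_weight")
    return [f"{base}_{comp}" for comp in QUANT_COMPONENTS]

# A loader that only tries the unquantized key against a quantized
# checkpoint will fail with exactly the KeyError shown in the snippet.
print(expand_expert_name("layers.0.mlp.experts.w2_weight", quantized=True))
```

In other words, the fix would likely be in how the MTP loader resolves base-model expert names, not in the checkpoint itself.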
Same for me