Speculative Config - MTP Crash related to quantized expert names

#1
by seanthomaswilliams - opened

I'm seeing a crash when enabling MTP speculative decoding with the official GPTQ checkpoint:

  • Model: Qwen/Qwen3.5-35B-A3B-GPTQ-Int4
  • vLLM version: v0.16.0rc2.dev447

Command:

vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --quantization moe_wna16 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Crash snippet:

.../vllm/model_executor/models/qwen3_5_mtp.py", line 286, in <...>
KeyError: 'layers.0.mlp.experts.w2_weight'
  • The checkpoint does include MTP weights (785 keys, including mtp.fc.weight, mtp.layers.0.*, etc.).
  • The config's dynamic exclude list excludes mtp from quantization.
  • However, the MTP drafter weight-loading path (Qwen3_5MoeMTP) still expects unquantized expert parameter names such as w2_weight for the base MoE layers it references.
  • With --quantization moe_wna16, those expert tensors are instead stored as quantized components (qweight, qzeros, scales, g_idx) rather than a single plain weight tensor.
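For illustration, the mismatch can be checked directly against a checkpoint's key list. The helper below is hypothetical (not part of vLLM), and the sample keys merely mimic the GPTQ-style layout described above:

```python
# Hypothetical helper: classify how expert tensors are named in a checkpoint
# key list. Not vLLM code -- just a sketch of the naming gap described above.

def expert_layout(keys):
    """Return 'quantized' if expert tensors use GPTQ-style component names
    (qweight/qzeros/scales/g_idx), 'unquantized' for plain *_weight names,
    and 'unknown' if no expert keys are present."""
    quant_suffixes = ("qweight", "qzeros", "scales", "g_idx")
    expert_keys = [k for k in keys if ".experts." in k]
    if any(k.endswith(quant_suffixes) for k in expert_keys):
        return "quantized"
    if any(k.endswith("_weight") for k in expert_keys):
        return "unquantized"
    return "unknown"

# Sample keys mimicking a GPTQ MoE checkpoint vs. an unquantized one:
print(expert_layout(["layers.0.mlp.experts.w2_qweight",
                     "layers.0.mlp.experts.w2_scales"]))   # quantized
print(expert_layout(["layers.0.mlp.experts.w2_weight"]))   # unquantized
```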

So this appears to be a naming gap in the MTP weight loader for quantized expert layouts, rather than missing MTP tensors in the checkpoint.
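One possible direction for a fix is for the loader to expand unquantized expert names into their quantized component names before the lookup. The sketch below is purely illustrative: the function, the stem list, and the component tuple are assumptions modeled on GPTQ-style layouts, not vLLM's actual loader API:

```python
# Hypothetical sketch of the remapping a quantization-aware MTP loader
# could apply. Component names follow GPTQ-style layouts; not vLLM code.
QUANT_COMPONENTS = ("qweight", "qzeros", "scales", "g_idx")

def remap_expert_key(key: str) -> list[str]:
    """Expand an unquantized expert name like '...w2_weight' into the
    quantized component names a GPTQ checkpoint actually contains."""
    for stem in ("w13_weight", "w2_weight"):
        if key.endswith(stem):
            prefix = key[: -len("weight")]  # keep the '...w2_' prefix
            return [prefix + comp for comp in QUANT_COMPONENTS]
    return [key]  # non-expert keys pass through unchanged

print(remap_expert_key("layers.0.mlp.experts.w2_weight"))
# ['layers.0.mlp.experts.w2_qweight', 'layers.0.mlp.experts.w2_qzeros',
#  'layers.0.mlp.experts.w2_scales', 'layers.0.mlp.experts.w2_g_idx']
```

A real fix would presumably branch on the layer's quant config rather than string suffixes, but the lookup-time expansion is the core idea.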

Same for me
