Speculative Config - MTP Crash related to quantized expert names
#1
by seanthomaswilliams - opened
I'm seeing a crash when enabling MTP speculative decoding with the official GPTQ checkpoint:
- Model: Qwen/Qwen3.5-35B-A3B-GPTQ-Int4
- vLLM version: v0.16.0rc2.dev447
Command:
vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
--quantization moe_wna16 \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Crash snippet:
.../vllm/model_executor/models/qwen3_5_mtp.py", line 286, in <...>
KeyError: 'layers.0.mlp.experts.w2_weight'
- The checkpoint does include MTP weights (785 keys, including mtp.fc.weight, mtp.layers.0.*, etc.).
- The config's dynamic exclude list excludes mtp from quantization.
- However, the MTP drafter loader path (Qwen3_5MoeMTP) still expects unquantized expert names like w2_weight for the referenced base MoE layers.
- With --quantization moe_wna16, expert tensors are instead stored under quantized component names (qweight, qzeros, scales, g_idx).
So this appears to be a naming gap in the MTP weight loader for quantized expert layouts, rather than missing MTP tensors in the checkpoint.
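To illustrate the gap: a loader that only looks up the unquantized name will miss the quantized components entirely. Below is a minimal, hypothetical sketch of the kind of name expansion the drafter loader would need; none of these helper names come from vLLM, and the component suffixes are taken from the list above.

```python
# Hypothetical sketch of expert-weight name expansion for quantized layouts.
# The helper and its naming scheme are illustrative, not vLLM's actual API.

# Components a GPTQ/moe_wna16 layout stores per logical weight (per the report).
QUANT_COMPONENTS = ("qweight", "qzeros", "scales", "g_idx")

def expand_expert_name(logical_name: str, quantized: bool) -> list[str]:
    """Map a logical expert weight name (e.g. 'layers.0.mlp.experts.w2_weight')
    to the checkpoint keys that would actually hold it."""
    if not quantized:
        # Unquantized checkpoints store the tensor under the logical name.
        return [logical_name]
    # Quantized checkpoints split the tensor into several components,
    # e.g. 'layers.0.mlp.experts.w2_qweight', '..._w2_qzeros', etc.
    base = logical_name.removesuffix("_weight")
    return [f"{base}_{comp}" for comp in QUANT_COMPONENTS]

# A loader that only tries the unquantized key against a quantized
# checkpoint will fail with exactly the KeyError shown in the snippet.
print(expand_expert_name("layers.0.mlp.experts.w2_weight", quantized=True))
```

In other words, the fix would likely be in how the MTP loader resolves base-model expert names, not in the checkpoint itself.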
Same for me