AWQ 4-bit version of this Opus-Distilled-v2 model?
Hi,
Thank you for your excellent NVFP4 quantizations.
I'm using Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 (the v2 version with 14k Opus samples). It's currently the best reasoning model I have for coding and agent tasks: shorter CoT and better efficiency than the base Qwen3.5-27B.
However, I'm on a single RTX 5090 and really want to run it with vLLM + FlashInfer to get MTP, continuous batching, and higher throughput.
Would you consider making an AWQ 4-bit version of this Opus-Distilled-v2 model?
The distillation dataset is public, so the data is already available. Many users with 40/50-series cards are waiting for a good AWQ quant of this specific model.
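In case it helps, here's roughly how I'd picture the quantization going with AutoAWQ (just a sketch: the quant_config values are the usual AutoAWQ defaults, and the calibration samples are a placeholder for text drawn from the public distillation set, not anything from your actual pipeline):

```python
# Hypothetical AutoAWQ sketch; paths and config values are assumptions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
quant_path = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit"

# Common AWQ defaults: 4-bit weights, group size 128, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibration data: AutoAWQ accepts a list of strings (or a dataset name).
# Placeholder below; a few hundred samples from the public distillation
# set would presumably be the right calibration material here.
calib_samples = ["...representative reasoning traces from the public dataset..."]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```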
Thanks in advance!
Best regards
omw
https://huggingface.co/mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit
Let me know if there are any errors or problems.
Huge thanks for the lightning-fast turnaround! Just one day after the request.
Now I can finally run this beast with vLLM + FlashInfer: MTP + continuous batching. Going from ~45 tok/s (GGUF) to potentially 150+ tok/s on a single 5090.
This is why open source is unbeatable.
I'll provide feedback right away if I run into any issues.
Do you have any suggestions or tips before I start testing it with vLLM + FlashInfer?
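For reference, this is roughly the setup I'm planning to test with (a sketch only: the FlashInfer env var and the memory/context numbers are my own assumptions for a 32 GB 5090, and I've left out MTP/speculative-decoding flags since those vary by vLLM version):

```python
# Sketch of my planned vLLM setup; the model repo is the AWQ quant above,
# other values are assumptions for a single RTX 5090 (32 GB).
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # opt into FlashInfer

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-AWQ-4bit",
    quantization="awq",           # load the 4-bit AWQ weights
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
    max_model_len=32768,          # assumed context budget for 32 GB
)

params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```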
@mconcat Thank you for sharing!!! Would you consider making a 4B/9B AWQ 4-bit model too? Many thanks!! I could run that on my laptop, lol