MoE efficiency vs uncensored deployment: practical trade-offs
The 0/465 refusals claim is compelling for research use cases where refusal triggers interfere with legitimate tasks. The 35B MoE with ~3B active parameters is an efficient profile for local deployment.
A few practical questions:
On the aggressive uncensoring: have you observed any degradation on reasoning benchmarks relative to the base Qwen3.5-35B-A3B? The trade-off is often between uncensoring and instruction-following precision.
For the MoE architecture (256 experts, 8 routed + 1 shared), what VRAM usage are you seeing with the ~20GB Q4_K_M quant? With only ~3B active parameters the compute should fit consumer GPUs, though all 35B of weights still have to be resident; also curious about the added mmproj overhead for the multimodal path.
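For context, a back-of-envelope check that a ~20GB Q4_K_M is consistent with 35B total parameters. The ~4.85 bits/weight average for Q4_K_M is an assumption for illustration, not a measured figure for this model:

```python
# Rough weight-memory estimate for a quantized MoE.
# Note: all experts must be resident in VRAM even though
# only ~3B parameters are active per token.

def quant_size_gib(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB for a given average bit-width."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 2**30

# Assumed ~4.85 bits/weight average for Q4_K_M (illustrative).
weights = quant_size_gib(35, 4.85)
print(f"weights ~ {weights:.1f} GiB")  # ~19.8 GiB, in line with the ~20GB quant
```

KV cache and the mmproj weights come on top of this, which is why the headroom question matters on 24GB cards.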
On the hybrid DeltaNet + softmax attention: have you benchmarked throughput against pure softmax attention? Linear attention should help at long context (262K native), but I'm curious about real-world latency.
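The expected win is easy to sketch from complexity alone. A minimal FLOP-scaling comparison, with constants omitted and the head dimension assumed (128 is a guess, not taken from the model card):

```python
# Illustrative FLOP scaling: softmax attention is O(n^2 * d) in
# sequence length n, while a DeltaNet-style linear/recurrent layer
# is O(n * d^2) with a fixed-size state.

def softmax_attn_flops(n: int, d: int) -> int:
    return 2 * n * n * d  # QK^T scores plus attention-weighted V

def linear_attn_flops(n: int, d: int) -> int:
    return 2 * n * d * d  # per-token state update of a d x d state

n, d = 262_144, 128  # 262K context, assumed head dim (hypothetical)
print(softmax_attn_flops(n, d) / linear_attn_flops(n, d))  # ratio = n/d = 2048.0
```

The asymptotic ratio is just n/d, which is why the interesting question is measured latency: wall-clock gains depend on how the hybrid interleaves the two layer types and on kernel quality, not on the big-O.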
Useful to see imatrix-preserved quants across the full range. For production deployments that need fully uncensored output, the ~24GB Q5_K_M looks like the quality/size sweet spot.