MoE efficiency vs uncensored deployment: practical trade-offs

#31
by O96a - opened

The 0/465 refusals claim is compelling for research use cases where refusal triggers interfere with legitimate tasks. The 35B MoE with ~3B active parameters is an efficient profile for local deployment.
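For context on why the ~3B-active figure is plausible, here is a back-of-envelope sketch of the active-parameter arithmetic, using the 256-expert, 8-routed + 1-shared layout mentioned below. The `DENSE_FRACTION` (share of non-expert attention/embedding parameters) is an assumption of mine, not a number from the model card.

```python
# Rough sketch (assumed config split, not the model's actual one) of how a
# 35B-total / ~3B-active profile falls out of the MoE layout: 256 experts
# per MoE layer, with 8 routed + 1 shared expert active per token.
TOTAL_PARAMS = 35e9          # total parameter count (from the model card)
EXPERTS = 256                # experts per MoE layer
ACTIVE_EXPERTS = 8 + 1       # 8 routed + 1 shared
DENSE_FRACTION = 0.05        # ASSUMED share of non-expert (attention/embedding) params

dense = TOTAL_PARAMS * DENSE_FRACTION
expert_pool = TOTAL_PARAMS - dense
active = dense + expert_pool * (ACTIVE_EXPERTS / EXPERTS)
print(f"~{active / 1e9:.1f}B active per token")
```

With those assumptions the dense share plus 9/256 of the expert pool lands right around 3B active per token, which is what makes the compute cost per token so much lower than a dense 35B model.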

A few practical questions:

  1. The aggressive uncensoring: have you observed any degradation on reasoning benchmarks compared to the base Qwen3.5-35B-A3B? The technical trade-off is often between uncensoring and instruction-following precision.

  2. For the MoE architecture (256 experts, 8 routed + 1 shared), what peak VRAM are you seeing with the ~20GB Q4_K_M quant? With ~3B active parameters it should fit on consumer GPUs, but I'm curious about the mmproj overhead for multimodal.

  3. The hybrid DeltaNet + softmax attention: have you benchmarked throughput against pure softmax attention? Linear attention should help at long context (262K native), but I'm curious about the real-world latency.
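On question 2, the weight footprint alone is easy to sanity-check. The bits-per-weight figures below are approximate averages for llama.cpp K-quants, and they cover weights only; KV cache, mmproj, and runtime buffers come on top and are not estimated here.

```python
# Back-of-envelope weight-size estimate for the quants discussed in this
# thread. The bpw values are approximate llama.cpp averages (Q4_K_M ~4.85,
# Q5_K_M ~5.69), not measured numbers for this specific model.
GIB = 1024**3

def quant_size_gib(params: float, bpw: float) -> float:
    """Weight size in GiB for a given average bits-per-weight."""
    return params * bpw / 8 / GIB

weights_q4 = quant_size_gib(35e9, 4.85)
weights_q5 = quant_size_gib(35e9, 5.69)
print(f"Q4_K_M weights: ~{weights_q4:.1f} GiB")
print(f"Q5_K_M weights: ~{weights_q5:.1f} GiB")
```

That puts Q4_K_M just under 20 GiB of weights, which is why the actual VRAM headroom for the multimodal projector and KV cache on a 24GB card is the interesting number.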

It's useful to see imatrix-preserved quants across the full range. For production deployments that need fully uncensored output, the ~24GB Q5_K_M looks like the quality/size sweet spot.
