MoE efficiency vs uncensored deployment: practical trade-offs

#31
by O96a - opened

The 0/465 refusals claim is compelling for research use cases where refusal triggers interfere with legitimate tasks. The 35B MoE with ~3B active parameters is an efficient profile for local deployment.
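For context on why the ~3B-active figure is plausible, here is a back-of-envelope sketch of the active-parameter arithmetic, using the 256-expert, 8-routed + 1-shared layout mentioned below. The `DENSE_FRACTION` (share of non-expert attention/embedding parameters) is an assumption of mine, not a number from the model card.

```python
# Rough sketch (assumed config split, not the model's actual one) of how a
# 35B-total / ~3B-active profile falls out of the MoE layout: 256 experts
# per MoE layer, with 8 routed + 1 shared expert active per token.
TOTAL_PARAMS = 35e9          # total parameter count (from the model card)
EXPERTS = 256                # experts per MoE layer
ACTIVE_EXPERTS = 8 + 1       # 8 routed + 1 shared
DENSE_FRACTION = 0.05        # ASSUMED share of non-expert (attention/embedding) params

dense = TOTAL_PARAMS * DENSE_FRACTION
expert_pool = TOTAL_PARAMS - dense
active = dense + expert_pool * (ACTIVE_EXPERTS / EXPERTS)
print(f"~{active / 1e9:.1f}B active per token")
```

With those assumptions the dense share plus 9/256 of the expert pool lands right around 3B active per token, which is what makes the compute cost per token so much lower than a dense 35B model.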

A few practical questions:

  1. The aggressive uncensoring: have you observed any degradation on reasoning benchmarks compared to the base Qwen3.5-35B-A3B? The technical trade-off is often between uncensoring and instruction-following precision.

  2. For the MoE architecture (256 experts, 8 routed + 1 shared), what peak VRAM are you seeing with the ~20GB Q4_K_M quant? With ~3B active parameters it should fit on consumer GPUs, but I'm curious about the mmproj overhead for multimodal.

  3. The hybrid DeltaNet + softmax attention: have you benchmarked throughput against pure softmax attention? Linear attention should help at long context (262K native), but I'm curious about the real-world latency.
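On question 2, the weight footprint alone is easy to sanity-check. The bits-per-weight figures below are approximate averages for llama.cpp K-quants, and they cover weights only; KV cache, mmproj, and runtime buffers come on top and are not estimated here.

```python
# Back-of-envelope weight-size estimate for the quants discussed in this
# thread. The bpw values are approximate llama.cpp averages (Q4_K_M ~4.85,
# Q5_K_M ~5.69), not measured numbers for this specific model.
GIB = 1024**3

def quant_size_gib(params: float, bpw: float) -> float:
    """Weight size in GiB for a given average bits-per-weight."""
    return params * bpw / 8 / GIB

weights_q4 = quant_size_gib(35e9, 4.85)
weights_q5 = quant_size_gib(35e9, 5.69)
print(f"Q4_K_M weights: ~{weights_q4:.1f} GiB")
print(f"Q5_K_M weights: ~{weights_q5:.1f} GiB")
```

That puts Q4_K_M just under 20 GiB of weights, which is why the actual VRAM headroom for the multimodal projector and KV cache on a 24GB card is the interesting number.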

It's useful to see imatrix-preserved quants across the full range. For production deployments that need fully uncensored output, the ~24GB Q5_K_M looks like the quality/size sweet spot.
