# Nemotron-Cascade-2-30B-A3B-NVFP4
NVFP4 (4-bit) quantization of Nemotron-Cascade-2-30B-A3B. Quantized with NVIDIA ModelOpt using the same selective recipe as NVIDIA's official Nano NVFP4: MoE experts and Mamba GEMMs in NVFP4 (E2M1 with block scaling), attention and sensitive layers in BF16, KV cache in FP8. Native FP4 compute on Blackwell; weight-only dequant on Hopper/Ampere.
## Benchmarks
Calculated using NVIDIA-NeMo/Evaluator with the Nemotron-3-Super-120B eval config:
| Benchmark | Nemotron-Cascade-2-30B-A3B (reproduced results) | Nemotron-Cascade-2-30B-A3B-NVFP4 (this model) |
|---|---|---|
| AIME 2025 (avg@8) | 98.8 | 97.9 |
| AIME 2026 (avg@8) | 94.2 | 92.1 |
| HMMT Feb 2025 (avg@8) | 92.9 | 90.1 |
With the low sample count (8 rollouts per problem), a deviation of ±2% across runs is expected. The NVFP4 model consistently scores 1–2 points below the original BF16 checkpoint.
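The avg@8 metric above is the mean pass rate over 8 independent rollouts per problem, averaged across problems. A minimal sketch of the computation (illustrative only, not the NVIDIA-NeMo/Evaluator implementation):

```python
def avg_at_k(correct_per_problem):
    """correct_per_problem: one list of 0/1 rollout outcomes per problem.

    Returns the mean per-problem pass rate as a percentage.
    """
    per_problem = [sum(rollouts) / len(rollouts) for rollouts in correct_per_problem]
    return 100.0 * sum(per_problem) / len(per_problem)

# Two problems: one solved in all 8 rollouts, one in 4 of 8 -> 75.0
print(avg_at_k([[1] * 8, [1, 1, 1, 1, 0, 0, 0, 0]]))
```

Because each problem contributes only 8 samples, a handful of flipped rollouts moves a benchmark score by a full point, which is why run-to-run deviation of a couple of percent is normal.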
## Quantization Details
- Method: NVFP4 Post-Training Quantization (PTQ), without Quantization-Aware Distillation (QAD)
- Format: E2M1 (1 sign, 2 exponent, 1 mantissa bit) with hierarchical block scaling
- Block scaling: Group size 16 — each block of 16 values shares an FP8 E4M3 scale, plus a per-tensor FP32 global scale
- KV cache: FP8
- Tooling: NVIDIA ModelOpt
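To make the E2M1-with-block-scaling scheme concrete, here is a toy sketch of how a block of 16 values would be quantized and dequantized. This is illustrative only, not the ModelOpt kernel: real NVFP4 stores the block scale in FP8 E4M3 and packs codes into 4-bit fields, neither of which is modeled here.

```python
# The 8 non-negative magnitudes representable in E2M1
# (1 sign, 2 exponent, 1 mantissa bit): max value is 6.0.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of 16 floats; returns (scale, signed grid codes).

    In the real format the scale would be rounded to FP8 E4M3 and further
    multiplied by a per-tensor FP32 global scale.
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0   # map the block max onto E2M1's top value
    codes = []
    for v in block:
        mag = abs(v) / scale
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # round to nearest grid point
        codes.append(-q if v < 0 else q)
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate values: code * block scale."""
    return [scale * c for c in codes]
```

Values that land exactly on the scaled grid (e.g. the block maximum) round-trip losslessly; everything else snaps to the nearest of 15 signed levels, which is why sensitive layers are kept in BF16.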
### Selective Quantization Recipe
Follows the Nano-architecture selective quantization recipe from the Nemotron 3 Nano Technical Report (Section 4). Same recipe as NVIDIA's official NVFP4 checkpoint. Sensitive components are kept in higher precision:
| Component | Precision | Rationale |
|---|---|---|
| MoE expert GEMMs (routed + shared) | NVFP4 | All 23 MoE layers, 128 routed + 2 shared experts each |
| Mamba GEMMs (non-adjacent) | NVFP4 | 17 of 23 Mamba layers |
| Attention layers (all 6) | BF16 | Most sensitive — kept BF16 per NVIDIA sensitivity analysis |
| Mamba layers adjacent to attention (6) | BF16 | Layers {4, 11, 18, 25, 32, 41} — found sensitive in ablations |
| Mamba 1D conv | BF16 | All layers |
| Router gates | FP32 | Routing precision must not degrade |
| Embeddings & lm_head | BF16 | Not quantized |
| KV cache | FP8 | All 6 attention layers |
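The table above can be read as a name-and-index lookup from layer to precision. The sketch below expresses that lookup in Python; the layer-name substrings are hypothetical and chosen for illustration, since the actual checkpoint's module names may differ:

```python
# Mamba layers directly adjacent to attention, kept in BF16 per the recipe.
ATTENTION_ADJACENT_MAMBA = {4, 11, 18, 25, 32, 41}

def layer_precision(name, layer_idx=-1):
    """Map a (hypothetical) module name and layer index to a precision string."""
    if "router" in name:
        return "fp32"            # routing precision must not degrade
    if "embed" in name or "lm_head" in name:
        return "bf16"            # not quantized
    if "conv1d" in name:
        return "bf16"            # Mamba 1D conv stays BF16 in all layers
    if "attn" in name:
        return "bf16"            # all 6 attention layers are most sensitive
    if "mamba" in name and layer_idx in ATTENTION_ADJACENT_MAMBA:
        return "bf16"            # attention-adjacent Mamba layers
    if "expert" in name or "mamba" in name:
        return "nvfp4"           # MoE expert GEMMs + remaining Mamba GEMMs
    return "bf16"                # default: leave anything unmatched alone
```

In ModelOpt such a policy is normally expressed declaratively as a quantization config with disabled patterns rather than imperative code, but the effective partition is the same.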
### Calibration
- Dataset: 4,000 samples from nvidia/Nemotron-Cascade-2-SFT-Data
- Domain mix: math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
- Sequence length: Up to 12,288 tokens (no padding, natural length per sample)
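Assembling such a domain-stratified calibration set is straightforward; a minimal sketch, with the dataset loading stubbed out (the counts come from the card, everything else is an assumption):

```python
import random

# Per-domain sample counts from the calibration mix above (total 4,000).
DOMAIN_COUNTS = {
    "math": 1000, "swe": 900, "terminal_agent": 500, "science": 500,
    "chat": 400, "conversational_agent": 300, "instruction_following": 300,
    "safety": 100,
}

def build_calibration_set(samples_by_domain, seed=0):
    """samples_by_domain: dict mapping domain name -> list of samples.

    Draws the fixed per-domain counts without replacement, then shuffles
    so no single domain dominates any contiguous calibration batch.
    """
    rng = random.Random(seed)
    calib = []
    for domain, n in DOMAIN_COUNTS.items():
        calib.extend(rng.sample(samples_by_domain[domain], n))
    rng.shuffle(calib)
    return calib
```

Keeping samples at their natural length (no padding) matters for calibration quality: activation ranges are collected only over real tokens, so pad tokens never skew the observed scales.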
## Usage
### SGLang

```bash
python -m sglang.launch_server \
  --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3
```
### vLLM

```bash
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
  --mamba_ssm_cache_dtype float32 \
  --max-model-len 262144 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3 \
  --kv-cache-dtype fp8
```
## GPU Requirements
| Architecture | GPU Examples | FP4 Support |
|---|---|---|
| Blackwell (SM100+) | B200, RTX 5090 | Native W4A4 — full compute speedup |
| Hopper (SM90) | H100, L40S | Weight-only dequantization at runtime |
| Ampere (SM80/SM86) | RTX 3090, A100 | Not supported |
Native FP4 Tensor Core compute requires Blackwell GPUs. On older architectures, weights are stored in FP4 but dequantized to FP16/BF16 at runtime — you still get the VRAM savings but not the compute speedup.
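The memory arithmetic behind the VRAM savings is easy to sketch: each weight costs 4 bits, plus an FP8 (8-bit) block scale amortized over every group of 16 values; the per-tensor FP32 global scale is negligible. A back-of-the-envelope calculation for the quantized tensors only (BF16-retained layers, activations, and KV cache are not counted):

```python
def effective_bits_nvfp4(group_size=16, scale_bits=8):
    """Bits per weight: 4-bit value plus amortized block scale."""
    return 4 + scale_bits / group_size

def compression_vs_bf16(group_size=16):
    """How many times smaller a quantized tensor is than its BF16 original."""
    return 16 / effective_bits_nvfp4(group_size)

print(effective_bits_nvfp4())   # 4.5 bits per weight
print(compression_vs_bf16())    # ~3.56x smaller than BF16
```

This ~3.6x shrink applies whether or not the GPU has FP4 Tensor Cores, which is why Hopper still benefits from the VRAM reduction despite running weight-only dequantization.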
## Acknowledgments
- Quantization recipe based on the Nemotron 3 Nano Technical Report
- Quantized with NVIDIA ModelOpt
Base model: nvidia/Nemotron-Cascade-2-30B-A3B