Nemotron-Cascade-2-30B-A3B-NVFP4

NVFP4 (4-bit) quantization of Nemotron-Cascade-2-30B-A3B. Quantized with NVIDIA ModelOpt using the same selective recipe as NVIDIA's official Nano NVFP4 checkpoint: MoE experts and Mamba GEMMs in NVFP4 (E2M1 with block scaling), attention and other sensitive layers in BF16, and the KV cache in FP8. Native FP4 compute on Blackwell; weight-only dequantization on Hopper.

Benchmarks

Computed with NVIDIA-NeMo/Evaluator, using the eval config from Nemotron-3-Super-120B:

Benchmark              Nemotron-Cascade-2-30B-A3B   Nemotron-Cascade-2-30B-A3B-NVFP4
                       (reproduced results)         (this model)
AIME 2025 (avg@8)      98.8                         97.9
AIME 2026 (avg@8)      94.2                         92.1
HMMT Feb 2025 (avg@8)  92.9                         90.1

With the low sample count (8 rollouts per problem), a deviation of ±2% across runs is expected. The NVFP4 model scores consistently 1-2 points below the original BF16 checkpoint.
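The ±2% figure is consistent with a back-of-envelope sampling-noise estimate. As a sketch (assuming roughly 30 problems per benchmark and a uniform per-rollout success rate, both simplifications not stated in the card):

```python
import math

def avg_at_k_std(p, k, n_problems):
    """Run-to-run std of the benchmark score if every problem has
    per-rollout accuracy p and is scored as avg@k over k rollouts."""
    per_problem_var = p * (1 - p) / k   # binomial variance of avg@k
    return math.sqrt(per_problem_var / n_problems)

# ~95% per-rollout accuracy, avg@8, assumed 30 problems:
std = avg_at_k_std(p=0.95, k=8, n_problems=30)
print(f"{100 * std:.1f}%")  # one-sigma deviation from sampling alone
```

One sigma comes out around 1.4%, so ±2% run-to-run swings are unsurprising, while the consistent 1-2 point NVFP4 gap across all three benchmarks is unlikely to be noise alone.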

Quantization Details

  • Method: NVFP4 Post-Training Quantization (PTQ), without Quantization-Aware Distillation (QAD)
  • Format: E2M1 (1 sign, 2 exponent, 1 mantissa bit) with hierarchical block scaling
  • Block scaling: Group size 16 — each block of 16 values shares an FP8 E4M3 scale, plus a per-tensor FP32 global scale
  • KV cache: FP8
  • Tooling: NVIDIA ModelOpt
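To make the format concrete, here is a minimal pure-Python sketch of E2M1 block quantization with a shared per-block scale. This is an illustration of the number format, not the ModelOpt kernel: real NVFP4 stores the scale in FP8 E4M3 and applies an additional per-tensor FP32 global scale.

```python
# Positive values representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit):
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize 16 floats to signed E2M1 codes plus one shared scale."""
    assert len(block) == 16
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map block max to E2M1's max (6.0)
    codes = []
    for v in block:
        mag = abs(v) / scale
        q = min(E2M1_VALUES, key=lambda e: abs(e - mag))  # round to nearest code
        codes.append(-q if v < 0 else q)
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

block = [0.1, -0.4, 1.2, 3.5, -6.0, 0.0, 2.2, 0.7,
         -1.1, 4.8, 0.05, -2.9, 1.9, -0.6, 5.1, 0.3]
codes, scale = quantize_block(block)
recon = dequantize_block(codes, scale)
```

Note how coarse the grid is between 4.0 and 6.0; the group-of-16 scaling keeps each block's dynamic range small enough that this rarely matters in practice.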

Selective Quantization Recipe

Follows the Nano-architecture selective quantization recipe from the Nemotron 3 Nano Technical Report (Section 4). Same recipe as NVIDIA's official NVFP4 checkpoint. Sensitive components are kept in higher precision:

Component                               Precision  Rationale
MoE expert GEMMs (routed + shared)      NVFP4      All 23 MoE layers, 128 routed + 2 shared experts each
Mamba GEMMs (non-adjacent)              NVFP4      17 of 23 Mamba layers
Attention layers (all 6)                BF16       Most sensitive; kept in BF16 per NVIDIA's sensitivity analysis
Mamba layers adjacent to attention (6)  BF16       Layers {4, 11, 18, 25, 32, 41}; found sensitive in ablations
Mamba 1D conv                           BF16       All layers
Router gates                            FP32       Routing precision must not degrade
Embeddings & lm_head                    BF16       Not quantized
KV cache                                FP8        All 6 attention layers
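The recipe above amounts to a per-module precision map. A hypothetical sketch of that map as a lookup function (the module labels here are illustrative, not the checkpoint's actual module names; the layer indices come from the table):

```python
# Mamba layers adjacent to attention, kept in BF16 per the ablations above:
ATTN_ADJACENT_MAMBA = {4, 11, 18, 25, 32, 41}

def precision_for(layer_idx, module):
    """Return the target precision for one module, per the selective recipe."""
    if module == "router":
        return "FP32"   # routing precision must not degrade
    if module in ("attention", "mamba_conv1d", "embedding", "lm_head"):
        return "BF16"   # most sensitive components stay in high precision
    if module == "mamba_gemm":
        return "BF16" if layer_idx in ATTN_ADJACENT_MAMBA else "NVFP4"
    if module == "moe_expert":
        return "NVFP4"  # expert GEMMs dominate the parameter count
    raise ValueError(f"unknown module: {module}")
```

Since the MoE expert GEMMs hold the vast majority of the 30B parameters, quantizing them alone captures most of the memory savings while the small, sensitive attention stack stays untouched.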

Calibration

  • Dataset: 4,000 samples from nvidia/Nemotron-Cascade-2-SFT-Data
  • Domain mix: math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
  • Sequence length: Up to 12,288 tokens (no padding, natural length per sample)
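As a sanity check, the domain counts above sum to exactly 4,000. A minimal sketch of assembling that mix (the `"domain"` field name is an assumption about the SFT data schema, not verified against nvidia/Nemotron-Cascade-2-SFT-Data):

```python
DOMAIN_MIX = {
    "math": 1000, "swe": 900, "terminal_agent": 500, "science": 500,
    "chat": 400, "conversational_agent": 300,
    "instruction_following": 300, "safety": 100,
}

def sample_mix(samples, mix):
    """Take up to mix[domain] samples per domain, preserving dataset order."""
    taken = {d: 0 for d in mix}
    out = []
    for s in samples:
        d = s["domain"]
        if d in taken and taken[d] < mix[d]:
            out.append(s)
            taken[d] += 1
    return out

# Synthetic stand-in rows for illustration (the real rows come from the dataset):
pool = [{"domain": d, "id": i} for d, n in DOMAIN_MIX.items() for i in range(n + 50)]
calib = sample_mix(pool, DOMAIN_MIX)  # exactly 4,000 calibration samples
```

Weighting calibration toward math and SWE mirrors the reasoning-heavy benchmarks the quantized model is evaluated on.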

Usage

SGLang

python -m sglang.launch_server \
    --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
    --trust-remote-code \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nano_v3

vLLM

vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
    --mamba_ssm_cache_dtype float32 \
    --max-model-len 262144 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nemotron_v3 \
    --kv-cache-dtype fp8

GPU Requirements

Architecture        GPU Examples     FP4 Support
Blackwell (SM100+)  B200, RTX 5090   Native W4A4; full compute speedup
Hopper (SM90)       H100, L40S       Weight-only dequantization at runtime
Ampere (SM80/86)    A100, RTX 3090   Not supported

Native FP4 Tensor Core compute requires Blackwell GPUs. On older architectures, weights are stored in FP4 but dequantized to FP16/BF16 at runtime — you still get the VRAM savings but not the compute speedup.
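The VRAM savings can be estimated from the bit widths alone. A rough sketch, assuming ~30B total parameters with about 90% covered by the NVFP4 recipe and 10% kept in BF16 (these fractions are illustrative; the exact split follows the layer table above):

```python
GiB = 1024 ** 3

def weight_bytes(n_params, frac_fp4):
    """Approximate weight storage for a partially NVFP4-quantized model."""
    # NVFP4: 4 bits per value plus one FP8 (8-bit) scale per 16 values
    # = 4.5 effective bits per value.
    fp4_bytes = n_params * frac_fp4 * 4.5 / 8
    bf16_bytes = n_params * (1 - frac_fp4) * 2  # 16 bits per value
    return fp4_bytes + bf16_bytes

print(f"{weight_bytes(30e9, 0.90) / GiB:.1f} GiB")  # vs ~55.9 GiB in pure BF16
```

That puts the quantized weights around 20 GiB, roughly a third of the full BF16 footprint, which is what makes single-GPU serving on a 24 GB+ card plausible.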
