Nemotron-Cascade-2-30B-A3B-NVFP4

NVFP4 (4-bit) quantization of Nemotron-Cascade-2-30B-A3B. Quantized with NVIDIA ModelOpt using the same selective recipe as NVIDIA's official Nano NVFP4 checkpoint: MoE experts and Mamba GEMMs in NVFP4 (E2M1 with block scaling), attention and other sensitive layers in BF16, and the KV cache in FP8. Native FP4 compute on Blackwell; weight-only dequantization on Hopper.

Benchmarks

Computed with NVIDIA-NeMo/Evaluator, using the eval config from Nemotron-3-Super-120B:

Benchmark              Nemotron-Cascade-2-30B-A3B   Nemotron-Cascade-2-30B-A3B-NVFP4
                       (reproduced results)         (this model)
AIME 2025 (avg@8)      98.8                         97.9
AIME 2026 (avg@8)      94.2                         92.1
HMMT Feb 2025 (avg@8)  92.9                         90.1

With the low sample count (8 rollouts per problem), a deviation of ±2% across runs is expected. The NVFP4 model scores consistently 1-2 points below the original BF16 checkpoint.
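The ±2% figure is consistent with a back-of-envelope sampling-noise estimate. As a sketch (assuming roughly 30 problems per benchmark and a uniform per-rollout success rate, both simplifications not stated in the card):

```python
import math

def avg_at_k_std(p, k, n_problems):
    """Run-to-run std of the benchmark score if every problem has
    per-rollout accuracy p and is scored as avg@k over k rollouts."""
    per_problem_var = p * (1 - p) / k   # binomial variance of avg@k
    return math.sqrt(per_problem_var / n_problems)

# ~95% per-rollout accuracy, avg@8, assumed 30 problems:
std = avg_at_k_std(p=0.95, k=8, n_problems=30)
print(f"{100 * std:.1f}%")  # one-sigma deviation from sampling alone
```

One sigma comes out around 1.4%, so ±2% run-to-run swings are unsurprising, while the consistent 1-2 point NVFP4 gap across all three benchmarks is unlikely to be noise alone.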

Quantization Details

  • Method: NVFP4 Post-Training Quantization (PTQ), without Quantization-Aware Distillation (QAD)
  • Format: E2M1 (1 sign, 2 exponent, 1 mantissa bit) with hierarchical block scaling
  • Block scaling: Group size 16 — each block of 16 values shares an FP8 E4M3 scale, plus a per-tensor FP32 global scale
  • KV cache: FP8
  • Tooling: NVIDIA ModelOpt
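To make the format concrete, here is a minimal pure-Python sketch of E2M1 block quantization with a shared per-block scale. This is an illustration of the number format, not the ModelOpt kernel: real NVFP4 stores the scale in FP8 E4M3 and applies an additional per-tensor FP32 global scale.

```python
# Positive values representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit):
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize 16 floats to signed E2M1 codes plus one shared scale."""
    assert len(block) == 16
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map block max to E2M1's max (6.0)
    codes = []
    for v in block:
        mag = abs(v) / scale
        q = min(E2M1_VALUES, key=lambda e: abs(e - mag))  # round to nearest code
        codes.append(-q if v < 0 else q)
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

block = [0.1, -0.4, 1.2, 3.5, -6.0, 0.0, 2.2, 0.7,
         -1.1, 4.8, 0.05, -2.9, 1.9, -0.6, 5.1, 0.3]
codes, scale = quantize_block(block)
recon = dequantize_block(codes, scale)
```

Note how coarse the grid is between 4.0 and 6.0; the group-of-16 scaling keeps each block's dynamic range small enough that this rarely matters in practice.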

Selective Quantization Recipe

Follows the Nano-architecture selective quantization recipe from the Nemotron 3 Nano Technical Report (Section 4). Same recipe as NVIDIA's official NVFP4 checkpoint. Sensitive components are kept in higher precision:

Component                               Precision  Rationale
MoE expert GEMMs (routed + shared)      NVFP4      All 23 MoE layers, 128 routed + 2 shared experts each
Mamba GEMMs (non-adjacent)              NVFP4      17 of 23 Mamba layers
Attention layers (all 6)                BF16       Most sensitive; kept in BF16 per NVIDIA's sensitivity analysis
Mamba layers adjacent to attention (6)  BF16       Layers {4, 11, 18, 25, 32, 41}; found sensitive in ablations
Mamba 1D conv                           BF16       All layers
Router gates                            FP32       Routing precision must not degrade
Embeddings & lm_head                    BF16       Not quantized
KV cache                                FP8        All 6 attention layers
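The recipe above amounts to a per-module precision map. A hypothetical sketch of that map as a lookup function (the module labels here are illustrative, not the checkpoint's actual module names; the layer indices come from the table):

```python
# Mamba layers adjacent to attention, kept in BF16 per the ablations above:
ATTN_ADJACENT_MAMBA = {4, 11, 18, 25, 32, 41}

def precision_for(layer_idx, module):
    """Return the target precision for one module, per the selective recipe."""
    if module == "router":
        return "FP32"   # routing precision must not degrade
    if module in ("attention", "mamba_conv1d", "embedding", "lm_head"):
        return "BF16"   # most sensitive components stay in high precision
    if module == "mamba_gemm":
        return "BF16" if layer_idx in ATTN_ADJACENT_MAMBA else "NVFP4"
    if module == "moe_expert":
        return "NVFP4"  # expert GEMMs dominate the parameter count
    raise ValueError(f"unknown module: {module}")
```

Since the MoE expert GEMMs hold the vast majority of the 30B parameters, quantizing them alone captures most of the memory savings while the small, sensitive attention stack stays untouched.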

Calibration

  • Dataset: 4,000 samples from nvidia/Nemotron-Cascade-2-SFT-Data
  • Domain mix: math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
  • Sequence length: Up to 12,288 tokens (no padding, natural length per sample)
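As a sanity check, the domain counts above sum to exactly 4,000. A minimal sketch of assembling that mix (the `"domain"` field name is an assumption about the SFT data schema, not verified against nvidia/Nemotron-Cascade-2-SFT-Data):

```python
DOMAIN_MIX = {
    "math": 1000, "swe": 900, "terminal_agent": 500, "science": 500,
    "chat": 400, "conversational_agent": 300,
    "instruction_following": 300, "safety": 100,
}

def sample_mix(samples, mix):
    """Take up to mix[domain] samples per domain, preserving dataset order."""
    taken = {d: 0 for d in mix}
    out = []
    for s in samples:
        d = s["domain"]
        if d in taken and taken[d] < mix[d]:
            out.append(s)
            taken[d] += 1
    return out

# Synthetic stand-in rows for illustration (the real rows come from the dataset):
pool = [{"domain": d, "id": i} for d, n in DOMAIN_MIX.items() for i in range(n + 50)]
calib = sample_mix(pool, DOMAIN_MIX)  # exactly 4,000 calibration samples
```

Weighting calibration toward math and SWE mirrors the reasoning-heavy benchmarks the quantized model is evaluated on.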

Usage

SGLang

python -m sglang.launch_server \
    --model chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
    --trust-remote-code \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nano_v3

vLLM

vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4 \
    --mamba_ssm_cache_dtype float32 \
    --max-model-len 262144 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nemotron_v3 \
    --kv-cache-dtype fp8

GPU Requirements

Architecture        GPU Examples     FP4 Support
Blackwell (SM100+)  B200, RTX 5090   Native W4A4; full compute speedup
Hopper (SM90)       H100, L40S       Weight-only dequantization at runtime
Ampere (SM80/86)    A100, RTX 3090   Not supported

Native FP4 Tensor Core compute requires Blackwell GPUs. On older architectures, weights are stored in FP4 but dequantized to FP16/BF16 at runtime — you still get the VRAM savings but not the compute speedup.
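The VRAM savings can be estimated from the bit widths alone. A rough sketch, assuming ~30B total parameters with about 90% covered by the NVFP4 recipe and 10% kept in BF16 (these fractions are illustrative; the exact split follows the layer table above):

```python
GiB = 1024 ** 3

def weight_bytes(n_params, frac_fp4):
    """Approximate weight storage for a partially NVFP4-quantized model."""
    # NVFP4: 4 bits per value plus one FP8 (8-bit) scale per 16 values
    # = 4.5 effective bits per value.
    fp4_bytes = n_params * frac_fp4 * 4.5 / 8
    bf16_bytes = n_params * (1 - frac_fp4) * 2  # 16 bits per value
    return fp4_bytes + bf16_bytes

print(f"{weight_bytes(30e9, 0.90) / GiB:.1f} GiB")  # vs ~55.9 GiB in pure BF16
```

That puts the quantized weights around 20 GiB, roughly a third of the full BF16 footprint, which is what makes single-GPU serving on a 24 GB+ card plausible.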
