MiniMax-M2.7 AWQ-G32-STRIX-2H

MiniMax-M2.7-AWQ-G32-STRIX-2H is a mixed-precision AWQ quantization of amd/MiniMax-M2.7-BF16, built for two-node AMD Strix Halo (gfx1151) inference with vLLM + Ray tensor parallelism.

The quantization recipe keeps attention, routing, embeddings, normalization, and the final four main-model MoE expert layers in BF16, while quantizing the bulk of the MoE expert weights to INT4 W4A16 AWQ with group size 32. The goal is to preserve long-context behavior and reasoning quality while fitting MiniMax-M2.7 into a 2× Strix Halo deployment target.

Model details

Field Value
Public name MiniMax-M2.7-AWQ-G32-STRIX-2H
Suggested vLLM served name minimax-m2-7-awq-g32-strix-2h
Base model amd/MiniMax-M2.7-BF16
Base revision 92d4d55827de5231e493f0cf6e66e1b255749592
Quantization format compressed-tensors AWQ metadata + safetensors
Weight precision Mixed BF16 + INT4 W4A16
INT4 group size 32
Estimated model size in memory ~145 GiB / 155.27 GB
Safetensors shards 32
Target runtime vLLM + Ray tensor parallelism on 2× Strix Halo
Configured/tested max context 196,608 tokens
Estimated memory-budget context ceiling ~230K-280K tokens for one active sequence, depending on runtime overhead and how much Strix Halo UMA is exposed to ROCm/vLLM

Intended deployment target

This quant is intended for a two-system AMD Strix Halo setup:

  • 2× Strix Halo / gfx1151 GPUs
  • vLLM OpenAI-compatible serving
  • Ray distributed executor
  • tensor parallel size 2
  • ROCm-based runtime

It is not designed for single-GPU Strix Halo serving. The model size and long-context KV cache budget assume tensor parallelism across two Strix Halo systems.

Strix Halo vLLM setup

For the Strix Halo ROCm/vLLM environment, use the Strix Halo vLLM toolbox:

That project provides a Strix Halo-oriented vLLM container/toolbox environment for AMD Ryzen AI Max / Strix Halo (gfx1151) systems. The model-specific settings used for this quant are listed below.

vLLM launch example

vllm serve /path/to/model \
  --served-model-name minimax-m2-7-awq-g32-strix-2h \
  --host 127.0.0.1 \
  --port "${PORT:-8000}" \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --enforce-eager \
  --gpu-memory-utilization 0.92 \
  --max-model-len 196608 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 20480 \
  --dtype auto \
  --load-format safetensors \
  --trust-remote-code \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice \
  --override-generation-config '{"max_new_tokens": 10000, "temperature": 0.2, "top_p": 0.9, "repetition_penalty": 1.08}'

Recommended environment flags for ROCm/vLLM testing:

export VLLM_ROCM_USE_AITER=0
export OMP_NUM_THREADS=1
export TOKENIZERS_PARALLELISM=false
export TORCHDYNAMO_DISABLE=1
export RAY_CGRAPH_get_timeout=1800
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_SAMPLER=0

If this mixed BF16/INT4 MoE artifact hits a ROCm/vLLM MoE backend issue, also test with:

export VLLM_USE_FLASHINFER_MOE_FP16=0

Backend behavior can vary across vLLM, ROCm, and toolbox builds.

Quantization recipe

Precision map

Component Precision Rationale
Self-attention q/k/v/o projections BF16 Preserves long-context coherence; attention was intentionally not quantized.
Router / MoE gate BF16 Protects expert routing quality at low memory cost.
Embeddings and lm_head BF16 Standard low-risk preservation choice.
LayerNorms BF16 Small memory cost; avoids unnecessary numerical risk.
Main-model MoE experts, layers 0-57, w1/w2/w3 INT4 W4A16 AWQ, group size 32 Main parameter savings.
Main-model MoE experts, layers 58-61, w1/w2/w3 BF16 Late-layer carve-out intended to protect reasoning/generation behavior.
MTP module experts INT4 W4A16, group size 32 Quantized by the broad expert target pattern.

AWQ recipe

The recipe uses a single AWQModifier, rather than splitting AWQ smoothing and quantization into separate modifier passes.

default_stage:
  default_modifiers:
    AWQModifier:
      mappings:
        - smooth_layer: "re:.*post_attention_layernorm$"
          balance_layers: ["re:.*w1$", "re:.*w3$"]
        - smooth_layer: "re:.*w3$"
          balance_layers: ["re:.*w2$"]
      ignore:
        - "lm_head"
        - "embed_tokens"
        - "re:.*self_attn.*"
        - "re:.*block_sparse_moe\\.gate$"
        - "re:.*\\.layers\\.(58|59|60|61)\\.block_sparse_moe\\.experts\\.[0-9]+\\.(w1|w2|w3)$"
      config_groups:
        mlp_experts_projections:
          targets:
            - "re:.*block_sparse_moe\\.experts\\.[0-9]+\\.(w1|w2|w3)$"
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            observer: minmax
      duo_scaling: true

Calibration

Setting Value
Calibration samples 256
Max calibration sequence length 4,096
Seed 42
Quantization hardware 8× A100 80GB, 1+ TB NVMe scratch, ~512 GB system RAM

Calibration data mix:

  • 40% NVIDIA Llama-Nemotron SFT chat/science
  • 25% code from NVIDIA Llama-Nemotron SFT code and bigcode/the-stack-smol
  • 20% math from NVIDIA Llama-Nemotron SFT math, GSM8K, and MATH-500
  • 15% long-context samples from DKYoon/SlimPajama-6B with at least 4K-token texts

Why this quantization recipe is different

This quant intentionally spends more memory than smaller all-expert INT4 quants. The main design choices are:

  1. Group size 32 AWQ: smaller groups increase metadata/scale overhead but generally improve weight reconstruction quality compared with larger groups.
  2. BF16 attention: attention projections are left unquantized to protect long-context behavior.
  3. BF16 router gates: MoE routing is left unquantized to avoid compounding expert-selection errors.
  4. BF16 final expert layers: the last four main-model expert layers are preserved in BF16 as a late-generation/reasoning carve-out.
  5. Diverse calibration set: calibration uses code, math, chat/science, and long-context samples instead of a single-domain corpus.
  6. Unified AWQ modifier: smoothing and quantization remain in one AWQ modifier path, avoiding the earlier split-modifier failure mode observed during development.

These choices make the artifact larger than a data-free INT4 AWQ quant, but are intended to preserve more of the BF16 model's behavior under long-context and reasoning-heavy workloads.

Memory and context behavior on 2× Strix Halo

The model weights occupy approximately 145 GiB in memory. With vLLM tensor parallelism across two Strix Halo systems, the tested configuration uses:

  • --tensor-parallel-size 2
  • --max-model-len 196608
  • --max-num-seqs 2
  • --max-num-batched-tokens 20480
  • --gpu-memory-utilization 0.92

Estimated BF16 KV-cache usage for MiniMax-M2.7 at 196,608 tokens:

Active sequences KV cache estimate Practical implication
1 ~46.51 GiB The configured 196,608-token context is the intended long-context target.
2 ~93.03 GiB Two simultaneous full-length 196K requests are not expected to fit comfortably.

With all remaining memory assigned to KV cache, the one-sequence theoretical context ceiling is roughly 230K-280K tokens depending on runtime overhead and how much Strix Halo unified memory ROCm/vLLM exposes as usable GPU memory. The public serving configuration is still capped at 196,608 tokens because that is the tested long-context target with activation and startup headroom.

In this configuration, --max-num-seqs=2 should be interpreted as a peak-concurrency setting. For two simultaneous long-context requests, practical effective context is expected to be substantially lower than 196K per request, around the 110K-120K range per request for this size class.

Benchmarks and validation

Benchmark results

The following benchmark comparison was run against the unquantized reference, QuantTrio's AWQ quant, and this quant:

Benchmark Unquantized QuantTrio MiniMax-M2.7-AWQ-G32-STRIX-2H
HumanEval 91.46% 84.15% 90.24%
MBPP Plus 93.29% 87.20% 92.07%
MMLU College CS 96.00% 99.00% 97.00%
MMLU Computer Security 90.00% 89.00% 88.00%
MMLU Machine Learning 91.96% 91.07% 89.29%
GSM8K slice 100.00% 100.00% 100.00%

Summary versus QuantTrio:

  • HumanEval: +6.09 percentage points over QuantTrio, within 1.22 points of the unquantized reference.
  • MBPP Plus: +4.87 percentage points over QuantTrio, within 1.22 points of the unquantized reference.
  • GSM8K slice: tied at 100.00% across all three.

These results match the recipe goal: the larger mixed BF16/INT4 artifact recovers most of the unquantized code benchmark performance while substantially outperforming the smaller QuantTrio AWQ quant on HumanEval and MBPP Plus.

Artifact validation

This artifact has been validated for structural recovery and serving:

  • 32 safetensors shards recovered.
  • Artifact size verified at 155,271,743,469 bytes.
  • vLLM route started successfully on a 2× Strix Halo Ray setup.
  • OpenAI-compatible /v1/chat/completions smoke test returned HTTP 200.
  • Smoke prompt Compute 47*53 step by step. returned the correct result, 2491.

Recommended additional evaluation set for users who want to reproduce or extend the comparison:

  • HumanEval
  • MBPP / MBPP+
  • MMLU subsets
  • GSM8K
  • MATH-500
  • AIME 2024
  • GPQA
  • RULER at long-context lengths such as 32K, 64K, and 128K

Intended use

  • Local and research inference on 2× Strix Halo systems.
  • Long-context testing with vLLM on AMD ROCm.
  • A/B testing mixed BF16/INT4 precision strategies for MiniMax-M2.7.
  • Reasoning, code, math, and long-context experiments where local inference is preferred.

Limitations

  • Not intended for single-GPU Strix Halo deployment.
  • Benchmarks are limited to the listed suite and should not be interpreted as broad safety or capability evaluation.
  • The artifact is larger than data-free all-expert INT4 quants because it intentionally preserves BF16 attention, routing, and late expert layers.
  • Effective context depends on vLLM version, ROCm build, allocator behavior, runtime overhead, and concurrency.

License

This quant is a derivative of amd/MiniMax-M2.7-BF16 / MiniMaxAI/MiniMax-M2.7 and follows the upstream MiniMax-M2.7 license.

MiniMax-M2.7 is released under a custom non-commercial license: non-commercial use is permitted under MIT-style terms, while commercial use requires prior written authorization from MiniMax. See the included LICENSE file and the upstream license for details.

Citation / attribution

Downloads last month
51
Safetensors
Model size
51B params
Tensor type
BF16
·
I64
·
I32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ayysasha/MiniMax-M2.7-AWQ-G32-STRIX-2H

Quantized
(1)
this model