Gemma 4 A4B 109-Expert (22.4B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 109 experts per MoE layer.

| | Original (128e) | This model (109e) | Delta |
|---|---|---|---|
| Total params | 26B | 22.4B | -14% |
| Experts per layer | 128 | 109 | -19/layer |
| Top-k routing | 8 | 8 | — |
| GGUF Q4_K_M size | 16 GB | 14 GB | -12.5% |
| bf16 size | 50 GB | 42 GB | -16% |
| GPQA Diamond | 75.25% ± 3.07% | 71.72% ± 3.21% | -3.5 pp |

Google's published GPQA Diamond score for the original model is 82.3% (bf16, generative CoT evaluation). Both our scores use identical evaluation settings: Q4_K_M quantization via llama.cpp, gpqa_diamond_cot_zeroshot task, 32k context, DeepSeek reasoning format with 8192 token budget. The gap from Google's score is due to quantization and evaluation pipeline differences.

The expert pruning costs only 3.5 percentage points while removing 14% of parameters.

Pruning Method

Expert Contribution Analysis

The pruning decision was based on measuring the actual contribution of each expert to the model's residual stream, not just routing frequency or activation patterns.

Two-phase profiling (expert_contribution_v2.py):

  1. Phase 1 — Generate: Run prompts through the model with CoT thinking enabled (128 new tokens per prompt). No hooks, full-speed generation.
  2. Phase 2 — Profile: Replay the full prompt+response sequence in a single forward pass with hooks that recompute each expert's output and measure ||routing_weight * expert_output||_2 — the actual L2 norm of each expert's weighted contribution to the residual stream.

Calibration data: 8 prompts in each of 5 domains (math, logic, code, science, creative writing) = 40 prompts total, generating 128 tokens each with thinking enabled.
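The Phase 2 metric can be sketched on a toy MoE layer as follows. This is a minimal illustration with made-up sizes (16 experts, hidden 8, top-2 instead of 128/2816/top-8), not the actual `expert_contribution_v2.py` code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 16 experts, hidden size 8, top-2 routing
# (the real model uses 128 experts, hidden 2816, top-8).
n_experts, hidden, top_k, n_tokens = 16, 8, 2, 100
router_w = rng.normal(size=(n_experts, hidden))
experts = rng.normal(size=(n_experts, hidden, hidden)) / np.sqrt(hidden)

contribution = np.zeros(n_experts)  # accumulated ||w_i * e_i(x)||_2 per expert

for _ in range(n_tokens):
    x = rng.normal(size=hidden)
    logits = router_w @ x
    top = np.argsort(logits)[-top_k:]                           # top-k expert ids
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over top-k
    for w, i in zip(weights, top):
        # The Phase 2 metric: L2 norm of the expert's weighted
        # contribution to the residual stream.
        contribution[i] += np.linalg.norm(w * (experts[i] @ x))

# Experts ranked by total contribution; the bottom of this
# ranking is what gets pruned.
ranking = np.argsort(contribution)[::-1]
```

In the real pipeline the same accumulation happens inside forward hooks during the single replay pass, so routing weights come from the model's own router rather than a synthetic one.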

Key Findings

  • Experts are NOT topic-specialized: Top-32 experts overlap 28/32 between math and creative writing. The same experts handle all domains.
  • Contribution is moderately concentrated: Top-8 experts capture ~24% of total weighted output norm. ~75 experts are needed to reach 80% of contribution. Gini coefficient ~0.38.
  • Per-layer variation is significant: Some layers are highly concentrated (Layer 10: top expert contributes 46%), others are distributed (Layer 29: top-8 only 33%).
  • Bottom experts contribute almost nothing: In concentrated layers like L4, the bottom 64 experts contribute only 0.7% of total output.
  • Expert weight similarity is near zero: Cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging weights destroys the model (confirmed experimentally: a 64-expert merged model produced repetition loops and scored 0% on GPQA).
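The concentration statistics above (top-8 share, experts needed for 80% coverage, Gini coefficient) can be computed from the per-expert contribution norms like this. The contribution profile below is synthetic and purely illustrative; `gini` is the standard discrete formula:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array (0 = perfectly even)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # Standard discrete formula: rank-weighted sum over sorted values.
    return (2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum())

def coverage(contrib, frac):
    """Smallest number of top experts whose norms reach `frac` of the total."""
    c = np.sort(contrib)[::-1]
    return int(np.searchsorted(np.cumsum(c) / c.sum(), frac) + 1)

# Synthetic 128-expert contribution profile (illustrative only).
contrib = np.random.default_rng(1).gamma(2.0, size=128)
top8_share = np.sort(contrib)[::-1][:8].sum() / contrib.sum()
n_for_80 = coverage(contrib, 0.80)
g = gini(contrib)
```

Applied to the measured norms, these give the ~24% top-8 share, ~75 experts for 80% coverage, and Gini ~0.38 reported above.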

Pruning Decision

Uniform 109 experts per layer (19 dropped per layer), selected as the most conservative threshold where all layers retain >99% of weighted output norm:

| Layer | Retained % | Layer | Retained % | Layer | Retained % |
|---|---|---|---|---|---|
| L0 | 99.66% | L10 | 99.88% | L20 | 99.88% |
| L1 | 99.58% | L11 | 99.19% | L21 | 99.85% |
| L2 | 99.52% | L12 | 99.25% | L22 | 99.99% |
| L3 | 99.87% | L13 | 99.42% | L23 | 99.89% |
| L4 | 99.97% | L14 | 99.01% | L24 | 99.93% |
| L5 | 99.90% | L15 | 99.10% | L25 | 99.88% |
| L6 | 99.75% | L16 | 99.80% | L26 | 99.90% |
| L7 | 99.27% | L17 | 99.74% | L27 | 99.96% |
| L8 | 99.20% | L18 | 99.88% | L28 | 99.78% |
| L9 | 99.16% | L19 | 99.72% | L29 | 99.70% |

Worst case: Layer 14 at 99.01%. Different experts are dropped in each layer based on per-layer contribution ranking. The drop map is stored in expert_drop_metadata.json.
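One way to derive such a uniform keep count is to find the smallest per-layer top-k that keeps every layer above the retention threshold. This is a sketch under that assumption, not the actual `expert_drop.py` logic, and the layer profiles here are synthetic:

```python
import numpy as np

def retained_fraction(contrib, keep):
    """Fraction of a layer's total weighted output norm kept by its top-`keep` experts."""
    c = np.sort(contrib)[::-1]
    return c[:keep].sum() / c.sum()

def pick_uniform_keep(per_layer_contrib, threshold=0.99):
    """Smallest uniform keep count where EVERY layer stays above `threshold`."""
    n_experts = len(per_layer_contrib[0])
    for keep in range(1, n_experts + 1):
        if all(retained_fraction(c, keep) >= threshold for c in per_layer_contrib):
            return keep
    return n_experts

# Illustrative: 30 layers x 128 experts of synthetic contribution norms.
rng = np.random.default_rng(2)
layers = [rng.gamma(2.0, size=128) for _ in range(30)]
keep = pick_uniform_keep(layers, threshold=0.99)

# Per-layer drop map: the (128 - keep) lowest-contribution expert ids,
# analogous to what expert_drop_metadata.json stores.
drop_map = {f"L{i}": np.argsort(c)[: 128 - keep].tolist() for i, c in enumerate(layers)}
```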

Why Not Merge Experts?

Expert merging (averaging weights of similar experts) was attempted and failed:

  • 64-expert merged model: Completely broken — repetition loops, 0% GPQA
  • 96-expert merged model: Degraded quality
  • Root cause: Weight-space cosine similarity between experts is ~0.05 (essentially zero). Experts that activate on similar tokens compute completely different functions. Averaging their weights destroys both functions.
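The similarity check and the failure mode of averaging can be demonstrated on random matrices. A minimal sketch (sizes are illustrative, not the model's):

```python
import numpy as np

def expert_cosine_sim(w_a, w_b):
    """Cosine similarity between two expert weight matrices, flattened."""
    a, b = w_a.ravel(), w_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
hidden, inter = 64, 16
w1 = rng.normal(size=(inter, hidden))
w2 = rng.normal(size=(inter, hidden))

sim = expert_cosine_sim(w1, w2)  # near zero for independent experts

# Averaging two near-orthogonal experts yields a function close to
# neither original: the merged output halves both contributions and
# adds interference from the other expert.
x = rng.normal(size=hidden)
merged = 0.5 * (w1 + w2)
rel_err = np.linalg.norm(merged @ x - w1 @ x) / np.linalg.norm(w1 @ x)
```

With near-zero cosine similarity, `rel_err` is large for any input, which is consistent with the merged models degrading or breaking outright.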

The router's proj.weight is resized from [128, hidden] to [109, hidden], keeping only the rows corresponding to retained experts. The router adapts naturally since it was trained to select from the full set: removed experts simply become unavailable, and the top-8 selection falls back to the next-best experts.
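The resize itself is just a row gather. A minimal sketch with illustrative tensor names (the retained indices would come from expert_drop_metadata.json in the real pipeline):

```python
import numpy as np

n_experts, hidden, keep = 128, 2816, 109
rng = np.random.default_rng(4)
router_w = rng.normal(size=(n_experts, hidden))  # router proj.weight, [128, hidden]

# Indices of retained experts for this layer (random here; the real
# pipeline reads them from the per-layer contribution ranking).
keep_idx = np.sort(rng.choice(n_experts, size=keep, replace=False))

# Keep only the router rows for retained experts. Expert i in the new
# model is old expert keep_idx[i], so the expert weight tensors are
# gathered in the same order and top-8 routing needs no other change.
new_router_w = router_w[keep_idx]
```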

GPQA Diamond Evaluation

Setup

  • Quantization: GGUF Q4_K_M via llama.cpp llama-quantize
  • Inference: llama.cpp server (llama-server)
  • Evaluation: lm-evaluation-harness v0.4.11, task gpqa_diamond_cot_zeroshot
  • Backend: local-chat-completions against llama.cpp OpenAI-compatible API

Configuration

| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek (separates thinking into `reasoning_content`) |
| Reasoning budget | 8192 tokens (caps CoT thinking) |
| max_gen_toks | 16384 |
| Temperature | 0 (greedy) |
| GPU | NVIDIA RTX 3090 (24 GB) |
| GPU layers | 99 (fully offloaded) |

The reasoning budget is critical: without it, the model enters overthinking loops on hard questions ("Wait, let me re-read...") and exhausts the full context without committing to an answer. This is a property of the base model, not the pruned version — the 128-expert original exhibits identical behavior. With the 8192-token budget, llama.cpp's DeepSeek reasoning format caps the thinking phase and forces a concise answer.

Results

| Metric | Score |
|---|---|
| flexible-extract | 71.72% ± 3.21% |
| strict-match | 0.51% ± 0.51% |

All 198 questions were answered; only 1 response had an unparseable answer.

Error Analysis

Of 198 questions:

  • 142 correct (71.7%)
  • 55 wrong answer (27.8%) — model committed to an incorrect choice
  • 1 invalid extraction (0.5%) — answer not parseable

Wrong answers are evenly distributed across choices (A: 12, B: 11, C: 16, D: 14), indicating no systematic bias. The strict-match score is near zero because the model uses CoT reasoning format, not bare letter answers.
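The strict-vs-flexible gap can be illustrated with simplified stand-in patterns (these are not lm-evaluation-harness's exact regexes):

```python
import re

response = (
    "Let me work through the kinetics. The rate law is second order, "
    "so doubling the concentration quadruples the rate. "
    "The answer is (C)."
)

# Illustrative patterns only: strict matching expects a bare letter
# answer; flexible extraction scans the full CoT response for the
# last (A)-(D) choice.
strict = re.fullmatch(r"\(?([ABCD])\)?\.?", response.strip())
flexible = re.findall(r"\(([ABCD])\)", response)

assert strict is None       # CoT prose never matches the strict form
assert flexible[-1] == "C"  # flexible extraction recovers the choice
```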

Response statistics:

  • Median response length: 2,536 characters (~630 tokens)
  • Mean: 3,840 characters (~960 tokens)
  • 6 responses exceeded 20k characters (hard questions hitting the reasoning budget)
  • 1 response exceeded 30k characters

Note on Evaluation Methodology

Both 109e and 128e reference were evaluated with identical settings for a direct apples-to-apples comparison. Google's published 82.3% uses bf16 precision with their own evaluation pipeline. Our Q4_K_M + llama.cpp pipeline scores the full 128e model at 75.25%, indicating ~7 points of quality loss from quantization and eval methodology differences. The expert pruning itself costs only 3.5 additional points (75.25% → 71.72%).

Architecture

Unchanged from the original except num_experts: 109 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 (per expert)
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8
  • Attention: Hybrid sliding (5) + global (1) pattern
  • Vocabulary: 262,144

Files

  • config.json — Model config with num_experts: 109
  • model-*.safetensors — Model weights (9 shards, 42 GB total bf16)
  • expert_drop_metadata.json — Per-layer keep/drop expert indices and methodology
  • gpqa_results/ — Full GPQA Diamond evaluation results and per-sample logs

How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "path/to/gemma-4-A4B-109e",
    torch_dtype=torch.bfloat16,  # weights are stored in bf16
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("path/to/gemma-4-A4B-109e")
```

Or convert to GGUF for llama.cpp:

```shell
python llama.cpp/convert_hf_to_gguf.py path/to/gemma-4-A4B-109e --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```

Reproduction

Scripts for the full pipeline are in the scripts/ directory of the repository:

  1. expert_contribution_v2.py — Two-phase expert contribution analysis
  2. expert_drop.py — Expert pruning based on contribution drop map
  3. eval_gpqa_v3.sh — GPQA Diamond evaluation via llama.cpp + lm-eval

License

This model inherits the Gemma license from the base model.
