Gemma 4 A4B 109-Expert (22.4B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 109 experts per MoE layer.

| | Original (128e) | This model (109e) | Delta |
|---|---|---|---|
| Total params | 26B | 22.4B | -14% |
| Experts per layer | 128 | 109 | -19/layer |
| Top-k routing | 8 | 8 | — |
| GGUF Q4_K_M size | 16 GB | 14 GB | -12.5% |
| bf16 size | 50 GB | 42 GB | -16% |
| GPQA Diamond | 75.25% ± 3.07% | 71.72% ± 3.21% | -3.5 pp |

Google's published GPQA Diamond score for the original model is 82.3% (bf16, generative CoT evaluation). Both our scores use identical evaluation settings: Q4_K_M quantization via llama.cpp, gpqa_diamond_cot_zeroshot task, 32k context, DeepSeek reasoning format with 8192 token budget. The gap from Google's score is due to quantization and evaluation pipeline differences.

The expert pruning costs only 3.5 percentage points while removing 14% of parameters.

Pruning Method

Expert Contribution Analysis

The pruning decision was based on measuring the actual contribution of each expert to the model's residual stream, not just routing frequency or activation patterns.

Two-phase profiling (expert_contribution_v2.py):

  1. Phase 1 — Generate: Run prompts through the model with CoT thinking enabled (128 new tokens per prompt). No hooks, full-speed generation.
  2. Phase 2 — Profile: Replay the full prompt+response sequence in a single forward pass with hooks that recompute each expert's output and measure ||routing_weight * expert_output||_2 — the actual L2 norm of each expert's weighted contribution to the residual stream.

Calibration data: 8 prompts in each of 5 domains (math, logic, code, science, creative writing) = 40 prompts total, generating 128 tokens each with thinking enabled.
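The Phase 2 metric can be sketched on a toy MoE layer as follows. This is a minimal illustration with made-up sizes (16 experts, hidden 8, top-2 instead of 128/2816/top-8), not the actual `expert_contribution_v2.py` code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 16 experts, hidden size 8, top-2 routing
# (the real model uses 128 experts, hidden 2816, top-8).
n_experts, hidden, top_k, n_tokens = 16, 8, 2, 100
router_w = rng.normal(size=(n_experts, hidden))
experts = rng.normal(size=(n_experts, hidden, hidden)) / np.sqrt(hidden)

contribution = np.zeros(n_experts)  # accumulated ||w_i * e_i(x)||_2 per expert

for _ in range(n_tokens):
    x = rng.normal(size=hidden)
    logits = router_w @ x
    top = np.argsort(logits)[-top_k:]                           # top-k expert ids
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over top-k
    for w, i in zip(weights, top):
        # The Phase 2 metric: L2 norm of the expert's weighted
        # contribution to the residual stream.
        contribution[i] += np.linalg.norm(w * (experts[i] @ x))

# Experts ranked by total contribution; the bottom of this
# ranking is what gets pruned.
ranking = np.argsort(contribution)[::-1]
```

In the real pipeline the same accumulation happens inside forward hooks during the single replay pass, so routing weights come from the model's own router rather than a synthetic one.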

Key Findings

  • Experts are NOT topic-specialized: Top-32 experts overlap 28/32 between math and creative writing. The same experts handle all domains.
  • Contribution is moderately concentrated: Top-8 experts capture ~24% of total weighted output norm. ~75 experts are needed to reach 80% of contribution. Gini coefficient ~0.38.
  • Per-layer variation is significant: Some layers are highly concentrated (Layer 10: top expert contributes 46%), others are distributed (Layer 29: top-8 only 33%).
  • Bottom experts contribute almost nothing: In concentrated layers like L4, the bottom 64 experts contribute only 0.7% of total output.
  • Expert weight similarity is near zero: Cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging weights destroys the model (confirmed experimentally: a 64-expert merged model produced repetition loops and scored 0% on GPQA).
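The concentration statistics above (top-8 share, experts needed for 80% coverage, Gini coefficient) can be computed from the per-expert contribution norms like this. The contribution profile below is synthetic and purely illustrative; `gini` is the standard discrete formula:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array (0 = perfectly even)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # Standard discrete formula: rank-weighted sum over sorted values.
    return (2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum())

def coverage(contrib, frac):
    """Smallest number of top experts whose norms reach `frac` of the total."""
    c = np.sort(contrib)[::-1]
    return int(np.searchsorted(np.cumsum(c) / c.sum(), frac) + 1)

# Synthetic 128-expert contribution profile (illustrative only).
contrib = np.random.default_rng(1).gamma(2.0, size=128)
top8_share = np.sort(contrib)[::-1][:8].sum() / contrib.sum()
n_for_80 = coverage(contrib, 0.80)
g = gini(contrib)
```

Applied to the measured norms, these give the ~24% top-8 share, ~75 experts for 80% coverage, and Gini ~0.38 reported above.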

Pruning Decision

Uniform 109 experts per layer (19 dropped per layer), selected as the most conservative threshold where all layers retain >99% of weighted output norm:

| Layer | Retained % | Layer | Retained % | Layer | Retained % |
|---|---|---|---|---|---|
| L0 | 99.66% | L10 | 99.88% | L20 | 99.88% |
| L1 | 99.58% | L11 | 99.19% | L21 | 99.85% |
| L2 | 99.52% | L12 | 99.25% | L22 | 99.99% |
| L3 | 99.87% | L13 | 99.42% | L23 | 99.89% |
| L4 | 99.97% | L14 | 99.01% | L24 | 99.93% |
| L5 | 99.90% | L15 | 99.10% | L25 | 99.88% |
| L6 | 99.75% | L16 | 99.80% | L26 | 99.90% |
| L7 | 99.27% | L17 | 99.74% | L27 | 99.96% |
| L8 | 99.20% | L18 | 99.88% | L28 | 99.78% |
| L9 | 99.16% | L19 | 99.72% | L29 | 99.70% |

Worst case: Layer 14 at 99.01%. Different experts are dropped in each layer based on per-layer contribution ranking. The drop map is stored in expert_drop_metadata.json.
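One way to derive such a uniform keep count is to find the smallest per-layer top-k that keeps every layer above the retention threshold. This is a sketch under that assumption, not the actual `expert_drop.py` logic, and the layer profiles here are synthetic:

```python
import numpy as np

def retained_fraction(contrib, keep):
    """Fraction of a layer's total weighted output norm kept by its top-`keep` experts."""
    c = np.sort(contrib)[::-1]
    return c[:keep].sum() / c.sum()

def pick_uniform_keep(per_layer_contrib, threshold=0.99):
    """Smallest uniform keep count where EVERY layer stays above `threshold`."""
    n_experts = len(per_layer_contrib[0])
    for keep in range(1, n_experts + 1):
        if all(retained_fraction(c, keep) >= threshold for c in per_layer_contrib):
            return keep
    return n_experts

# Illustrative: 30 layers x 128 experts of synthetic contribution norms.
rng = np.random.default_rng(2)
layers = [rng.gamma(2.0, size=128) for _ in range(30)]
keep = pick_uniform_keep(layers, threshold=0.99)

# Per-layer drop map: the (128 - keep) lowest-contribution expert ids,
# analogous to what expert_drop_metadata.json stores.
drop_map = {f"L{i}": np.argsort(c)[: 128 - keep].tolist() for i, c in enumerate(layers)}
```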

Why Not Merge Experts?

Expert merging (averaging weights of similar experts) was attempted and failed:

  • 64-expert merged model: Completely broken — repetition loops, 0% GPQA
  • 96-expert merged model: Degraded quality
  • Root cause: Weight-space cosine similarity between experts is ~0.05 (essentially zero). Experts that activate on similar tokens compute completely different functions. Averaging their weights destroys both functions.
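The similarity check and the failure mode of averaging can be demonstrated on random matrices. A minimal sketch (sizes are illustrative, not the model's):

```python
import numpy as np

def expert_cosine_sim(w_a, w_b):
    """Cosine similarity between two expert weight matrices, flattened."""
    a, b = w_a.ravel(), w_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
hidden, inter = 64, 16
w1 = rng.normal(size=(inter, hidden))
w2 = rng.normal(size=(inter, hidden))

sim = expert_cosine_sim(w1, w2)  # near zero for independent experts

# Averaging two near-orthogonal experts yields a function close to
# neither original: the merged output halves both contributions and
# adds interference from the other expert.
x = rng.normal(size=hidden)
merged = 0.5 * (w1 + w2)
rel_err = np.linalg.norm(merged @ x - w1 @ x) / np.linalg.norm(w1 @ x)
```

With near-zero cosine similarity, `rel_err` is large for any input, which is consistent with the merged models degrading or breaking outright.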

The router's proj.weight is resized from [128, hidden] to [109, hidden], keeping only the rows corresponding to retained experts. The router adapts naturally since it was trained to select from the full set: removed experts simply become unavailable, and the top-8 selection falls back to the next-best experts.
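The resize itself is just a row gather. A minimal sketch with illustrative tensor names (the retained indices would come from expert_drop_metadata.json in the real pipeline):

```python
import numpy as np

n_experts, hidden, keep = 128, 2816, 109
rng = np.random.default_rng(4)
router_w = rng.normal(size=(n_experts, hidden))  # router proj.weight, [128, hidden]

# Indices of retained experts for this layer (random here; the real
# pipeline reads them from the per-layer contribution ranking).
keep_idx = np.sort(rng.choice(n_experts, size=keep, replace=False))

# Keep only the router rows for retained experts. Expert i in the new
# model is old expert keep_idx[i], so the expert weight tensors are
# gathered in the same order and top-8 routing needs no other change.
new_router_w = router_w[keep_idx]
```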

GPQA Diamond Evaluation

Setup

  • Quantization: GGUF Q4_K_M via llama.cpp llama-quantize
  • Inference: llama.cpp server (llama-server)
  • Evaluation: lm-evaluation-harness v0.4.11, task gpqa_diamond_cot_zeroshot
  • Backend: local-chat-completions against llama.cpp OpenAI-compatible API

Configuration

| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek (separates thinking into `reasoning_content`) |
| Reasoning budget | 8192 tokens (caps CoT thinking) |
| max_gen_toks | 16384 |
| Temperature | 0 (greedy) |
| GPU | NVIDIA RTX 3090 (24 GB) |
| GPU layers | 99 (fully offloaded) |

The reasoning budget is critical: without it, the model enters overthinking loops on hard questions ("Wait, let me re-read...") and exhausts the full context without committing to an answer. This is a property of the base model, not the pruned version — the 128-expert original exhibits identical behavior. With the 8192-token budget, llama.cpp's DeepSeek reasoning format caps the thinking phase and forces a concise answer.

Results

| Metric | Score |
|---|---|
| flexible-extract | 71.72% ± 3.21% |
| strict-match | 0.51% ± 0.51% |

All 198 questions were answered; only 1 response had an unparseable answer.

Error Analysis

Of 198 questions:

  • 142 correct (71.7%)
  • 55 wrong answer (27.8%) — model committed to an incorrect choice
  • 1 invalid extraction (0.5%) — answer not parseable

Wrong answers are evenly distributed across choices (A: 12, B: 11, C: 16, D: 14), indicating no systematic bias. The strict-match score is near zero because the model uses CoT reasoning format, not bare letter answers.
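The strict-vs-flexible gap can be illustrated with simplified stand-in patterns (these are not lm-evaluation-harness's exact regexes):

```python
import re

response = (
    "Let me work through the kinetics. The rate law is second order, "
    "so doubling the concentration quadruples the rate. "
    "The answer is (C)."
)

# Illustrative patterns only: strict matching expects a bare letter
# answer; flexible extraction scans the full CoT response for the
# last (A)-(D) choice.
strict = re.fullmatch(r"\(?([ABCD])\)?\.?", response.strip())
flexible = re.findall(r"\(([ABCD])\)", response)

assert strict is None       # CoT prose never matches the strict form
assert flexible[-1] == "C"  # flexible extraction recovers the choice
```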

Response statistics:

  • Median response length: 2,536 characters (~630 tokens)
  • Mean: 3,840 characters (~960 tokens)
  • 6 responses exceeded 20k characters (hard questions hitting the reasoning budget)
  • 1 response exceeded 30k characters

Note on Evaluation Methodology

Both 109e and 128e reference were evaluated with identical settings for a direct apples-to-apples comparison. Google's published 82.3% uses bf16 precision with their own evaluation pipeline. Our Q4_K_M + llama.cpp pipeline scores the full 128e model at 75.25%, indicating ~7 points of quality loss from quantization and eval methodology differences. The expert pruning itself costs only 3.5 additional points (75.25% → 71.72%).

Architecture

Unchanged from the original except num_experts: 109 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 (per expert)
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8
  • Attention: Hybrid sliding (5) + global (1) pattern
  • Vocabulary: 262,144

Files

  • config.json — Model config with num_experts: 109
  • model-*.safetensors — Model weights (9 shards, 42 GB total bf16)
  • expert_drop_metadata.json — Per-layer keep/drop expert indices and methodology
  • gpqa_results/ — Full GPQA Diamond evaluation results and per-sample logs

How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "path/to/gemma-4-A4B-109e",
    torch_dtype=torch.bfloat16,  # weights are stored in bf16
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("path/to/gemma-4-A4B-109e")
```

Or convert to GGUF for llama.cpp:

```shell
python llama.cpp/convert_hf_to_gguf.py path/to/gemma-4-A4B-109e --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```

Reproduction

Scripts for the full pipeline are in the scripts/ directory of the repository:

  1. expert_contribution_v2.py — Two-phase expert contribution analysis
  2. expert_drop.py — Expert pruning based on contribution drop map
  3. eval_gpqa_v3.sh — GPQA Diamond evaluation via llama.cpp + lm-eval

License

This model inherits the Gemma license from the base model.
