# Gemma 4 A4B 109-Expert (22.4B)

Expert-pruned version of `google/gemma-4-26B-A4B-it`, reduced from 128 to 109 experts per MoE layer.
| | Original (128e) | This model (109e) | Delta |
|---|---|---|---|
| Total params | 26B | 22.4B | -14% |
| Experts per layer | 128 | 109 | -19/layer |
| Top-k routing | 8 | 8 | — |
| GGUF Q4_K_M size | 16 GB | 14 GB | -12.5% |
| bf16 size | 50 GB | 42 GB | -16% |
| GPQA Diamond | 75.25% ± 3.07% | 71.72% ± 3.21% | -3.5 pp |
Google's published GPQA Diamond score for the original model is 82.3% (bf16, generative CoT evaluation). Both our scores use identical evaluation settings: Q4_K_M quantization via llama.cpp, gpqa_diamond_cot_zeroshot task, 32k context, DeepSeek reasoning format with 8192 token budget. The gap from Google's score is due to quantization and evaluation pipeline differences.
The expert pruning costs only 3.5 percentage points while removing 14% of parameters.
## Pruning Method

### Expert Contribution Analysis
The pruning decision was based on measuring the actual contribution of each expert to the model's residual stream, not just routing frequency or activation patterns.
**Two-phase profiling** (`expert_contribution_v2.py`):
- Phase 1 — Generate: Run prompts through the model with CoT thinking enabled (128 new tokens per prompt). No hooks, full-speed generation.
- Phase 2 — Profile: Replay the full prompt+response sequence in a single forward pass with hooks that recompute each expert's output and measure `||routing_weight * expert_output||_2` — the actual L2 norm of each expert's weighted contribution to the residual stream.
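The Phase 2 measurement reduces to a weighted-norm computation per expert. Below is a minimal sketch with toy array shapes (the real script gathers these tensors via forward hooks on the MoE layers; the function name and shapes here are illustrative):

```python
import numpy as np

def expert_contribution(routing_weights, expert_outputs):
    """Sum over tokens of ||routing_weight * expert_output||_2 per expert.

    routing_weights: (n_tokens, n_experts), zero for experts outside the top-8
    expert_outputs:  (n_tokens, n_experts, hidden)
    Returns one contribution norm per expert: (n_experts,)
    """
    weighted = routing_weights[:, :, None] * expert_outputs
    return np.linalg.norm(weighted, axis=-1).sum(axis=0)

# toy example: 4 tokens, 6 experts, hidden size 8
rng = np.random.default_rng(0)
w = rng.random((4, 6))
outs = rng.standard_normal((4, 6, 8))
contrib = expert_contribution(w, outs)  # shape (6,), all non-negative
```

Summing these norms over the calibration set yields the per-expert ranking used below.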
Calibration data: 8 prompts across 5 domains (math, logic, code, science, creative writing) = 40 prompts total, generating 128 tokens each with thinking enabled.
### Key Findings
- Experts are NOT topic-specialized: Top-32 experts overlap 28/32 between math and creative writing. The same experts handle all domains.
- Contribution is moderately concentrated: Top-8 experts capture ~24% of total weighted output norm. ~75 experts are needed to reach 80% of contribution. Gini coefficient ~0.38.
- Per-layer variation is significant: Some layers are highly concentrated (Layer 10: top expert contributes 46%), others are distributed (Layer 29: top-8 only 33%).
- Bottom experts contribute almost nothing: In concentrated layers like L4, the bottom 64 experts contribute only 0.7% of total output.
- Expert weight similarity is near zero: Cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging weights destroys the model (confirmed experimentally: a 64-expert merged model produced repetition loops and scored 0% on GPQA).
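The concentration statistics above (e.g. the ~0.38 Gini coefficient) can be computed directly from the per-expert contribution norms. A standalone sketch, not the repo's exact code:

```python
import numpy as np

def gini(values):
    """Gini coefficient of a non-negative vector: 0 = perfectly equal,
    approaching 1 as a single entry dominates."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

equal = gini([1.0, 1.0, 1.0, 1.0])      # 0.0: all experts contribute equally
skewed = gini([0.0, 0.0, 0.0, 1.0])     # 0.75: one expert carries everything
```

A value around 0.38 sits well below full concentration, which is why ~75 experts are needed to cover 80% of the contribution.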
### Pruning Decision
Uniform 109 experts per layer (19 dropped per layer), selected as the most conservative threshold where all layers retain >99% of weighted output norm:
| Layer | Retained % | Layer | Retained % | Layer | Retained % |
|---|---|---|---|---|---|
| L0 | 99.66% | L10 | 99.88% | L20 | 99.88% |
| L1 | 99.58% | L11 | 99.19% | L21 | 99.85% |
| L2 | 99.52% | L12 | 99.25% | L22 | 99.99% |
| L3 | 99.87% | L13 | 99.42% | L23 | 99.89% |
| L4 | 99.97% | L14 | 99.01% | L24 | 99.93% |
| L5 | 99.90% | L15 | 99.10% | L25 | 99.88% |
| L6 | 99.75% | L16 | 99.80% | L26 | 99.90% |
| L7 | 99.27% | L17 | 99.74% | L27 | 99.96% |
| L8 | 99.20% | L18 | 99.88% | L28 | 99.78% |
| L9 | 99.16% | L19 | 99.72% | L29 | 99.70% |
Worst case: Layer 14 at 99.01%. Different experts are dropped in each layer based on per-layer contribution ranking. The drop map is stored in expert_drop_metadata.json.
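Per-layer selection reduces to ranking experts by contribution norm and checking the retained fraction. A toy sketch of this rank-and-keep step (assumed to mirror what `expert_drop.py` does, with made-up numbers):

```python
import numpy as np

def prune_layer(contrib, n_keep):
    """Keep the n_keep highest-contribution experts in a layer.

    Returns the sorted indices of retained experts and the fraction of the
    layer's total contribution norm they account for.
    """
    order = np.argsort(contrib)[::-1]          # experts ranked by contribution
    keep = np.sort(order[:n_keep])             # retained expert indices
    retained = contrib[keep].sum() / contrib.sum()
    return keep, retained

# toy layer with 8 experts, keeping the top 6
contrib = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.3, 0.2])
keep, retained = prune_layer(contrib, n_keep=6)  # keeps experts 0-5
```

For the real model this is run per layer with `n_keep=109`, and the resulting keep lists are what `expert_drop_metadata.json` records.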
### Why Not Merge Experts?
Expert merging (averaging weights of similar experts) was attempted and failed:
- 64-expert merged model: Completely broken — repetition loops, 0% GPQA
- 96-expert merged model: Degraded quality
- Root cause: Weight-space cosine similarity between experts is ~0.05 (essentially zero). Experts that activate on similar tokens compute completely different functions. Averaging their weights destroys both functions.
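The near-zero similarity claim is a plain cosine over flattened weight matrices. In dimensions this large, unrelated matrices land essentially at zero, as a sketch with random stand-in weights (expert shape `[704, 2816]` from the config) illustrates:

```python
import numpy as np

def expert_cosine(w_a, w_b):
    """Cosine similarity between two flattened expert weight matrices."""
    a, b = w_a.ravel(), w_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
w1 = rng.standard_normal((704, 2816))
w2 = rng.standard_normal((704, 2816))
sim = expert_cosine(w1, w2)  # near 0 for unrelated high-dimensional matrices
```

Averaging two such matrices produces a point roughly orthogonal to both functions, which is consistent with the merged models breaking down.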
The router projection (`proj.weight`) is resized from `[128, hidden]` to `[109, hidden]`, keeping only the rows corresponding to retained experts. The router naturally adapts since it was trained to select from the full set — removed experts simply become unavailable, and the top-8 selection falls to the next-best experts.
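In weight terms this is just a row selection on the router projection. A minimal sketch with an illustrative keep list (the real indices come from the per-layer drop map):

```python
import numpy as np

hidden = 2816
router_weight = np.random.randn(128, hidden)   # original router projection

# stand-in keep list of 109 indices; the real one comes from
# expert_drop_metadata.json and differs per layer
keep = np.arange(128)[np.arange(128) % 7 != 0][:109]

pruned_router = router_weight[keep, :]          # shape (109, hidden)
```

The per-expert MLP weight tensors for dropped experts are removed the same way, so router row *i* of the pruned model still points at expert *i* of the pruned expert stack.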
## GPQA Diamond Evaluation

### Setup
- Quantization: GGUF Q4_K_M via llama.cpp `llama-quantize`
- Inference: llama.cpp server (`llama-server`)
- Evaluation: lm-evaluation-harness v0.4.11, task `gpqa_diamond_cot_zeroshot`
- Backend: `local-chat-completions` against llama.cpp OpenAI-compatible API
### Configuration
| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek (separates thinking into reasoning_content) |
| Reasoning budget | 8192 tokens (caps CoT thinking) |
| max_gen_toks | 16384 |
| Temperature | 0 (greedy) |
| GPU | NVIDIA RTX 3090 (24 GB) |
| GPU layers | 99 (fully offloaded) |
The reasoning budget is critical: without it, the model enters overthinking loops on hard questions ("Wait, let me re-read...") and exhausts the full context without committing to an answer. This is a property of the base model, not the pruned version — the 128-expert original exhibits identical behavior. With the 8192-token budget, llama.cpp's DeepSeek reasoning format caps the thinking phase and forces a concise answer.
### Results
| Metric | Score |
|---|---|
| flexible-extract | 71.72% ± 3.21% |
| strict-match | 0.51% ± 0.51% |
All 198 questions were answered; only 1 answer could not be extracted.
### Error Analysis
Of 198 questions:
- 142 correct (71.7%)
- 55 wrong answer (27.8%) — model committed to an incorrect choice
- 1 invalid extraction (0.5%) — answer not parseable
Wrong answers are evenly distributed across choices (A: 12, B: 11, C: 16, D: 14), indicating no systematic bias. The strict-match score is near zero because the model uses CoT reasoning format, not bare letter answers.
Response statistics:
- Median response length: 2,536 characters (~630 tokens)
- Mean: 3,840 characters (~960 tokens)
- 6 responses exceeded 20k characters (hard questions hitting the reasoning budget)
- 1 response exceeded 30k characters
### Note on Evaluation Methodology
Both 109e and 128e reference were evaluated with identical settings for a direct apples-to-apples comparison. Google's published 82.3% uses bf16 precision with their own evaluation pipeline. Our Q4_K_M + llama.cpp pipeline scores the full 128e model at 75.25%, indicating ~7 points of quality loss from quantization and eval methodology differences. The expert pruning itself costs only 3.5 additional points (75.25% → 71.72%).
## Architecture

Unchanged from the original except `num_experts: 109` (was 128):
- Layers: 30
- Hidden size: 2816
- Expert intermediate size: 704 (per expert)
- Dense MLP intermediate size: 2112 (always active)
- Top-k routing: 8
- Attention: Hybrid sliding (5) + global (1) pattern
- Vocabulary: 262,144
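A back-of-the-envelope check of the expert parameter count from these numbers, assuming the usual three-matrix (gate/up/down) expert MLP:

```python
hidden, expert_inter, n_experts, n_layers = 2816, 704, 109, 30

per_expert = 3 * hidden * expert_inter          # gate, up, down projections
moe_total = per_expert * n_experts * n_layers   # expert params across all layers

print(f"{moe_total / 1e9:.1f}B expert parameters")  # ~19.4B
```

The remaining ~3B of the 22.4B total would sit in embeddings, attention, dense MLPs, and routers, which is consistent with a ~0.6B drop per 19 removed experts per layer.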
## Files

- `config.json` — Model config with `num_experts: 109`
- `model-*.safetensors` — Model weights (9 shards, 42 GB total bf16)
- `expert_drop_metadata.json` — Per-layer keep/drop expert indices and methodology
- `gpqa_results/` — Full GPQA Diamond evaluation results and per-sample logs
## How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "path/to/gemma-4-A4B-109e",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("path/to/gemma-4-A4B-109e")
```
Or convert to GGUF for llama.cpp:

```shell
python llama.cpp/convert_hf_to_gguf.py path/to/gemma-4-A4B-109e --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```
## Reproduction

Scripts for the full pipeline are in the `scripts/` directory of the repository:

- `expert_contribution_v2.py` — Two-phase expert contribution analysis
- `expert_drop.py` — Expert pruning based on the contribution drop map
- `eval_gpqa_v3.sh` — GPQA Diamond evaluation via llama.cpp + lm-eval
## License
This model inherits the Gemma license from the base model.