Gemma 4 19B-A4B-IT REAP Heretic — APEX Q3_K_M

APEX-quantized GGUF build of gemma-4-19b-a4b-it-REAP-heretic, produced with importance-matrix calibration and tensor-specific quantization.

Property       Value
Source Model   coder3101/gemma-4-19b-a4b-it-REAP-heretic
Quantization   Q3_K_M with imatrix + tensor-type config
Model Size     ~11 GB (F16: ~37 GB)
BPW            5.07 (bits per weight)
Tensors        658 total, 60 with fallback quantization

Model Lineage

This model went through several processing stages:

google/gemma-4-26b-a4b-it (Original, 26B)
    ↓ REAP (30% Expert Pruning)
0xSero/gemma-4-19b-a4b-it-REAP (19B)
    ↓ Heretic/Abliteration (ARA)
coder3101/gemma-4-19b-a4b-it-REAP-heretic (uncensored)
    ↓ APEX Quantization (this upload)
Benjamin-Wegener/gemma-4-19b-a4b-it-REAP-heretic-APEX-mini (GGUF Q3_K_M, ~11 GB)

APEX Quantization

This quantization follows the APEX approach by mudler, which optimizes MoE models through tensor-specific quantization strategies and importance-matrix calibration.

Steps Performed

  1. F16 GGUF Conversion
    Safetensors → gemma4-19b-a4b-reap-heretic-f16.gguf (36.9 GB)
    Using convert_hf_to_gguf.py from llama.cpp

  2. Importance Matrix Generation
    Calibration with ~48,600 tokens from diverse sources:

    • Multi-turn chat (~30%)
    • Code (~25%)
    • Reasoning (~25%)
    • Tool-calling (~20%)

    Source: apex-quant/apex_calibration_data

  3. Tensor-Specific Quantization
    Using the matched configuration file gemma4_19b_REAP_heretic_mini.txt with:

    • Q8_0: Router tensors (ffn_gate_inp)
    • Q5_K: Shared FFN (ffn_gate, ffn_up, ffn_down) in later layers
    • Q4_K: Attention tensors (attn_q, attn_k, attn_v, attn_output)
    • Q3_K: Fused expert tensors (ffn_gate_up_exps, ffn_down_exps)
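The three steps above correspond to three llama.cpp tools. The sketch below only assembles the command lines and prints them; it does not run anything. The input directory, calibration file name, imatrix file name, and the per-tensor overrides are placeholders chosen for illustration, not the values used for this upload. The actual run read its overrides from gemma4_19b_REAP_heretic_mini.txt, whose contents are not reproduced here.

```python
import shlex

F16_GGUF = "gemma4-19b-a4b-reap-heretic-f16.gguf"        # file name from step 1
OUT_GGUF = "gemma4-19b-a4b-reap-heretic-APEX-mini.gguf"  # file name used under Usage

# Hypothetical placeholders, not taken from the original run.
HF_DIR = "./gemma-4-19b-a4b-it-REAP-heretic"
CALIB_TXT = "calibration.txt"
IMATRIX = "imatrix.dat"

# Illustrative overrides that follow the bullet list in step 3.
OVERRIDES = {
    "ffn_gate_inp": "q8_0",
    "ffn_gate_up_exps": "q3_k",
    "ffn_down_exps": "q3_k",
}

def convert_cmd():
    """Step 1: safetensors checkpoint to an F16 GGUF file."""
    return ["python", "convert_hf_to_gguf.py", HF_DIR,
            "--outtype", "f16", "--outfile", F16_GGUF]

def imatrix_cmd():
    """Step 2: collect activation statistics on the calibration text."""
    return ["llama-imatrix", "-m", F16_GGUF, "-f", CALIB_TXT, "-o", IMATRIX]

def quantize_cmd():
    """Step 3: quantize with the imatrix and per-tensor type overrides."""
    cmd = ["llama-quantize", "--imatrix", IMATRIX]
    for pattern, qtype in OVERRIDES.items():
        cmd += ["--tensor-type", f"{pattern}={qtype}"]
    return cmd + [F16_GGUF, OUT_GGUF, "Q3_K_M"]

for build in (convert_cmd, imatrix_cmd, quantize_cmd):
    print(shlex.join(build()))
```

Keeping the commands as argument lists makes it easy to pass them to subprocess.run once the paths point at real files.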

Quantization Config Highlights

Tensor Type                  Quantization   Rationale
ffn_gate_inp (router)        Q8_0           Router logits need high precision
ffn_gate_up_exps             Q3_K           Largest tensors, aggressive compression
ffn_down_exps                Q3_K           Largest tensors, aggressive compression
ffn_gate/up/down (shared)    Q4_K–Q5_K      Always active and small, so higher precision is cheap
attn_q/k/v/output            Q3_K–Q4_K      Precision varies by layer

Full config: gemma4_19b_REAP_heretic_mini.txt
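To illustrate how a per-tensor configuration like this resolves, the sketch below maps GGUF tensor names to quantization types with ordered regular-expression rules. The rules are hypothetical and simply mirror the table above, using the upper end of each range; the real config file also varies types by layer, which is not modeled here. The function name quant_type_for is an illustrative choice.

```python
import re

# Hypothetical rules mirroring the table above; the first matching rule wins.
RULES = [
    (r"\.ffn_gate_inp\.", "Q8_0"),             # router
    (r"\.ffn_(gate_up|down)_exps\.", "Q3_K"),  # fused expert tensors
    (r"\.ffn_(gate|up|down)\.", "Q5_K"),       # shared FFN (upper end of range)
    (r"\.attn_(q|k|v|output)\.", "Q4_K"),      # attention (upper end of range)
]

def quant_type_for(name):
    """Return the override for a tensor name, or None if the base preset decides."""
    for pattern, qtype in RULES:
        if re.search(pattern, name):
            return qtype
    return None

for tensor in ("blk.0.ffn_gate_inp.weight",
               "blk.12.ffn_gate_up_exps.weight",
               "blk.12.ffn_gate.weight",
               "blk.5.attn_q.weight",
               "token_embd.weight"):
    print(tensor, "->", quant_type_for(tensor))
```

The patterns are anchored on the surrounding dots so that ffn_gate does not accidentally match ffn_gate_inp or ffn_gate_up_exps.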

Result

Model size (F16):  35,206.24 MiB (16.01 BPW)
Quant size:        11,149.34 MiB (5.07 BPW)
Compression:       ~68% size reduction (11,149.34 / 35,206.24 ≈ 0.317)
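The figures above can be cross-checked with a few lines of arithmetic. The sketch uses only the two sizes and the F16 bits-per-weight value reported in this section; the MiB-to-GB conversion factor is the standard 1.048576.

```python
F16_MIB, F16_BPW = 35206.24, 16.01   # reported F16 size and bits per weight
QUANT_MIB = 11149.34                 # reported quantized size

ratio = QUANT_MIB / F16_MIB          # fraction of the original size that remains
quant_bpw = ratio * F16_BPW          # same weight count, so BPW scales with size
reduction = 1.0 - ratio
quant_gb = QUANT_MIB * 1.048576 / 1000   # MiB to decimal GB

print(f"{quant_bpw:.2f} BPW")
print(f"{reduction:.1%} smaller")
print(f"{quant_gb:.1f} GB")
```

The bits-per-weight value recovered this way agrees with the reported 5.07 BPW.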

GPU Offloading (Vulkan)

All 31 layers (the 30 transformer layers plus the output layer, which llama.cpp counts separately) were offloaded to the GPU:

llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 680M)
load_tensors: offloaded 31/31 layers to GPU
load_tensors: Vulkan0 model buffer size = 11,149.34 MiB

Requires llama.cpp compiled with -DGGML_VULKAN=ON.

Usage

llama.cpp

# Download into the current directory so the commands below can find the file
huggingface-cli download Benjamin-Wegener/gemma-4-19b-a4b-it-REAP-heretic-APEX-mini \
  --include "*.gguf" \
  --local-dir .

# Inference
llama-cli \
  -m gemma4-19b-a4b-reap-heretic-APEX-mini.gguf \
  -p "Explain quantum computing" \
  -n 512 \
  -ngl 99

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="gemma4-19b-a4b-reap-heretic-APEX-mini.gguf",
    n_gpu_layers=-1,  # All layers on GPU
    n_ctx=8192,       # Context size
    verbose=False
)

messages = [
    {"role": "user", "content": "Write a Python function for binary search."}
]

output = llm.create_chat_completion(
    messages=messages,
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    top_k=64
)

print(output["choices"][0]["message"]["content"])
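For interactive use it can be more pleasant to stream tokens as they are generated. llama-cpp-python yields OpenAI-style chunks when create_chat_completion is called with stream=True. The helper below, whose name collect_stream is an illustrative choice, joins those chunks into one string. To stay self-contained the block does not load the model; the commented lines show how it would be used with the llm and messages objects from the example above.

```python
def collect_stream(chunks, echo=True):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:  # the first chunk carries only the role, the last one is empty
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    return "".join(parts)

# With the `llm` and `messages` objects from the example above:
# stream = llm.create_chat_completion(messages=messages, max_tokens=2048, stream=True)
# reply = collect_stream(stream)
```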

OpenAI-Compatible Server

llama-server \
  -m gemma4-19b-a4b-reap-heretic-APEX-mini.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192

Then use with any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="gemma-4-19b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024
)

# Print the assistant's reply
print(response.choices[0].message.content)

Model Architecture

Property                  Value
Architecture              Gemma4ForCausalLM (MoE)
Total Parameters          ~19.02B
Active Parameters/Token   ~4B
Experts per Layer         90 (of 128, 30% removed)
Active Experts/Token      8
Transformer Layers        30
Embedding Size            2816
Vocabulary                262,144 tokens
Context Window            262,144 tokens
Sliding Window            1024 (25 layers), full attention (layers 5, 11, 17, 23, 29)
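The listed full-attention layers follow a regular pattern: every sixth layer, counting from zero. The short sketch below reproduces the layout from that rule. The period of 6 is an assumption inferred from the indices in the table, not a value read from the model config.

```python
N_LAYERS = 30
FULL_ATTENTION_EVERY = 6   # assumption inferred from the listed layer indices
SLIDING_WINDOW = 1024

def attention_kind(layer):
    """Classify a zero-based layer index as 'full' or 'sliding' attention."""
    return "full" if (layer + 1) % FULL_ATTENTION_EVERY == 0 else "sliding"

full_layers = [i for i in range(N_LAYERS) if attention_kind(i) == "full"]
sliding_count = N_LAYERS - len(full_layers)

print(full_layers)     # [5, 11, 17, 23, 29]
print(sliding_count)   # 25
```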

REAP Pruning (Pre-Quantization)

The source model was compressed using REAP (Router-weighted Expert Activation Pruning):

Metric              Original (26B)   REAP 30% (19B)
Total Parameters    ~26B             19.02B
Experts/Layer       128              90
Active Params/Tok   ~4B              ~4B
Disk Size (BF16)    ~52 GB           ~36 GB

REAP removes 30% of MoE experts (38 of 128 per layer) while preserving routing behavior.
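As a rough intuition for the selection step, the sketch below scores each expert by the average of router gate value times expert output norm over the tokens routed to it, then drops the 38 lowest-scoring experts of a layer. This is a simplified reading of the criterion in the REAP paper cited below, not the reference implementation, and the statistics are random numbers rather than real calibration data. The function names are illustrative.

```python
import random

def reap_saliency(gates, out_norms):
    """Mean of gate * output-norm over the tokens routed to one expert."""
    if not gates:
        return 0.0  # an expert that was never routed to gets the lowest score
    return sum(g * n for g, n in zip(gates, out_norms)) / len(gates)

def experts_to_keep(saliency, n_prune):
    """Indices of the experts that survive after dropping the n_prune lowest scores."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e])
    return sorted(ranked[n_prune:])

# Synthetic per-expert statistics for one layer (random, for illustration only).
rng = random.Random(0)
N_EXPERTS, N_PRUNE = 128, 38
scores = []
for _ in range(N_EXPERTS):
    n_tokens = rng.randint(1, 50)
    gates = [rng.random() for _ in range(n_tokens)]
    norms = [rng.uniform(0.5, 2.0) for _ in range(n_tokens)]
    scores.append(reap_saliency(gates, norms))

kept = experts_to_keep(scores, N_PRUNE)
print(len(kept))                      # 90
print(f"{N_PRUNE / N_EXPERTS:.1%}")   # 29.7%
```

Removing 38 of 128 experts is 29.7%, which the model name rounds to 30%.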

Heretic Abliteration (Pre-Quantization)

Uncensored behavior was achieved using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method:

Parameter                       Value
start_layer_index               14
end_layer_index                 25
preserve_good_behavior_weight   0.7884
steer_bad_behavior_weight       0.0002
overcorrect_relative_weight     1.0972
neighbor_count                  7

Result: refusals dropped from 94/100 to 6/100, with a KL divergence of only 0.0290.
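The ARA method itself is not described in this card, so the sketch below does not attempt to reproduce it. It shows the simpler rank-1 directional ablation that abliteration tools commonly build on: a unit "refusal direction" r is projected out of a weight matrix so that the layer can no longer write along r. The matrix, the direction, and the function names are all made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ablate_direction(W, r):
    """Return W' = (I - r r^T) W, so outputs of W' have no component along r.

    W is a list of rows (d_out x d_in); r is a unit vector of length d_out.
    """
    d_out, d_in = len(W), len(W[0])
    # r^T W: how strongly each input column writes into direction r.
    proj = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r[i] * proj[j] for j in range(d_in)] for i in range(d_out)]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[0.5, -1.0, 0.2],
     [1.5, 0.3, -0.7],
     [-0.4, 0.8, 1.1]]
r = normalize([1.0, 2.0, -1.0])   # hypothetical refusal direction

y = matvec(ablate_direction(W, r), [0.3, -0.2, 0.9])
component_along_r = sum(a * b for a, b in zip(r, y))
print(abs(component_along_r) < 1e-12)   # True
```

Parameters such as start_layer_index and end_layer_index in the table above suggest that Heretic applies its edit only to a band of layers, which a per-layer loop around a function like this could express.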

Important Notes

Imatrix-Based Quantization

Unlike naive quantization, APEX uses an importance matrix determined during calibration. This matrix identifies which tensors and weights are critical for model quality, enabling:

  • Tensor-specific quantization levels
  • Better quality at the same bitrate
  • MoE-specific handling of expert tensors

Note: The imatrix was generated using a diverse calibration dataset (no Wikipedia!), covering chat, code, reasoning, and tool-calling.
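A toy example may help show why importance weighting changes the result. The sketch below picks a single quantization scale for a short weight row twice: once minimizing plain squared error, and once weighting each weight's error by a made-up importance value standing in for the activation statistics an imatrix collects. It uses simple symmetric rounding to a handful of integer levels, not the actual Q3_K block format, and all numbers are invented.

```python
def quantize(row, scale, qmax=3):
    """Symmetric round-to-nearest with integer levels in [-qmax-1, qmax]."""
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in row]

def error(row, deq, weights):
    """Weighted squared error between original and dequantized values."""
    return sum(wt * (w - d) ** 2 for w, d, wt in zip(row, deq, weights))

def best_scale(row, weights, qmax=3, steps=200):
    """Grid-search the scale that minimizes the weighted squared error."""
    amax = max(abs(w) for w in row)
    candidates = [amax / qmax * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: error(row, quantize(row, s, qmax), weights))

row = [0.91, -0.12, 0.33, -0.58, 0.07, 0.45, -0.80, 0.26]
importance = [0.1, 9.0, 0.2, 0.1, 7.5, 0.3, 0.1, 0.2]   # made-up activation statistics
uniform = [1.0] * len(row)

s_plain = best_scale(row, uniform)      # calibration-free choice
s_imat = best_scale(row, importance)    # importance-aware choice

# Judge both choices by the error that matters on "real" activations.
e_plain = error(row, quantize(row, s_plain), importance)
e_imat = error(row, quantize(row, s_imat), importance)
print(e_imat <= e_plain)   # True: same candidates, but the weighted objective is optimized
```

The importance-aware scale can never do worse on the weighted error here, because both searches pick from the same candidate scales.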

Fallback Quantization

60 of 658 tensors required fallback quantization. These are primarily normalization and scaling tensors that were kept in F32.

Expected Quality

The APEX method (imatrix + tensor-specific config) is expected to deliver higher quality than standard Q3_K_M quantization without calibration. The matched config accounts for the special tensor architecture of this REAP+Heretic model:

  • Fused expert tensors: ffn_gate_up_exps, ffn_down_exps
  • Shared FFN tensors: ffn_gate, ffn_up, ffn_down
  • No attn_v on full-attention layers (5, 11, 17, 23, 29)

Acknowledgments & Sources

Citation

If you use this model in your work:

@misc{wegener2025gemma4-19b-apex,
  title={{Gemma 4 19B-A4B-IT REAP Heretic APEX Q3\_K\_M}},
  author={Wegener, Benjamin},
  year={2025},
  howpublished={\url{https://huggingface.co/Benjamin-Wegener/gemma-4-19b-a4b-it-REAP-heretic-APEX-mini}},
  note={APEX-quantized model based on REAP-pruned and Heretic-abliterated Gemma 4}
}

For the underlying REAP method:

@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}

License

This model is subject to the Gemma License of the original model. The quantization is considered a derivative work.


Created by: Benjamin-Wegener
Quantization Date: April 2025
APEX Repo: github.com/Benjamin-Wegener/apex-quant
