Instructions to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini",
	filename="gemma4-19b-a4b-REAP-heretic-APEX-mini.gguf",
)

# messages must be a list of role/content dicts, not a bare string
llm.create_chat_completion(
	messages = [
		{"role": "user", "content": "Write a Python function for binary search."}
	]
)

Local Apps

llama.cpp

How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini
# Run inference directly in the terminal:
llama-cli -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini
# Run inference directly in the terminal:
llama-cli -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini
# Run inference directly in the terminal:
./llama-cli -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Use Docker

docker model run hf.co/Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Ollama
How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with Ollama:
```
ollama run hf.co/Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini
```

Unsloth Studio

How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini to start chatting

Pi

How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Run Hermes

hermes

Docker Model Runner
How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with Docker Model Runner:
```
docker model run hf.co/Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini
```

Lemonade

How to use Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini

Run and chat with the model

lemonade run user.gemma-4-19b-e4b-it-REAP-heretic-APEX-mini-{{QUANT_TAG}}

List all available models

lemonade list

Gemma 4 19B-A4B-IT REAP Heretic — APEX Q3_K_M

APEX-quantized GGUF model of gemma-4-19b-a4b-it-REAP-heretic using importance-matrix calibration and tensor-specific quantization.

Property	Value
Source Model	coder3101/gemma-4-19b-a4b-it-REAP-heretic
Quantization	Q3_K_M with imatrix + tensor-type config
Model Size	~11 GB (F16: ~37 GB)
BPW	5.07 (Bits Per Weight)
Tensors	658 total, 60 with fallback quantization

Model Lineage

This model went through several processing stages:

google/gemma-4-26b-a4b-it (Original, 26B)
    ↓ REAP (30% Expert Pruning)
0xSero/gemma-4-19b-a4b-it-REAP (19B)
    ↓ Heretic/Abliteration (ARA)
coder3101/gemma-4-19b-a4b-it-REAP-heretic (uncensored)
    ↓ APEX Quantization (this upload)
Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini (GGUF Q3_K_M, ~11 GB)

APEX Quantization

This quantization follows the APEX approach by mudler, which optimizes MoE models through tensor-specific quantization strategies and importance-matrix calibration.

Steps Performed

1. F16 GGUF Conversion
   - Safetensors → gemma4-19b-a4b-reap-heretic-f16.gguf (36.9 GB)
   - Using convert_hf_to_gguf.py from llama.cpp
2. Importance Matrix Generation
   - Calibration with ~48,600 tokens from diverse sources:
     - Multi-turn chat (~30%)
     - Code (~25%)
     - Reasoning (~25%)
     - Tool-calling (~20%)
   - Source: apex-quant/apex_calibration_data
3. Tensor-Specific Quantization
   - Using the matched configuration file gemma4_19b_REAP_heretic_mini.txt with:
     - Q8_0: Router tensors (ffn_gate_inp)
     - Q5_K: Shared FFN (ffn_gate, ffn_up, ffn_down) in later layers
     - Q4_K: Attention tensors (attn_q, attn_k, attn_v, attn_output)
     - Q3_K: Fused expert tensors (ffn_gate_up_exps, ffn_down_exps)
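
As a rough illustration of the calibration mix in step 2, the sketch below turns the approximate shares into token budgets. It assumes the rounded percentages are exact, so the per-category counts are estimates only.

```python
# Approximate token budget per calibration category.
# Shares are the rounded percentages listed above (assumed exact here).
TOTAL_TOKENS = 48_600
SHARES_PERCENT = {
    "multi_turn_chat": 30,
    "code": 25,
    "reasoning": 25,
    "tool_calling": 20,
}

# Integer arithmetic avoids floating-point rounding noise.
budget = {name: TOTAL_TOKENS * pct // 100 for name, pct in SHARES_PERCENT.items()}

for name, tokens in budget.items():
    print(f"{name:16s} ~{tokens:,} tokens")
```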

Quantization Config Highlights

Tensor Type	Quantization	Rationale
`ffn_gate_inp` (router)	Q8_0	Router logits need high precision
`ffn_gate_up_exps`	Q3_K	Largest tensors, aggressive compression
`ffn_down_exps`	Q3_K	Largest tensors, aggressive compression
`ffn_gate/up/down` (shared)	Q4_K–Q5_K	Fewer experts, higher precision
`attn_q/k/v/output`	Q3_K–Q4_K	Attention varies by layer

Full config: gemma4_19b_REAP_heretic_mini.txt
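
The table can be read as an ordered set of name-based rules. The sketch below is a simplified illustration, not the format of gemma4_19b_REAP_heretic_mini.txt: the function `pick_quant`, the `blk.N.<name>.weight` tensor naming, and the single type chosen per group are assumptions, and the per-layer variation noted in the table is ignored.

```python
import re

# Ordered rules: (regex on tensor name, quantization type).
# Simplified from the table above; the real config also varies by layer.
RULES = [
    (re.compile(r"\.ffn_gate_inp\."), "Q8_0"),             # router
    (re.compile(r"\.ffn_(gate_up|down)_exps\."), "Q3_K"),  # fused expert tensors
    (re.compile(r"\.ffn_(gate|up|down)\."), "Q5_K"),       # shared FFN (upper end of Q4_K-Q5_K)
    (re.compile(r"\.attn_(q|k|v|output)\."), "Q4_K"),      # attention (upper end of Q3_K-Q4_K)
]
DEFAULT_TYPE = "base"  # assumption: unmatched tensors follow the base preset


def pick_quant(tensor_name: str) -> str:
    """Return the quantization type of the first rule matching the tensor name."""
    for pattern, qtype in RULES:
        if pattern.search(tensor_name):
            return qtype
    return DEFAULT_TYPE


for name in ("blk.3.ffn_gate_inp.weight", "blk.3.ffn_gate_up_exps.weight",
             "blk.20.ffn_up.weight", "blk.4.attn_output.weight"):
    print(name, "->", pick_quant(name))
```

The rules match on a full name segment, so the router tensor `ffn_gate_inp` is never mistaken for the shared `ffn_gate` tensor.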

Result

Model size (F16):  35,206.24 MiB (16.01 BPW)
Quant size:        11,149.34 MiB (5.07 BPW)
Compression:       ~68% size reduction
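
The figures above can be cross-checked with a few lines of arithmetic. This sketch assumes BPW means total file bits divided by the parameter count; under that assumption the size reduction works out to about 68%, and both rows imply a parameter count of roughly 18.4B.

```python
MIB = 1024 ** 2  # bytes per MiB

f16_mib, f16_bpw = 35_206.24, 16.01
quant_mib, quant_bpw = 11_149.34, 5.07

# Fraction of the F16 file size that quantization removes.
reduction = 1 - quant_mib / f16_mib

# Parameter count implied by each (size, BPW) pair: total bits / bits per weight.
params_from_f16 = f16_mib * MIB * 8 / f16_bpw
params_from_quant = quant_mib * MIB * 8 / quant_bpw

print(f"size reduction:      {reduction:.1%}")
print(f"params (from F16):   {params_from_f16 / 1e9:.2f}B")
print(f"params (from quant): {params_from_quant / 1e9:.2f}B")
```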

GPU Offloading (Vulkan)

All 31 layers (the 30 transformer layers plus the output layer, as llama.cpp counts them) were offloaded to the GPU:

llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 680M)
load_tensors: offloaded 31/31 layers to GPU
load_tensors: Vulkan0 model buffer size = 11,149.34 MiB

Requires llama.cpp compiled with -DGGML_VULKAN=ON.
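
To confirm full offload on other hardware, the load line can be checked programmatically. The helper below is a hypothetical sketch that only parses the log format quoted above; the function name `offload_status` is an assumption, and other llama.cpp versions may word the line differently.

```python
import re

# Matches the llama.cpp load line quoted above, e.g.
# "load_tensors: offloaded 31/31 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def offload_status(log_text: str):
    """Return (offloaded, total) from a llama.cpp log, or None if the line is absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log = """llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 680M)
load_tensors: offloaded 31/31 layers to GPU"""

status = offload_status(log)
fully = status is not None and status[0] == status[1]
print(status, "fully offloaded" if fully else "partial or unknown")
```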

Usage

llama.cpp

# Download
huggingface-cli download Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini \
  --include "*.gguf"

# Inference
llama-cli \
  -m gemma4-19b-a4b-reap-heretic-APEX-mini.gguf \
  -p "Explain quantum computing" \
  -n 512 \
  -ngl 99

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="gemma4-19b-a4b-reap-heretic-APEX-mini.gguf",
    n_gpu_layers=-1,  # All layers on GPU
    n_ctx=8192,       # Context size
    verbose=False
)

messages = [
    {"role": "user", "content": "Write a Python function for binary search."}
]

output = llm.create_chat_completion(
    messages=messages,
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    top_k=64
)

print(output["choices"][0]["message"]["content"])
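
For long generations it can be more pleasant to print tokens as they arrive. The sketch below assumes `llm` is a `llama_cpp.Llama` instance like the one created above and relies on `create_chat_completion(..., stream=True)` yielding chunks with a `delta` dict; the helper name `stream_reply` is hypothetical.

```python
def stream_reply(llm, messages, **sampling):
    """Print tokens as they arrive and return the full reply text.

    `llm` is expected to be a llama_cpp.Llama instance (or any object with a
    compatible create_chat_completion method).
    """
    parts = []
    for chunk in llm.create_chat_completion(messages=messages, stream=True, **sampling):
        delta = chunk["choices"][0]["delta"]
        text = delta.get("content")
        if text:  # some chunks carry only the role or are empty
            parts.append(text)
            print(text, end="", flush=True)
    print()
    return "".join(parts)
```

Usage would look like `stream_reply(llm, messages, max_tokens=2048, temperature=0.7)`, reusing the `llm` and `messages` objects from the example above.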

OpenAI-Compatible Server

llama-server \
  -m gemma4-19b-a4b-reap-heretic-APEX-mini.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192

Then use with any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="gemma-4-19b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024
)

# Print the assistant's reply
print(response.choices[0].message.content)
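
If the `openai` package is not available, the same endpoint can be reached with the standard library. This is a minimal sketch assuming the server started above is listening on localhost:8080; `build_chat_request` and `send` are hypothetical helper names, and only `build_chat_request` can be run without a live server.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1", max_tokens=1024):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": "gemma-4-19b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running llama-server and return the reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello!")
print(request.get_method(), request.full_url)
```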

Model Architecture

Property	Value
Architecture	Gemma4ForCausalLM (MoE)
Total Parameters	~19.02B
Active Parameters/Token	~4B
Experts per Layer	90 (of 128, 30% removed)
Active Experts/Token	8
Transformer Layers	30
Embedding Size	2816
Vocabulary	262,144 tokens
Context Window	262,144 tokens
Sliding Window	1024 (25 layers), full attention (layers 5, 11, 17, 23, 29)
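
The full-attention indices in the last row follow a regular pattern. The sketch below assumes 0-indexed layers and a period of 6, which is inferred from the listed indices rather than taken from the model config.

```python
NUM_LAYERS = 30
FULL_ATTENTION_PERIOD = 6  # assumed from the listed indices 5, 11, 17, 23, 29


def attention_type(layer: int) -> str:
    """Classify a 0-indexed layer as 'full' or 'sliding' attention."""
    is_full = layer % FULL_ATTENTION_PERIOD == FULL_ATTENTION_PERIOD - 1
    return "full" if is_full else "sliding"


full_layers = [i for i in range(NUM_LAYERS) if attention_type(i) == "full"]
sliding_layers = [i for i in range(NUM_LAYERS) if attention_type(i) == "sliding"]

print(full_layers)          # [5, 11, 17, 23, 29]
print(len(sliding_layers))  # 25
```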

REAP Pruning (Pre-Quantization)

The source model was compressed using REAP (Router-weighted Expert Activation Pruning):

Metric	Original (26B)	REAP 30% (19B)
Total Parameters	~26B	19.02B
Experts/Layer	128	90
Active Params/Tok	~4B	~4B
Disk Size (BF16)	~52 GB	~36 GB

REAP removes 30% of MoE experts (38 of 128 per layer) while preserving routing behavior.
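
The selection step can be sketched as keeping the highest-scoring experts in each layer. This is only an illustration: the REAP paper defines how saliency is derived from router weights and expert activations, whereas the scores below are placeholders and `select_experts` is a hypothetical helper.

```python
def select_experts(saliency, keep):
    """Return the indices of the `keep` highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:keep])


NUM_EXPERTS, KEEP = 128, 90

# Placeholder scores: a fixed permutation of 0..127, NOT real REAP saliency values.
saliency = [(i * 37) % NUM_EXPERTS for i in range(NUM_EXPERTS)]

kept = select_experts(saliency, KEEP)
removed = NUM_EXPERTS - len(kept)
print(f"kept {len(kept)}, removed {removed} ({removed / NUM_EXPERTS:.1%})")
# kept 90, removed 38 (29.7%)
```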

Heretic Abliteration (Pre-Quantization)

Uncensored behavior was achieved using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method:

Parameter	Value
`start_layer_index`	14
`end_layer_index`	25
`preserve_good_behavior_weight`	0.7884
`steer_bad_behavior_weight`	0.0002
`overcorrect_relative_weight`	1.0972
`neighbor_count`	7

Result: refusals dropped from 94/100 to 6/100, with a KL divergence of only 0.0290.
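
For readers unfamiliar with the metric, the sketch below shows the generic definition of KL divergence between two discrete distributions, together with the relative drop in refusals. The card does not state which distributions or direction Heretic uses, so the probabilities here are hypothetical and the result is not meant to reproduce 0.0290.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


original = [0.70, 0.20, 0.10]  # hypothetical next-token probabilities (source model)
modified = [0.65, 0.24, 0.11]  # hypothetical probabilities after abliteration

print(f"KL = {kl_divergence(original, modified):.4f}")

# Relative reduction in refusals reported above: 94/100 down to 6/100.
refusal_drop = (94 - 6) / 94
print(f"relative refusal reduction = {refusal_drop:.1%}")
```

A value near zero means the modified model's output distribution stays close to the original's on the evaluated prompts.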

Important Notes

Imatrix-Based Quantization

Unlike naive quantization, APEX uses an importance matrix determined during calibration. This matrix identifies which tensors and weights are critical for model quality, enabling:

- Tensor-specific quantization levels
- Better quality at the same bitrate
- MoE-specific handling of expert tensors

Note: The imatrix was generated using a diverse calibration dataset (no Wikipedia!), covering chat, code, reasoning, and tool-calling.
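
The effect of importance weighting can be shown with a toy example. The sketch below is not the llama.cpp k-quant algorithm: it assumes importance values act as per-weight error weights and searches a handful of candidate scales for one small block. All names (`quantize_block`, `weighted_error`, `best_scale`) and numbers are hypothetical.

```python
def quantize_block(values, scale, levels=4):
    """Round each value to a multiple of `scale`, with codes clamped to [-levels, levels - 1]."""
    out = []
    for v in values:
        code = max(-levels, min(levels - 1, round(v / scale)))
        out.append(code * scale)
    return out


def weighted_error(values, approx, importance):
    """Sum of squared rounding errors, each weighted by its importance."""
    return sum(w * (v - a) ** 2 for v, a, w in zip(values, approx, importance))


def best_scale(values, importance, candidates):
    """Pick the candidate scale with the lowest importance-weighted error."""
    return min(
        candidates,
        key=lambda s: weighted_error(values, quantize_block(values, s), importance),
    )


block = [0.9, 0.12, -0.32, 0.04]
candidates = [0.1, 0.2, 0.3]

uniform = best_scale(block, [1, 1, 1, 1], candidates)
calibrated = best_scale(block, [0.01, 10, 1, 1], candidates)
print("uniform importance picks scale", uniform)
print("calibrated importance picks scale", calibrated)
```

With uniform importance the large outlier 0.9 dominates the error and a coarse scale wins; once calibration marks the outlier as unimportant and the second weight as important, a finer scale is chosen instead.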

Fallback Quantization

60 of 658 tensors required fallback quantization. These are primarily normalization and scaling tensors that were kept in F32.

Expected Quality

The APEX method (imatrix + tensor-specific config) is expected to deliver higher quality than standard Q3_K_M quantization without calibration. The matched config accounts for the special tensor architecture of this REAP+Heretic model:

- Fused expert tensors: ffn_gate_up_exps, ffn_down_exps
- Shared FFN tensors: ffn_gate, ffn_up, ffn_down
- No attn_v on full-attention layers (5, 11, 17, 23, 29)

Acknowledgments & Sources

Original Model: google/gemma-4-26b-a4b-it
REAP Pruning: 0xSero/gemma-4-19b-a4b-it-REAP
Heretic Abliteration: coder3101/gemma-4-19b-a4b-it-REAP-heretic
APEX Quantization: mudler/apex-quant
Custom Config & Repo: Benjamin-Wegener/apex-quant
llama.cpp: ggml-org/llama.cpp
REAP Paper: arxiv.org/abs/2510.13999
Heretic: github.com/p-e-w/heretic

Citation

If you use this model in your work:

@misc{wegener2025gemma4-19b-apex,
  title={{Gemma 4 19B-A4B-IT REAP Heretic APEX Q3\_K\_M}},
  author={Wegener, Benjamin},
  year={2025},
  howpublished={\url{https://huggingface.co/Benjamin-Wegener/gemma-4-19b-e4b-it-REAP-heretic-APEX-mini}},
  note={APEX-quantized model based on REAP-pruned and Heretic-abliterated Gemma 4}
}

For the underlying REAP method:

@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}

License

This model is subject to the Gemma License of the original model. The quantization is considered a derivative work.

Created by: Benjamin-Wegener
Quantization Date: April 2025
APEX Repo: github.com/Benjamin-Wegener/apex-quant
