Gemma 4 19B-A4B-IT REAP Heretic — APEX Q3_K_M
APEX-quantized GGUF model of gemma-4-19b-a4b-it-REAP-heretic using importance-matrix calibration and tensor-specific quantization.
| Property | Value |
|---|---|
| Source Model | coder3101/gemma-4-19b-a4b-it-REAP-heretic |
| Quantization | Q3_K_M with imatrix + tensor-type config |
| Model Size | ~11 GB (F16: ~37 GB) |
| BPW | 5.07 (Bits Per Weight) |
| Tensors | 658 total, 60 with fallback quantization |
Model Lineage
This model went through several processing stages:
google/gemma-4-26b-a4b-it (Original, 26B)
↓ REAP (30% Expert Pruning)
0xSero/gemma-4-19b-a4b-it-REAP (19B)
↓ Heretic/Abliteration (ARA)
coder3101/gemma-4-19b-a4b-it-REAP-heretic (uncensored)
↓ APEX Quantization (this upload)
Benjamin-Wegener/gemma-4-19b-a4b-it-REAP-heretic-APEX-mini (GGUF Q3_K_M, ~11 GB)
APEX Quantization
This quantization follows the APEX approach by mudler, which optimizes MoE models through tensor-specific quantization strategies and importance-matrix calibration.
Steps Performed
1. F16 GGUF Conversion
   - Safetensors → `gemma4-19b-a4b-reap-heretic-f16.gguf` (36.9 GB)
   - Using `convert_hf_to_gguf.py` from llama.cpp
2. Importance Matrix Generation
   - Calibration with ~48,600 tokens from diverse sources:
     - Multi-turn chat (~30%)
     - Code (~25%)
     - Reasoning (~25%)
     - Tool-calling (~20%)
   - Source: apex-quant/apex_calibration_data
3. Tensor-Specific Quantization
   - Using the matched configuration file `gemma4_19b_REAP_heretic_mini.txt` with:
     - Q8_0: Router tensors (`ffn_gate_inp`)
     - Q5_K: Shared FFN (`ffn_gate`, `ffn_up`, `ffn_down`) in later layers
     - Q4_K: Attention tensors (`attn_q`, `attn_k`, `attn_v`, `attn_output`)
     - Q3_K: Fused expert tensors (`ffn_gate_up_exps`, `ffn_down_exps`)
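The calibration mix from the importance-matrix step can be turned into approximate per-category token counts. This is a small sketch based only on the rounded figures listed above; the dictionary keys and variable names are illustrative, and the real split of the dataset may differ slightly.

```python
# Approximate token budget per calibration category, derived from the
# rounded figures in the list above (~48,600 tokens in total).
TOTAL_TOKENS = 48_600
MIX_PERCENT = {
    "multi_turn_chat": 30,
    "code": 25,
    "reasoning": 25,
    "tool_calling": 20,
}

# Integer arithmetic keeps the counts exact for these round percentages.
token_budget = {
    name: TOTAL_TOKENS * percent // 100 for name, percent in MIX_PERCENT.items()
}

for name, tokens in token_budget.items():
    print(f"{name:16s} ~{tokens:,} tokens")
```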
Quantization Config Highlights
| Tensor Type | Quantization | Rationale |
|---|---|---|
| `ffn_gate_inp` (router) | Q8_0 | Router logits need high precision |
| `ffn_gate_up_exps` | Q3_K | Largest tensors, aggressive compression |
| `ffn_down_exps` | Q3_K | Largest tensors, aggressive compression |
| `ffn_gate/up/down` (shared) | Q4_K–Q5_K | Fewer experts, higher precision |
| `attn_q/k/v/output` | Q3_K–Q4_K | Attention varies by layer |
Full config: gemma4_19b_REAP_heretic_mini.txt
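The idea of tensor-specific quantization can be sketched as a lookup from tensor name to quantization type. The rule table below is hypothetical: it is not the format of `gemma4_19b_REAP_heretic_mini.txt`, it assumes GGUF-style names such as `blk.7.attn_q.weight`, and it uses a single type per tensor family even though the actual config varies some of them by layer.

```python
import re

# Hypothetical rule table: (pattern on the tensor name, quantization type).
# The real config varies attention and shared-FFN types by layer; this
# sketch picks one type per family for simplicity.
RULES = [
    (re.compile(r"\.ffn_gate_inp\."), "Q8_0"),            # router
    (re.compile(r"\.ffn_(gate_up|down)_exps\."), "Q3_K"),  # fused expert tensors
    (re.compile(r"\.ffn_(gate|up|down)\."), "Q5_K"),       # shared FFN
    (re.compile(r"\.attn_(q|k|v|output)\."), "Q4_K"),      # attention
]


def pick_quant_type(tensor_name):
    """Return the quantization type for a tensor, or None for the base type."""
    for pattern, quant_type in RULES:
        if pattern.search(tensor_name):
            return quant_type
    return None


for name in (
    "blk.0.ffn_gate_inp.weight",
    "blk.7.ffn_gate_up_exps.weight",
    "blk.7.ffn_gate.weight",
    "blk.7.attn_output.weight",
    "token_embd.weight",
):
    print(name, "->", pick_quant_type(name))
```

The patterns anchor on the dots around each name component, so `ffn_gate` does not accidentally match `ffn_gate_inp` or `ffn_gate_up_exps`.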
Result
Model size (F16): 35,206.24 MiB (16.01 BPW)
Quant size: 11,149.34 MiB (5.07 BPW)
Compression: ~68% size reduction (the quantized file is about 32% of the F16 size)
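The figures above can be cross-checked with a few lines of arithmetic. This sketch assumes that bits per weight scales linearly with file size, which holds when both files contain the same number of weights; the variable names are illustrative.

```python
# Sizes and BPW as reported in the Result section above.
f16_mib = 35_206.24
quant_mib = 11_149.34
f16_bpw = 16.01

ratio = quant_mib / f16_mib            # fraction of the F16 size that remains
reduction_percent = (1 - ratio) * 100  # size reduction in percent
estimated_bpw = f16_bpw * ratio        # BPW implied by the size ratio

print(f"size ratio:     {ratio:.3f}")
print(f"size reduction: {reduction_percent:.1f}%")
print(f"estimated BPW:  {estimated_bpw:.2f}")
```

The estimated BPW comes out at 5.07, matching the reported value, and the size reduction is roughly 68%.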
GPU Offloading (Vulkan)
All 31 layers reported by llama.cpp (the 30 transformer layers plus the output layer) were offloaded to the GPU:
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 680M)
load_tensors: offloaded 31/31 layers to GPU
load_tensors: Vulkan0 model buffer size = 11,149.34 MiB
Requires llama.cpp compiled with -DGGML_VULKAN=ON.
Usage
llama.cpp
# Download the GGUF file into the current directory
huggingface-cli download Benjamin-Wegener/gemma-4-19b-a4b-it-REAP-heretic-APEX-mini \
  --include "*.gguf" \
  --local-dir .
# Inference
llama-cli \
  -m gemma4-19b-a4b-reap-heretic-APEX-mini.gguf \
  -p "Explain quantum computing" \
  -n 512 \
  -ngl 99
Python (llama-cpp-python)
from llama_cpp import Llama

llm = Llama(
    model_path="gemma4-19b-a4b-reap-heretic-APEX-mini.gguf",
    n_gpu_layers=-1,  # All layers on GPU
    n_ctx=8192,       # Context size
    verbose=False,
)

messages = [
    {"role": "user", "content": "Write a Python function for binary search."}
]

output = llm.create_chat_completion(
    messages=messages,
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    top_k=64,
)
print(output["choices"][0]["message"]["content"])
OpenAI-Compatible Server
llama-server \
-m gemma4-19b-a4b-reap-heretic-APEX-mini.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 8192
Then use with any OpenAI-compatible client:
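from openai import OpenAI

# The local server does not check the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="gemma-4-19b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)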
Model Architecture
| Property | Value |
|---|---|
| Architecture | Gemma4ForCausalLM (MoE) |
| Total Parameters | ~19.02B |
| Active Parameters/Token | ~4B |
| Experts per Layer | 90 (of 128, 30% removed) |
| Active Experts/Token | 8 |
| Transformer Layers | 30 |
| Embedding Size | 2816 |
| Vocabulary | 262,144 tokens |
| Context Window | 262,144 tokens |
| Sliding Window | 1024 (25 layers), full attention (layers 5, 11, 17, 23, 29) |
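The attention layout in the last table row can be written out per layer. The every-sixth-layer rule below is an assumption inferred from the listed indices (5, 11, 17, 23, 29), not a value read from the model config, and the names are illustrative.

```python
N_LAYERS = 30
SLIDING_WINDOW = 1024
FULL_ATTENTION_EVERY = 6  # assumption inferred from indices 5, 11, 17, 23, 29


def attention_type(layer_index):
    """Classify a zero-based layer index as full or sliding-window attention."""
    if (layer_index + 1) % FULL_ATTENTION_EVERY == 0:
        return "full"
    return "sliding"


layer_types = [attention_type(i) for i in range(N_LAYERS)]
full_layers = [i for i, kind in enumerate(layer_types) if kind == "full"]
sliding_count = layer_types.count("sliding")

print("full attention layers:", full_layers)
print("sliding-window layers:", sliding_count, f"(window {SLIDING_WINDOW})")
```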
REAP Pruning (Pre-Quantization)
The source model was compressed using REAP (Router-weighted Expert Activation Pruning):
| Metric | Original (26B) | REAP 30% (19B) |
|---|---|---|
| Total Parameters | ~26B | 19.02B |
| Experts/Layer | 128 | 90 |
| Active Params/Tok | ~4B | ~4B |
| Disk Size (BF16) | ~52 GB | ~36 GB |
REAP removes 30% of MoE experts (38 of 128 per layer) while preserving routing behavior.
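The selection step can be illustrated with a toy example. This is only a sketch: it assumes a per-expert saliency score has already been computed (here replaced by random numbers), and the function name and scoring are hypothetical rather than taken from the REAP code.

```python
import random


def select_experts_to_keep(saliency, prune_percent=30):
    """Drop the lowest-scoring experts and return the indices that remain."""
    n_experts = len(saliency)
    n_prune = n_experts * prune_percent // 100  # 128 experts -> 38 pruned
    ranked = sorted(range(n_experts), key=lambda j: saliency[j])  # lowest first
    pruned = set(ranked[:n_prune])
    return [j for j in range(n_experts) if j not in pruned]


# Placeholder scores for one layer with 128 experts.
rng = random.Random(0)
scores = [rng.random() for _ in range(128)]

kept = select_experts_to_keep(scores)
print(f"kept {len(kept)} of {len(scores)} experts")
```

With 128 experts and a 30% target, integer division prunes 38 experts and leaves 90, which matches the table above.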
Heretic Abliteration (Pre-Quantization)
Uncensored behavior was achieved using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method:
| Parameter | Value |
|---|---|
| `start_layer_index` | 14 |
| `end_layer_index` | 25 |
| `preserve_good_behavior_weight` | 0.7884 |
| `steer_bad_behavior_weight` | 0.0002 |
| `overcorrect_relative_weight` | 1.0972 |
| `neighbor_count` | 7 |
Result: Refusals dropped from 94/100 to 6/100, with a KL divergence of only 0.0290.
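For readers unfamiliar with the metric, KL divergence measures how far one probability distribution drifts from another; a value near zero means the modified model's token probabilities stay close to the original's. The sketch below shows the textbook formula on made-up distributions. How Heretic evaluates it (direction, prompts, averaging) is not described here, so treat the details as assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


# Made-up next-token distributions for illustration only.
original = [0.70, 0.20, 0.10]
modified = [0.65, 0.25, 0.10]

print(f"KL(original || modified) = {kl_divergence(original, modified):.4f}")
print(f"KL(original || original) = {kl_divergence(original, original):.4f}")
```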
Important Notes
Imatrix-Based Quantization
Unlike naive quantization, APEX uses an importance matrix determined during calibration. This matrix identifies which tensors and weights are critical for model quality, enabling:
- Tensor-specific quantization levels
- Better quality at the same bitrate
- MoE-specific handling of expert tensors
Note: The imatrix was generated using a diverse calibration dataset (no Wikipedia!), covering chat, code, reasoning, and tool-calling.
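The effect of an importance matrix can be illustrated with a toy quantizer that picks the scale minimizing an importance-weighted error. This is a simplified sketch, not the llama.cpp implementation: real K-quants use block structures with sub-block scales, and the function names, the 3-bit grid, and the importance values below are all assumptions chosen for illustration.

```python
def dequantized(weights, scale, bits):
    """Round each weight to a signed integer grid and map it back to floats."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [scale * max(qmin, min(qmax, round(w / scale))) for w in weights]


def weighted_error(weights, importance, scale, bits):
    """Importance-weighted squared reconstruction error for one scale."""
    approx = dequantized(weights, scale, bits)
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, importance))


def best_scale(weights, importance, bits=3, n_candidates=64):
    """Search a small grid of scales and keep the one with the lowest error."""
    max_abs = max(abs(w) for w in weights)
    base = max_abs / (2 ** (bits - 1) - 1)
    candidates = [base * (0.5 + k / n_candidates) for k in range(n_candidates + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s, bits))


row = [0.02, -0.75, 0.31, 1.20, -0.05, 0.48]
uniform = [1.0] * len(row)                      # naive: every weight counts equally
imatrix_like = [0.1, 5.0, 0.1, 0.1, 0.1, 5.0]   # made-up importance values

scale_plain = best_scale(row, uniform)
scale_weighted = best_scale(row, imatrix_like)

print("error on important weights, naive scale:   ",
      weighted_error(row, imatrix_like, scale_plain, 3))
print("error on important weights, weighted scale:",
      weighted_error(row, imatrix_like, scale_weighted, 3))
```

Because both searches run over the same candidate scales, the importance-aware choice can never do worse than the naive one when the error is measured with the importance weights.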
Fallback Quantization
60 of 658 tensors required fallback quantization. These are primarily normalization and scaling tensors that were kept in F32.
Expected Quality
The APEX method (imatrix + tensor-specific config) is expected to deliver higher quality than standard Q3_K_M quantization without calibration. The matched config accounts for the special tensor architecture of this REAP+Heretic model:
- Fused expert tensors: `ffn_gate_up_exps`, `ffn_down_exps`
- Shared FFN tensors: `ffn_gate`, `ffn_up`, `ffn_down`
- No `attn_v` on full-attention layers (5, 11, 17, 23, 29)
Acknowledgments & Sources
- Original Model: google/gemma-4-26b-a4b-it
- REAP Pruning: 0xSero/gemma-4-19b-a4b-it-REAP
- Heretic Abliteration: coder3101/gemma-4-19b-a4b-it-REAP-heretic
- APEX Quantization: mudler/apex-quant
- Custom Config & Repo: Benjamin-Wegener/apex-quant
- llama.cpp: ggml-org/llama.cpp
- REAP Paper: arxiv.org/abs/2510.13999
- Heretic: github.com/p-e-w/heretic
Citation
If you use this model in your work:
@misc{wegener2025gemma4-19b-apex,
title={{Gemma 4 19B-A4B-IT REAP Heretic APEX Q3\_K\_M}},
author={Wegener, Benjamin},
year={2025},
howpublished={\url{https://huggingface.co/Benjamin-Wegener/gemma-4-19b-a4b-it-REAP-heretic-APEX-mini}},
note={APEX-quantized model based on REAP-pruned and Heretic-abliterated Gemma 4}
}
For the underlying REAP method:
@inproceedings{lasby2025reap,
title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
author={Lasby, Mike and others},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://arxiv.org/abs/2510.13999}
}
License
This model is subject to the Gemma License of the original model. The quantization is considered a derivative work.
Created by: Benjamin-Wegener
Quantization Date: April 2025
APEX Repo: github.com/Benjamin-Wegener/apex-quant