Instructions to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF", filename="Qwen3.6-27B-UDT-Q3_K_XL_MTP.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP # Run inference directly in the terminal: llama-cli -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP # Run inference directly in the terminal: llama-cli -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP # Run inference directly in the terminal: ./llama-cli -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP # Run inference directly in the terminal: ./build/bin/llama-cli -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Use Docker
docker model run hf.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
- LM Studio
- Jan
- Ollama
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with Ollama:
ollama run hf.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
- Unsloth Studio new
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF to start chatting
- Pi new
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Run Hermes
hermes
- Docker Model Runner
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with Docker Model Runner:
docker model run hf.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
- Lemonade
How to use AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL_MTP
Run and chat with the model
lemonade run user.Qwen3.6-27B-UDT-MTP-GGUF-Q4_K_XL_MTP
List all available models
lemonade list
Qwen 3.6 27B — UDT MTP GGUF
UDT (UD-Turbo) dynamic-imatrix quants of Qwen 3.6 27B (dense), built on top of
atomic-llama-cpp-turboquant— allama.cppfork with TurboQuant WHT-rotated KV cache + shared-model NextN speculative decoding.
These are combined *_MTP.gguf files: the NextN auxiliary head ships inside the same GGUF as the target weights, so you point --model-draft at the same file and the server reuses the loaded llama_model (no second mmap, no second tokenizer).
UDT is not the same artifact as Unsloth's UD-* — it denotes our mask line on top of the same public MTP-aware imatrix.
Files
Quantized weights (combined _MTP.gguf, NextN head included)
| File | Bits | Size | wikitext-2 PPL |
|---|---|---|---|
Qwen3.6-27B-UDT-Q3_K_XL_MTP.gguf |
~3.4 | 15.7 GiB | 7.056 ± 0.047 |
Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf |
~4.5 | 17.7 GiB | 6.972 ± 0.046 |
Qwen3.6-27B-UDT-Q5_K_XL_MTP.gguf |
~5.5 | 19.6 GiB | 6.896 ± 0.045 |
Qwen3.6-27B-UDT-Q6_K_MTP.gguf |
~6.5 | 21.6 GiB | 6.929 ± 0.046 |
Qwen3.6-27B-UDT-Q8_K_XL_MTP.gguf |
~8.0 | 25.5 GiB | (≈ BF16 reference) |
PPL measured with llama-perplexity over wikitext-2-raw/wiki.test.raw, 580 chunks, n_ctx=512, NVIDIA H100. Recommended quant: Q4_K_XL — best PPL/size + smallest acceptable size to fit on a 24–32 GB GPU with TurboQuant3 KV. Use Q8_K_XL for near-lossless quality.
Multimodal projector (vision) — pass via --mmproj
| File | Size | Notes |
|---|---|---|
mmproj-F16.gguf |
0.87 GiB | recommended default |
mmproj-BF16.gguf |
0.87 GiB | identical accuracy, BF16 storage |
The projector is mirrored verbatim from unsloth/Qwen3.6-27B-MTP-GGUF — no changes from this repo, re-hosted for convenience so you can grab everything in one -hf line.
Importance matrix
| File | Size | Source |
|---|---|---|
imatrix_unsloth.gguf_file |
13 MiB | unsloth/Qwen3.6-27B-MTP-GGUF (MTP-aware, 77 chunks) — re-hosted for reproducibility, all credit to Unsloth |
What's special
UDT applies three layers on top of plain llama-quantize -tt:
- MTP-aware imatrix — the
imatrix_unsloth.gguf_filefrom Unsloth'sQwen3.6-27B-MTP-GGUFrepo, calibrated with the NextN head active. - NEXTN-preserve mask — every
blk.*.nextn.*andmtp.*tensor pinned toQ8_0. Cost: ~0 PPL, ~10 MB; gain: higher draft acceptance with speculative decoding. - TurboQuant3-friendly mask — attention Q/K (
attn_q/attn_k) bumped toQ6_Kto absorb the noise introduced by 3-bit KV compression (-ctk turbo3 -ctv turbo3).
The combined mask is scripts/quantize-masks/qwen36-ud-v3-combined.txt in the repo. Variant tags -V1, -V2, -base correspond to NEXTN-only, TurboQuant3-only, and the imatrix-only baseline; release files use the combined V3 mask.
Bench (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal, single slot)
A/B against unsloth/Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL. Median TPS over 2 runs, --draft-max 2 --draft-min 1.
| mode | reference UD-Q4_K_XL |
UDT-Q4_K_XL | Δ tps | accept |
|---|---|---|---|---|
f16-base (n=128 / n=512) |
21.38 / 20.85 | 21.09 / 20.86 | ~equal | — |
turbo3-base |
20.18 / 20.02 | 19.57 / 19.20 | −3% | — |
f16-nextn |
25.00 / 23.22 | 23.49 / 22.89 | −3 / −1% | 91.0 / 87.5 % |
turbo3-nextn |
21.93 / 20.60 | 23.32 / 21.78 | +6 / +6% | 95.4 / 84.8 % |
UDT wins the recommended turbo3-nextn mode by +6 % tps and +10 pp acceptance (short) — the combination this mask was designed for.
Quick start
# llama.cpp build needs TurboQuant + NextN patches:
# https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
# 1) recommended: NextN + TurboQuant3 KV
llama-server \
-hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
-hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
--spec-type nextn --draft-max 2 --draft-min 1 \
-c 8192 -ngl 99 -ngld 99 -fa on \
-ctk turbo3 -ctv turbo3
# 2) or with a local file (point -m and -md at the same GGUF)
llama-server \
-m ./Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf \
-md ./Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf \
--spec-type nextn --draft-max 2 --draft-min 1 \
-c 8192 -ngl 99 -ngld 99 -fa on \
-ctk turbo3 -ctv turbo3
Helper script in the repo: scripts/run-qwen36-27b-nextn-server.sh.
Vision (multimodal)
Pass the projector with --mmproj:
llama-server \
-m ./Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf \
--mmproj ./mmproj-F16.gguf \
-c 8192 -ngl 99 -fa on
Reproduce
From a Unsloth BF16 MTP shard + Unsloth imatrix:
git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
cmake -B build -DGGML_CUDA=1 && cmake --build build -j --target llama-quantize
# download BF16 + imatrix to .scratch/qwen-ud-sources/27b/
bash scripts/qwen-udt/hf-download-sources.sh
# quantize Q4_K_XL with the V3 (release) mask
./scripts/quantize-qwen-udt.sh 27b Q4_K_M v3
Mask files: scripts/quantize-masks/qwen36-ud-{base,v1-nextn,v2-turbo3,v3-combined}.txt. Full runbook: docs/qwen-udt/RUNBOOK.md.
Credits & license
- Qwen team (Qwen/Qwen3.6-27B) — base weights, Apache-2.0.
- Unsloth (
unsloth/Qwen3.6-27B-MTP-GGUF) — MTP-awareimatrix_unsloth.gguf_fileand BF16 MTP source GGUFs. Huge thanks to the Unsloth team for releasing these public artifacts that made UDT possible. - @TheTom (
TheTom/llama-cpp-turboquant) — original TurboQuant WHT-rotated quantization design. - AtomicChat — UDT mask recipes, NextN shared-model integration, benches, packaging. Repo:
AtomicBot-ai/atomic-llama-cpp-turboquant.
License: Apache-2.0 (inherits from the upstream Qwen 3.6 weights).
- Downloads last month
- 2,392
3-bit
4-bit
5-bit
6-bit
8-bit
Model tree for AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF
Base model
Qwen/Qwen3.6-27B