Qwen 3.6 27B — UDT MTP GGUF

UDT (UD-Turbo) dynamic-imatrix quants of Qwen 3.6 27B (dense), built on top of atomic-llama-cpp-turboquant — a llama.cpp fork with TurboQuant WHT-rotated KV cache + shared-model NextN speculative decoding.

These are combined *_MTP.gguf files: the NextN auxiliary head ships inside the same GGUF as the target weights, so you point --model-draft at the same file and the server reuses the loaded llama_model (no second mmap, no second tokenizer).

UDT is not the same artifact as Unsloth's UD-* — it denotes our mask line on top of the same public MTP-aware imatrix.


Files

Quantized weights (combined _MTP.gguf, NextN head included)

File Bits Size wikitext-2 PPL
Qwen3.6-27B-UDT-Q3_K_XL_MTP.gguf ~3.4 15.7 GiB 7.056 ± 0.047
Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf ~4.5 17.7 GiB 6.972 ± 0.046
Qwen3.6-27B-UDT-Q5_K_XL_MTP.gguf ~5.5 19.6 GiB 6.896 ± 0.045
Qwen3.6-27B-UDT-Q6_K_MTP.gguf ~6.5 21.6 GiB 6.929 ± 0.046
Qwen3.6-27B-UDT-Q8_K_XL_MTP.gguf ~8.0 25.5 GiB (≈ BF16 reference)

PPL measured with llama-perplexity over wikitext-2-raw/wiki.test.raw, 580 chunks, n_ctx=512, NVIDIA H100. Recommended quant: Q4_K_XL — best PPL/size + smallest acceptable size to fit on a 24–32 GB GPU with TurboQuant3 KV. Use Q8_K_XL for near-lossless quality.

Multimodal projector (vision) — pass via --mmproj

File Size Notes
mmproj-F16.gguf 0.87 GiB recommended default
mmproj-BF16.gguf 0.87 GiB identical accuracy, BF16 storage

The projector is mirrored verbatim from unsloth/Qwen3.6-27B-MTP-GGUF — no changes from this repo, re-hosted for convenience so you can grab everything in one -hf line.

Importance matrix

File Size Source
imatrix_unsloth.gguf_file 13 MiB unsloth/Qwen3.6-27B-MTP-GGUF (MTP-aware, 77 chunks) — re-hosted for reproducibility, all credit to Unsloth

What's special

UDT applies three layers on top of plain llama-quantize -tt:

  1. MTP-aware imatrix — the imatrix_unsloth.gguf_file from Unsloth's Qwen3.6-27B-MTP-GGUF repo, calibrated with the NextN head active.
  2. NEXTN-preserve mask — every blk.*.nextn.* and mtp.* tensor pinned to Q8_0. Cost: ~0 PPL, ~10 MB; gain: higher draft acceptance with speculative decoding.
  3. TurboQuant3-friendly mask — attention Q/K (attn_q/attn_k) bumped to Q6_K to absorb the noise introduced by 3-bit KV compression (-ctk turbo3 -ctv turbo3).

The combined mask is scripts/quantize-masks/qwen36-ud-v3-combined.txt in the repo. Variant tags -V1, -V2, -base correspond to NEXTN-only, TurboQuant3-only, and the imatrix-only baseline; release files use the combined V3 mask.


Bench (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal, single slot)

A/B against unsloth/Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL. Median TPS over 2 runs, --draft-max 2 --draft-min 1.

mode reference UD-Q4_K_XL UDT-Q4_K_XL Δ tps accept
f16-base (n=128 / n=512) 21.38 / 20.85 21.09 / 20.86 ~equal
turbo3-base 20.18 / 20.02 19.57 / 19.20 −3%
f16-nextn 25.00 / 23.22 23.49 / 22.89 −3 / −1% 91.0 / 87.5 %
turbo3-nextn 21.93 / 20.60 23.32 / 21.78 +6 / +6% 95.4 / 84.8 %

UDT wins the recommended turbo3-nextn mode by +6 % tps and +10 pp acceptance (short) — the combination this mask was designed for.


Quick start

# llama.cpp build needs TurboQuant + NextN patches:
#   https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant

# 1) recommended: NextN + TurboQuant3 KV
llama-server \
  -hf  AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
  -hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
  --spec-type nextn --draft-max 2 --draft-min 1 \
  -c 8192 -ngl 99 -ngld 99 -fa on \
  -ctk turbo3 -ctv turbo3

# 2) or with a local file (point -m and -md at the same GGUF)
llama-server \
  -m  ./Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf \
  -md ./Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf \
  --spec-type nextn --draft-max 2 --draft-min 1 \
  -c 8192 -ngl 99 -ngld 99 -fa on \
  -ctk turbo3 -ctv turbo3

Helper script in the repo: scripts/run-qwen36-27b-nextn-server.sh.

Vision (multimodal)

Pass the projector with --mmproj:

llama-server \
  -m  ./Qwen3.6-27B-UDT-Q4_K_XL_MTP.gguf \
  --mmproj ./mmproj-F16.gguf \
  -c 8192 -ngl 99 -fa on

Reproduce

From a Unsloth BF16 MTP shard + Unsloth imatrix:

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
cmake -B build -DGGML_CUDA=1 && cmake --build build -j --target llama-quantize

# download BF16 + imatrix to .scratch/qwen-ud-sources/27b/
bash scripts/qwen-udt/hf-download-sources.sh

# quantize Q4_K_XL with the V3 (release) mask
./scripts/quantize-qwen-udt.sh 27b Q4_K_M v3

Mask files: scripts/quantize-masks/qwen36-ud-{base,v1-nextn,v2-turbo3,v3-combined}.txt. Full runbook: docs/qwen-udt/RUNBOOK.md.


Credits & license

License: Apache-2.0 (inherits from the upstream Qwen 3.6 weights).

Downloads last month
2,392
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(344)
this model

Collection including AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF