Qwen3.6-35B-A3B — fraQtl Compressed

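A weight-compressed Qwen3.6-35B-A3B artifact. It shrinks the checkpoint from ~70 GB (FP16) to 25.53 GB on disk and fits 128K-token context on a single A100-80GB, while staying close to the FP16 reference on the benchmarks reported below (GSM8K-CoT is the exception; see Calibration Corpus and Task-Aware Variants).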

This release is the public weight-compressed artifact. fraQtl runtime KV-cache compression is a separate early-access layer.

For production inference pilots, custom compression workflows, and newer artifacts, see fraqtl.ai/contact or email contact@fraqtl.ai.

⚠️ Required: install `fraqtl-runtime` first

Without `fraqtl-runtime`, this model will not load. The wheel ships the compiled loader that decodes the packed compressed weights. There is no Python-level "enable" call to make: install the wheel, then `from_pretrained` works.

pip install fraqtl-runtime

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "fraQtl/Qwen3.6-35B-A3B-compressed"
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True,
    torch_dtype=torch.bfloat16, device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)

ids = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
# → "The capital of France is Paris, a city renowned for..."

That's the entire setup. Compression is part of how the artifact is stored, not a runtime mode you toggle.

With vs Without `fraqtl-runtime`

Mode	What happens
Without `fraqtl-runtime` installed	`from_pretrained(..., trust_remote_code=True)` raises `ImportError`. Model does not load.
With `fraqtl-runtime` installed	Model loads normally via `from_pretrained`. Compression is built into the loader — no manual API calls needed. You get 2.74× smaller weights (25.53 GB on disk vs ~70 GB FP16) and 128K context on a single A100-80GB.

Why a wheel: the loader reconstructs INT3-packed MoE expert tensors back into the form `transformers` expects. That code path ships as a compiled binary (`.so`), so users never write `import fraqtl` themselves: `trust_remote_code=True` pulls a small stub from this repo that calls into the wheel's loader behind the scenes.
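To make "INT3-packed" concrete, here is a toy sketch of 3-bit packing and unpacking in plain Python. It is an illustration only: the bit layout, the function names `pack_int3` and `unpack_int3`, and the absence of scales or zero-points are all assumptions. The actual fraQtl format and compiled loader are not published in this repository.

```python
def pack_int3(codes):
    """Pack integers in [0, 7] into bytes, 3 bits each (hypothetical little-endian layout)."""
    acc = 0
    for i, code in enumerate(codes):
        if not 0 <= code < 8:
            raise ValueError(f"code {code} does not fit in 3 bits")
        acc |= code << (3 * i)          # place each code at its 3-bit slot
    n_bytes = (3 * len(codes) + 7) // 8  # round up to whole bytes
    return acc.to_bytes(n_bytes, "little")


def unpack_int3(packed, count):
    """Recover `count` 3-bit codes from bytes produced by pack_int3."""
    acc = int.from_bytes(packed, "little")
    return [(acc >> (3 * i)) & 0b111 for i in range(count)]


codes = [0, 7, 3, 5, 1, 2, 6, 4]
packed = pack_int3(codes)
print(len(packed))                        # 8 codes * 3 bits = 24 bits = 3 bytes
print(unpack_int3(packed, len(codes)) == codes)
```

Eight 3-bit codes fit in three bytes, versus sixteen bytes for the same eight values stored as FP16. A real low-bit format also carries per-group scale metadata, which this sketch leaves out.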

Installation

This model requires one extra package before you can load it from transformers:

pip install fraqtl-runtime

Why? The compressed weights need a custom loader that reconstructs the packed MoE expert tensors at load time. That loader ships in the `fraqtl-runtime` package as a compiled binary. You don't import it yourself: when you call `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)`, `transformers` downloads a tiny stub from this repo that calls into the compiled loader behind the scenes. If the package is missing, you get a clear `ImportError` pointing you back here.

Free to install — no license token required for loading or inference. A token is only required for the optional runtime KV-cache compression layer (separate paid feature, see section below).

Platform: Linux x86_64 + CUDA validated. macOS / ARM coming.

Once installed, use the standard transformers API as shown in the Quick Start below.

Headline Results

Metric	Upstream / Reference	fraQtl Compressed	Notes
Size on disk	~70 GB	25.53 GB	2.74× smaller
Peak VRAM @ 16K context	—	25.6 GB	single A100-80GB
Peak VRAM @ 64K context	OOM in tested stock setup	36.8 GB	stock setup exceeded VRAM
Peak VRAM @ 128K context	OOM in tested stock setup	51.7 GB	compressed fits
∞Bench Passkey @ ~125K tok	—	30/30 = 100%	compressed run
∞Bench KV-retrieval†	matched FP16 reference	matched on n=1/10/100/500/1500	80% (4/5) at every pair count, identical to FP16
WikiText-2 PPL	11.6041	11.6930	+0.77%
MMLU (full 14K, 57 subjects)	82.40% (FP16 reference)	82.24%	−0.16pp vs FP16
HumanEval pass@1 (N=164)	61.59% (FP16 reference)	64.02%	tied within sample variance
HumanEval+ pass@1 (N=164)	54.27% (FP16 reference)	55.49%	tied within sample variance
BigCodeBench (N=200)	39.50% (FP16 reference)	39.50%	identical pass count at this N
TruthfulQA-MC2 (N=500)	~0.4553 (FP16 reference)	0.4493	tie within ±0.006

All measurements on a single NVIDIA A100-80GB. Single-seed sanity grade; multi-seed re-verification in progress.

† ∞Bench KV-retrieval diagnostic: matched FP16 reference behavior on n=1, 10, 100, 500, and 1500 needle settings (~101K-token context at n=1500). Compressed accuracy = 80% (4/5) at every pair count, identical to FP16. The single failing case reproduces at FP16 with the same example seed — it's a property of one specific example, not a depth/compression artifact. Peak VRAM at n=1500: 48.7 GB on a single A100-80GB.

Summary: ~2.74× smaller on disk, 128K context on a single A100-80GB (stock FP16 OOMs at 64K), MMLU within 0.16pp of FP16 and HumanEval within sample variance of FP16 baseline.
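The three peak-VRAM measurements in the table are close to linear in context length (about 0.23 GB per additional 1K tokens on both segments). The sketch below interpolates between them to give a rough estimate for intermediate context lengths. The function name `estimate_peak_vram_gb` is hypothetical. The estimate assumes the same single-sequence setup as the measurements and is not a substitute for profiling your own workload.

```python
# Peak VRAM (GB) from the Headline Results table, keyed by context length in K tokens.
MEASURED = [(16, 25.6), (64, 36.8), (128, 51.7)]


def estimate_peak_vram_gb(context_k):
    """Linearly interpolate peak VRAM between the measured context lengths."""
    points = sorted(MEASURED)
    if not points[0][0] <= context_k <= points[-1][0]:
        raise ValueError("outside the measured 16K-128K range; refusing to extrapolate")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= context_k <= x1:
            return y0 + (context_k - x0) * (y1 - y0) / (x1 - x0)


print(round(estimate_peak_vram_gb(96), 2))  # midpoint of the 64K and 128K rows: 44.25
```

The function deliberately raises outside the measured range, since nothing in the table supports extrapolating past 128K.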

Quick Start

pip install fraqtl-runtime

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "fraQtl/Qwen3.6-35B-A3B-compressed"

model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# → "The capital of France is Paris, a city renowned for its rich history…"

Why the wheel is required (you don't `import fraqtl` directly)

The `pip install fraqtl-runtime` step is necessary even though the example above never writes `import fraqtl`. Here's the chain:

1. You call `AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, ...)`.
2. `transformers` downloads `modeling_qwen36_fraqtl.py` from this repo (a small stub).
3. That stub does `from fraqtl.modeling_qwen36_fraqtl import Qwen36CompressedForCausalLM`.
4. The `fraqtl-runtime` wheel ships the compiled `.so` that this import resolves to.
5. The compiled loader reconstructs the packed MoE expert weights and returns the ready-to-use model.

No license token is required for loading — the wheel is free to install. A token is only required to enable the optional runtime KV-cache compression layer (see below).

If the wheel is missing, the import fails with a clear error pointing you back at `pip install fraqtl-runtime`. You never have to write `import fraqtl` yourself.
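If you would rather catch a missing wheel before starting a large download, a small preflight check is one option. This is a minimal sketch using only the standard library. The helper name `runtime_installed` is hypothetical, and the module name `fraqtl` is taken from the stub's import line described above.

```python
import importlib.util


def runtime_installed(module_name="fraqtl"):
    """Return True if a top-level module with this name can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not runtime_installed():
    print("fraqtl-runtime not found; run: pip install fraqtl-runtime")
```

`importlib.util.find_spec` only locates the module and does not execute it, so the check is cheap and has no side effects on the compiled loader.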

Linux x86_64 only for v0.1; macOS / ARM support coming.

What This Release Includes

This repository contains:

A public compressed-weight artifact for Qwen3.6-35B-A3B.
A lightweight loader for the compressed format.
Benchmark results: perplexity, MMLU, code benchmarks (HumanEval, HumanEval+, BigCodeBench), TruthfulQA-MC2, ∞Bench passkey, ∞Bench KV-retrieval, qualitative generation, throughput, and VRAM profile across contexts.

This repository does not include:

The fraQtl calibration pipeline.
The full compression engine.
Runtime KV-cache compression.
Custom model-packing tools.

Those components are distributed through the fraQtl runtime for approved early-access users.

Runtime KV-Cache Compression (Separate Layer)

This artifact is weight-compressed only. fraQtl runtime KV-cache compression — the additional live-cache compression at long context — is distributed separately as the fraQtl runtime for approved early-access users.

It is not required to load or evaluate this model. To request access, contact fraQtl directly.

Benchmarks

Long-Context Retrieval

Benchmark	Cells	Result
∞Bench Passkey @ 125 315 tokens	30	30/30 = 100%
∞Bench KV-retrieval (n=1 / 10 / 100 / 500 / 1500)	5	matched FP16 baseline: 80% (4/5) at every pair count

Stock FP16 baseline OOMs at 64K context on the same GPU.

Perplexity

FP16/BF16 baseline : 11.6041
fraQtl compressed  : 11.6930
Delta              : +0.0889 (+0.77%)

WikiText-2 test split, 16 × 256-token windows, greedy decode.
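For reference, perplexity is the exponential of the mean per-token negative log-likelihood, and the reported delta is a plain relative difference. The sketch below shows both. The function names are hypothetical, and this is not the evaluation harness used for the numbers above.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_delta_pct(baseline, candidate):
    """Relative change of candidate vs baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


# Reproduce the reported delta from the two perplexities above.
print(f"{relative_delta_pct(11.6041, 11.6930):.2f}%")  # 0.77%
```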

MMLU (Full 14 042 questions, 57 subjects)

Source	Accuracy
Qwen3.6-35B-A3B FP16	82.40%
fraQtl compressed	82.24%
Δ	−0.16 pp

The compressed model loses 0.16 pp on MMLU relative to the FP16 baseline, which is smaller than the binomial standard error of about 0.3 pp at this N. Shot count and subject sampling match the FP16 reference run.

HumanEval pass@1 (full N=164)

Source	Pass@1
Qwen3.6-35B-A3B Base FP16 reference	61.59%
fraQtl compressed	64.02%
Δ	tied within sample variance

At N=164, the binomial standard error on a HumanEval pass@1 near 62% is about 3.8 pp, so the 95% confidence interval is roughly ±7 pp. We frame this as tied rather than a "+2.43 pp gain" because we don't ship claims that could flip on a reviewer rerun. There is no measurable degradation on code generation relative to the FP16 baseline. Settings: greedy decode, full 164 problems, single seed. Trajectory: the running average over early problems (95% at N=20) converged toward the baseline as harder later problems came in, consistent with HumanEval's known easier-early/harder-late ordering.
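One way to sanity-check the "tied" framing is a normal-approximation binomial interval on each score. The sketch below assumes independent problems. The pass counts (101 and 105 out of 164) are inferred from the reported percentages rather than stated in the tables.

```python
import math


def binomial_ci(passed, total, z=1.96):
    """Normal-approximation confidence interval for a pass rate (z=1.96 gives ~95%)."""
    p = passed / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


fp16 = binomial_ci(101, 164)        # 101/164 = 61.59%
compressed = binomial_ci(105, 164)  # 105/164 = 64.02%

# The two results are indistinguishable at this N if the intervals overlap.
overlap = fp16[0] <= compressed[1] and compressed[0] <= fp16[1]
print(f"FP16:       {fp16[0]:.3f} to {fp16[1]:.3f}")
print(f"compressed: {compressed[0]:.3f} to {compressed[1]:.3f}")
print("intervals overlap:", overlap)
```

A paired test on per-problem outcomes would be tighter than comparing two independent intervals, but it needs the per-problem results, which are not part of this card.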

HumanEval+ pass@1 (full N=164)

Source	Pass@1
Qwen3.6-35B-A3B Base FP16 reference	54.27%
fraQtl compressed	55.49%
Δ	tied within sample variance

Same harness as HumanEval. Same conservative framing — within sample variance at this N, no measurable degradation on extended code-generation prompts.

BigCodeBench (N=200)

Source	Pass@1
Qwen3.6-35B-A3B Base FP16 reference	39.50% (79/200)
fraQtl compressed	39.50% (79/200)
Δ	identical pass count at this N

Identical pass count on identical problems. Of the benchmarks reported here, this is the closest match between the compressed weights and the FP16 reference on a non-trivial code benchmark.

TruthfulQA-MC2 (N=500)

Source	MC2 score
Qwen3.6-35B-A3B Base FP16 reference	~0.4553
fraQtl compressed	0.4493
Δ	tie within ±0.006

Tied within measurement noise. The FP16 reference run was measured on 450 of the 500 examples, with its final score expected to land within ±0.005 of 0.4553.

Qualitative Generation (Greedy)

Prompt	Result
`The capital of France is`	Coherent factual completion
`A transformer neural network is`	Coherent technical explanation
`Python lists differ from tuples in that`	Correct mutability distinction
`The French Revolution began in the year`	Correct factual completion

No loops or obvious gibberish observed across the qualitative smoke set.

Throughput (single A100-80GB)

Context	Tokens / sec	Peak VRAM
4 K	0.88	26.4 GB
16 K	0.85	29.2 GB
32 K	0.82	33.0 GB

100 fresh tokens generated after a real prefill at each context length, measured after one warmup pass.

Honest disclosure. This is not a real-time chat backend. The PackedFusedExperts loader dequantizes only the routed MoE experts per forward pass — that trades persistent VRAM (25 GB resident vs ~70 GB FP16) for per-token compute. At ~0.85 tok/s, the right use cases are long-context analysis, document processing, batch inference, and research workloads, not interactive chat.
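To put the throughput figures in wall-clock terms, the sketch below converts tokens per second into decode time for a given output length. It covers decode time only, since prefill time is not reported in the table. The helper name `decode_seconds` and the 500-token output length are illustrative assumptions.

```python
def decode_seconds(new_tokens, tokens_per_sec):
    """Wall-clock seconds to generate new_tokens at a steady decode rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return new_tokens / tokens_per_sec


# Rates from the Throughput table; 500 new tokens is an arbitrary example length.
for context, rate in (("4K", 0.88), ("16K", 0.85), ("32K", 0.82)):
    minutes = decode_seconds(500, rate) / 60
    print(f"{context}: about {minutes:.1f} min for 500 new tokens")
```

At these rates a 500-token answer takes on the order of ten minutes, which is why batch and offline workloads are the better fit.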

Long-context perplexity (WikiText-103)

Context	PPL	Windows averaged
4 K	5.3420	4
16 K	5.9521	4

Modest context-length growth consistent with FP16 baselines on this corpus. 32 K PPL not measured here (transformers' default loss path materializes full fp32 logits and OOMs on the long context — a measurement-tooling issue, not a model failure; 32 K throughput and ∞Bench retrieval at 125 K both pass on the same hardware).
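The logits issue is easy to size with back-of-the-envelope arithmetic: one fp32 logits tensor takes `seq_len × vocab_size × 4` bytes. In the sketch below, the vocabulary size is an illustrative placeholder and is not read from this model's config. Substitute the real value from `config.json` for an accurate figure.

```python
def logits_gib(seq_len, vocab_size, bytes_per_value=4):
    """Size in GiB of one [seq_len, vocab_size] logits tensor (fp32 by default)."""
    return seq_len * vocab_size * bytes_per_value / 2**30


ASSUMED_VOCAB = 150_000  # illustrative placeholder, NOT this model's actual vocabulary size

for ctx in (4 * 1024, 16 * 1024, 32 * 1024):
    print(f"{ctx:>6} tokens: {logits_gib(ctx, ASSUMED_VOCAB):.2f} GiB")
```

Under that assumption, a single fp32 logits tensor at 32K tokens is already in the high teens of GiB, before any intermediate copies made while computing the loss. Chunking the loss over the sequence dimension is a common way around this.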

Calibration Corpus and Task-Aware Variants

This artifact (v4) uses WikiText-2 as the calibration corpus — chosen for general-purpose preservation across the benchmarks above. Calibration corpus matters: it determines which input directions are protected during compression, and protected directions correlate most strongly with downstream tasks that share input distribution with the calibration data.

GSM8K-CoT 5-shot is intentionally omitted from the headline table. We measured v4 GSM8K-CoT 5-shot at 57.67% under the matched harness, vs FP16 baseline at 81.00%. WikiText calibration biases protected dimensions toward general-text fidelity, which trades off math chain-of-thought reasoning. We disclose the gap rather than hide it.

A separate task-aware variant (MBPP calibration, internal codename Phase 1.5) recovers and exceeds FP16 on GSM8K-CoT under the same harness:

Source	GSM8K-CoT 5-shot (N=300)
Qwen3.6-35B-A3B Base FP16 reference	81.00%
`v4` (WikiText calibration, this artifact)	57.67%
Phase 1.5 (MBPP calibration)	91.67%

Phase 1.5 is releasing as a separate artifact in the coming days (fraQtl/Qwen3.6-35B-A3B-compressed-math or similar) with its own model card and full matched-FP16 sweep. We're publishing it as a study on calibration-corpus / task-aware compression — empirical evidence that the right calibration corpus matters as much as the quantization scheme.

Pick the right artifact for your workload:

General-purpose / code / long-context retrieval → v4 (this artifact)
Math-CoT-heavy → wait for the task-aware variant

Limitations

Single-seed sanity grade. Long-context numbers (passkey, KV-retrieval, VRAM) and MMLU are single-seed. Multi-seed re-verification in progress.
Long-context workload variance. Evaluate on your own long-context workload before production use.
English MMLU only. Multilingual performance has not been systematically re-measured after compression.
transformers version. Requires transformers >= 4.51 for the Qwen 3.6 MoE architecture. Older versions don't recognize the model type.
Platform. Linux x86_64 with CUDA validated. Untested on Apple Silicon / AMD ROCm.
Fine-tuning. Out of scope. The compressed artifact is inference-only.
Live KV-cache savings require the fraQtl runtime (early access). Without it, only the weight compression is active; KV cache stays FP16.
Runtime memory depends on context length, batch size, backend, and hardware.
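The version requirement above can be checked before loading. The sketch below compares only the major and minor components, using the 4.51 minimum stated in the Limitations list. The helper name `meets_minimum` is hypothetical.

```python
import re

MIN_TRANSFORMERS = (4, 51)  # minimum version stated in the Limitations list


def meets_minimum(version_string, minimum=MIN_TRANSFORMERS):
    """Compare the leading numeric components of a version string against a minimum."""
    numbers = re.findall(r"\d+", version_string)[: len(minimum)]
    return tuple(int(n) for n in numbers) >= minimum


print(meets_minimum("4.51.0"))  # True
print(meets_minimum("4.50.3"))  # False
```

In practice the version string would come from `importlib.metadata.version("transformers")`, which raises `PackageNotFoundError` if the package is not installed.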

License

The Qwen3.6-35B-A3B base model is licensed by Alibaba under its own license terms. fraQtl's compression artifacts and loader code in this repository are Copyright © 2026 fraQtl AI. The compressed weights are released for research and evaluation use; see LICENSE for terms. The fraQtl runtime package and full compression engine are proprietary; contact fraQtl for commercial licensing.

Measure Your Own Model

fraQtl maintains an open diagnostic tool for measuring model compressibility before deeper compression workflows:

pip install fraqtl-diagnostic
fraqtl analyze <any-HF-model>

Source: https://github.com/fraqtl-ai/fraqtl-diagnostic
PyPI: https://pypi.org/project/fraqtl-diagnostic/
Apache-2.0 (the diagnostic only — not the compression engine).

Contact

Site: https://fraqtl.ai
Contact / pilots: https://fraqtl.ai/contact or contact@fraqtl.ai
GitHub: https://github.com/fraqtl-ai
Hugging Face: https://huggingface.co/fraQtl
