Qwen3.6-35B-A3B — fraQtl Compressed
A compressed Qwen3.6-35B-A3B artifact. Reduces model footprint and unlocks long-context inference on a single GPU while preserving core generation quality.
This release is the public weight-compressed artifact. fraQtl runtime KV-cache compression is a separate early-access layer.
For production inference pilots, custom compression workflows, and newer artifacts, see fraqtl.ai/contact or email contact@fraqtl.ai.
⚠️ Required: install fraqtl-runtime first
Without fraqtl-runtime, this model will not load. The wheel ships the compiled loader that decodes the packed compressed weights. There is no Python-level "enable" call to make: install the wheel, then from_pretrained works.
pip install fraqtl-runtime
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
repo = "fraQtl/Qwen3.6-35B-A3B-compressed"
model = AutoModelForCausalLM.from_pretrained(
repo, trust_remote_code=True,
torch_dtype=torch.bfloat16, device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)
ids = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
# → "The capital of France is Paris, a city renowned for..."
That's the entire setup. Compression is part of how the artifact is stored, not a runtime mode you toggle.
With vs Without fraqtl-runtime
| Mode | What happens |
|---|---|
| Without fraqtl-runtime installed | from_pretrained(..., trust_remote_code=True) raises ImportError. Model does not load. |
| With fraqtl-runtime installed | Model loads normally via from_pretrained. Compression is built into the loader; no manual API calls are needed. You get 2.74× smaller weights (25.53 GB on disk vs ~70 GB FP16) and 128K context on a single A100-80GB. |
Why a wheel: the loader reconstructs INT3-packed MoE expert tensors back into the form transformers expects. That codepath is shipped as a compiled binary (.so), so users never write import fraqtl themselves: trust_remote_code=True pulls a small stub from this repo that calls into the wheel's loader behind the scenes.
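To make "INT3-packed" concrete, here is a minimal pure-Python sketch of one common way to store 3-bit integers, where eight values share three bytes. It illustrates bit-packing in general and is not fraQtl's actual on-disk format, which this card does not document; the function names and the little-endian bit order are assumptions.

```python
def pack_int3(values):
    """Pack integers in [0, 7] into bytes, 3 bits each, little-endian bit order."""
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently held in the accumulator
    out = bytearray()
    for v in values:
        if not 0 <= v <= 7:
            raise ValueError(f"value {v} does not fit in 3 bits")
        bits |= v << nbits
        nbits += 3
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack_int3(data, count):
    """Recover `count` 3-bit integers from bytes produced by pack_int3."""
    bits = 0
    nbits = 0
    values = []
    it = iter(data)
    while len(values) < count:
        while nbits < 3:
            bits |= next(it) << nbits  # pull in one more byte
            nbits += 8
        values.append(bits & 0b111)
        bits >>= 3
        nbits -= 3
    return values


codes = [0, 7, 3, 5, 1, 6, 2, 4]
packed = pack_int3(codes)
assert len(packed) == 3  # 8 values x 3 bits = 24 bits = 3 bytes
assert unpack_int3(packed, len(codes)) == codes
```

A real loader would also apply per-group scales and offsets to turn these integer codes back into floating-point weights; that step is omitted here.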
Installation
This model requires one extra package before you can load it from
transformers:
pip install fraqtl-runtime
Why? The compressed weights need a custom loader that reconstructs the
packed MoE expert tensors at load time. That loader ships in the
fraqtl-runtime package as a compiled binary. You don't import it
yourself — when you call AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True), transformers downloads a tiny stub from this repo
that calls into the compiled loader behind the scenes. If the package is
missing, you get a clear ImportError pointing you back here.
Free to install — no license token required for loading or inference. A token is only required for the optional runtime KV-cache compression layer (separate paid feature, see section below).
Platform: Linux x86_64 + CUDA validated. macOS / ARM coming.
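The installation requirements above can be checked before attempting a multi-gigabyte download. The sketch below uses only the standard library; the helper name `runtime_preflight` is hypothetical, and it does not check for CUDA, which would need `torch` or a driver query.

```python
import platform
from importlib import metadata


def runtime_preflight(dist_name="fraqtl-runtime"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    try:
        metadata.version(dist_name)  # raises if the distribution is not installed
    except metadata.PackageNotFoundError:
        problems.append(f"{dist_name} is not installed: run `pip install {dist_name}`")
    if platform.system() != "Linux":
        problems.append(f"validated on Linux only, found {platform.system()}")
    if platform.machine() not in ("x86_64", "AMD64"):
        problems.append(f"validated on x86_64 only, found {platform.machine()}")
    return problems


for problem in runtime_preflight():
    print("preflight:", problem)
```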
Once installed, use the standard transformers API as shown in the
Quick Start below.
Headline Results
| Metric | Upstream / Reference | fraQtl Compressed | Notes |
|---|---|---|---|
| Size on disk | ~70 GB | 25.53 GB | 2.74× smaller |
| Peak VRAM @ 16K context | — | 25.6 GB | single A100-80GB |
| Peak VRAM @ 64K context | OOM in tested stock setup | 36.8 GB | stock setup exceeded VRAM |
| Peak VRAM @ 128K context | OOM in tested stock setup | 51.7 GB | compressed fits |
| ∞Bench Passkey @ ~125K tok | — | 30/30 = 100% | compressed run |
| ∞Bench KV-retrieval† | matched FP16 reference | matched on n=1/10/100/500/1500 | 80% (4/5) at every pair count, identical to FP16 |
| WikiText-2 PPL | 11.6041 | 11.6930 | +0.77% |
| MMLU (full 14K, 57 subjects) | 82.40% (FP16 reference) | 82.24% | −0.16pp vs FP16 |
| HumanEval pass@1 (N=164) | 61.59% (FP16 reference) | 64.02% | tied within sample variance |
| HumanEval+ pass@1 (N=164) | 54.27% (FP16 reference) | 55.49% | tied within sample variance |
| BigCodeBench (N=200) | 39.50% (FP16 reference) | 39.50% | bit-identical at this N |
| TruthfulQA-MC2 (N=500) | ~0.4553 (FP16 reference) | 0.4493 | tie within ±0.006 |
All measurements on a single NVIDIA A100-80GB. Single-seed sanity grade; multi-seed re-verification in progress.
† ∞Bench KV-retrieval diagnostic: matched FP16 reference behavior on n=1, 10, 100, 500, and 1500 needle settings (~101K-token context at n=1500). Compressed accuracy = 80% (4/5) at every pair count, identical to FP16. The single failing case reproduces at FP16 with the same example seed — it's a property of one specific example, not a depth/compression artifact. Peak VRAM at n=1500: 48.7 GB on a single A100-80GB.
Summary: ~2.74× smaller on disk, 128K context on a single A100-80GB (stock FP16 OOMs at 64K), MMLU within 0.16pp of FP16 and HumanEval within sample variance of FP16 baseline.
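For planning context lengths between the measured points, the peak-VRAM rows of the table can be interpolated linearly. This is a rough sketch, assuming "16K", "64K", and "128K" mean multiples of 1024 tokens and that growth between measured points is linear; the function names are hypothetical, and figures measured under other workloads (for example the throughput table) will differ.

```python
# Peak VRAM (GB) for the compressed artifact, copied from the table above.
MEASURED = [(16 * 1024, 25.6), (64 * 1024, 36.8), (128 * 1024, 51.7)]


def estimate_peak_vram_gb(context_tokens):
    """Linearly interpolate peak VRAM between measured context lengths."""
    lo_ctx, hi_ctx = MEASURED[0][0], MEASURED[-1][0]
    if not lo_ctx <= context_tokens <= hi_ctx:
        raise ValueError("outside the measured range; extrapolation is not supported")
    for (x0, y0), (x1, y1) in zip(MEASURED, MEASURED[1:]):
        if x0 <= context_tokens <= x1:
            return y0 + (y1 - y0) * (context_tokens - x0) / (x1 - x0)


def gb_per_1k_tokens(p0, p1):
    """Slope of the VRAM curve between two measured points, in GB per 1024 tokens."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0) / ((x1 - x0) / 1024)


print(round(estimate_peak_vram_gb(96 * 1024), 2))
print(round(gb_per_1k_tokens(MEASURED[0], MEASURED[1]), 3))
print(round(gb_per_1k_tokens(MEASURED[1], MEASURED[2]), 3))
```

The two slopes come out nearly equal, which is what makes a linear estimate reasonable inside the measured range.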
Quick Start
pip install fraqtl-runtime
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
repo = "fraQtl/Qwen3.6-35B-A3B-compressed"
model = AutoModelForCausalLM.from_pretrained(
repo,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)
prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# → "The capital of France is Paris, a city renowned for its rich history…"
Why the wheel is required (you don't import fraqtl directly)
The pip install fraqtl-runtime step is necessary even though the example
above never writes import fraqtl. Here's the chain:
- You call AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, ...).
- transformers downloads modeling_qwen36_fraqtl.py from this repo (a small stub).
- That stub does from fraqtl.modeling_qwen36_fraqtl import Qwen36CompressedForCausalLM.
- The fraqtl-runtime wheel ships the compiled .so that this import resolves to.
- The compiled loader reconstructs the packed MoE expert weights and returns the ready-to-use model.
No license token is required for loading — the wheel is free to install. A token is only required to enable the optional runtime KV-cache compression layer (see below).
If the wheel is missing, the import fails with a clear error pointing you back
at pip install fraqtl-runtime. You never have to write import fraqtl
yourself.
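The import chain above follows a common pattern: a thin stub imports a class from a separately installed package and turns a missing package into an actionable error. The sketch below shows that pattern in general terms; it is not the contents of the real stub, and the helper name `load_class_from_wheel` is hypothetical. The module and class names are the ones listed in the chain.

```python
import importlib


def load_class_from_wheel(module_name, class_name, pip_name="fraqtl-runtime"):
    """Import `class_name` from `module_name`, or fail with an install hint."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_name!r}. Install the loader first: "
            f"pip install {pip_name}"
        ) from exc
    return getattr(module, class_name)


try:
    cls = load_class_from_wheel(
        "fraqtl.modeling_qwen36_fraqtl", "Qwen36CompressedForCausalLM"
    )
    print("loader available:", cls.__name__)
except ImportError as err:
    print(err)
```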
Linux x86_64 only for v0.1; macOS / ARM support coming.
What This Release Includes
This repository contains:
- A public compressed-weight artifact for Qwen3.6-35B-A3B.
- A lightweight loader for the compressed format.
- Benchmark results: perplexity, MMLU, ∞Bench passkey, ∞Bench KV-retrieval, code benchmarks (HumanEval, HumanEval+, BigCodeBench), TruthfulQA-MC2, qualitative generation, throughput, and VRAM profile across contexts.
This repository does not include:
- The fraQtl calibration pipeline.
- The full compression engine.
- Runtime KV-cache compression.
- Custom model-packing tools.
Those components are distributed through the fraQtl runtime for approved early-access users.
Runtime KV-Cache Compression (Separate Layer)
This artifact is weight-compressed only. fraQtl runtime KV-cache compression — the additional live-cache compression at long context — is distributed separately as the fraQtl runtime for approved early-access users.
It is not required to load or evaluate this model. To request access, contact fraQtl directly.
Benchmarks
Long-Context Retrieval
| Benchmark | Cells | Result |
|---|---|---|
| ∞Bench Passkey @ 125 315 tokens | 30 | 30/30 = 100% |
| ∞Bench KV-retrieval (n=1 / 10 / 100 / 500 / 1500) | 5 | matched FP16 baseline (80% (4/5) at every pair count) |
Stock FP16 baseline OOMs at 64K context on the same GPU.
Perplexity
FP16/BF16 baseline : 11.6041
fraQtl compressed : 11.6930
Delta : +0.0889 (+0.77%)
WikiText-2 test split, 16 × 256-token windows, greedy decode.
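For readers reproducing the comparison, perplexity is the exponential of the mean per-token negative log-likelihood, and the reported delta is a simple relative change. The sketch below assumes token-averaged NLL in nats; how the evaluation harness aggregates across the 16 windows is not stated here, so treat the aggregation as an assumption.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_delta_pct(baseline, compressed):
    """Relative change of `compressed` versus `baseline`, in percent."""
    return (compressed - baseline) / baseline * 100.0


# Reported values from this section.
baseline_ppl, compressed_ppl = 11.6041, 11.6930
print(round(compressed_ppl - baseline_ppl, 4))                     # 0.0889
print(round(relative_delta_pct(baseline_ppl, compressed_ppl), 2))  # 0.77

# Toy check: if every token has probability 1/4, perplexity is 4.
print(round(perplexity([math.log(4)] * 10), 6))                    # 4.0
```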
MMLU (Full 14 042 questions, 57 subjects)
| Source | Accuracy |
|---|---|
| Qwen3.6-35B-A3B FP16 | 82.40% |
| fraQtl compressed | 82.24% |
| Δ | −0.16 pp |
Effectively zero retention loss on MMLU vs the FP16 baseline. Matched shot-count and subject sampling against the FP16 reference run.
HumanEval pass@1 (full N=164)
| Source | Pass@1 |
|---|---|
| Qwen3.6-35B-A3B Base FP16 reference | 61.59% |
| fraQtl compressed | 64.02% |
| Δ | tied within sample variance |
Under a simple binomial model, a pass rate near 62% at N=164 has a standard error of roughly 3.8 pp (a 95% interval of about ±7 pp). We frame this as tied rather than a "+2.43 pp gain" because we don't ship claims that could flip on a reviewer rerun. No measurable degradation on code generation relative to the FP16 baseline. Greedy decode, full 164 problems, single seed. Trajectory: the early-problem average (95% at N=20) converged toward the baseline as harder later problems came in, consistent with HumanEval's known easier-early/harder-late ordering.
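As a rough check on the "tied" framing, the sampling noise of a pass rate can be estimated from the reported numbers. This sketch assumes each problem is an independent Bernoulli trial and treats the two runs as independent; a paired test on per-problem outcomes would be the more precise tool, and the helper names are hypothetical.

```python
import math


def binomial_se(pass_rate, n):
    """Normal-approximation standard error of a pass rate over n problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n)


def diff_z_score(p_a, p_b, n):
    """z-score of the difference of two independent pass rates, same n each."""
    se_diff = math.sqrt(binomial_se(p_a, n) ** 2 + binomial_se(p_b, n) ** 2)
    return (p_b - p_a) / se_diff


fp16, compressed, n = 0.6159, 0.6402, 164
print(f"SE per run: {100 * binomial_se(fp16, n):.1f} pp")
print(f"z for the difference: {diff_z_score(fp16, compressed, n):.2f}")
```

A z-score well below 1, as here, means the observed difference is not distinguishable from sampling noise at this N.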
HumanEval+ pass@1 (full N=164)
| Source | Pass@1 |
|---|---|
| Qwen3.6-35B-A3B Base FP16 reference | 54.27% |
| fraQtl compressed | 55.49% |
| Δ | tied within sample variance |
Same harness as HumanEval. Same conservative framing — within sample variance at this N, no measurable degradation on extended code-generation prompts.
BigCodeBench (N=200)
| Source | Pass@1 |
|---|---|
| Qwen3.6-35B-A3B Base FP16 reference | 39.50% (79/200) |
| fraQtl compressed | 39.50% (79/200) |
| Δ | bit-identical at this N |
Identical pass count on identical problems. Strongest single-axis evidence that the compressed weights preserve the FP16 reasoning trajectory on a non-trivial code benchmark.
TruthfulQA-MC2 (N=500)
| Source | MC2 score |
|---|---|
| Qwen3.6-35B-A3B Base FP16 reference | ~0.4553 |
| fraQtl compressed | 0.4493 |
| Δ | tie within ±0.006 |
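Tied within measurement noise. The FP16 reference was measured through 450 of the 500 questions, which is why its score is given as approximate; the final value is expected to land within ±0.005 of 0.4553.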
Qualitative Generation (Greedy)
| Prompt | Result |
|---|---|
| The capital of France is | Coherent factual completion |
| A transformer neural network is | Coherent technical explanation |
| Python lists differ from tuples in that | Correct mutability distinction |
| The French Revolution began in the year | Correct factual completion |
No loops or obvious gibberish observed across the qualitative smoke set.
Throughput (single A100-80GB)
| Context | Tokens / sec | Peak VRAM |
|---|---|---|
| 4 K | 0.88 | 26.4 GB |
| 16 K | 0.85 | 29.2 GB |
| 32 K | 0.82 | 33.0 GB |
100 fresh tokens generated after a real prefill at each context length, measured after one warmup pass.
Honest disclosure. This is not a real-time chat backend. The PackedFusedExperts loader dequantizes only the routed MoE experts per forward pass — that trades persistent VRAM (25 GB resident vs ~70 GB FP16) for per-token compute. At ~0.85 tok/s, the right use cases are long-context analysis, document processing, batch inference, and research workloads, not interactive chat.
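To see what these rates mean for job planning, decode time can be estimated directly from the table. The sketch below covers generation only and excludes prefill time, which is not reported here; the function name is hypothetical.

```python
# Measured decode throughput (tokens/sec) from the table above, keyed by context.
THROUGHPUT_TOK_PER_S = {"4K": 0.88, "16K": 0.85, "32K": 0.82}


def decode_seconds(new_tokens, context="16K"):
    """Estimated wall-clock seconds to generate `new_tokens` after prefill."""
    return new_tokens / THROUGHPUT_TOK_PER_S[context]


for ctx in THROUGHPUT_TOK_PER_S:
    secs = decode_seconds(100, ctx)
    print(f"{ctx}: 100 tokens in about {secs:.0f} s")
```

At roughly two minutes per 100 generated tokens, batch and offline workloads are the practical fit.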
Long-context perplexity (WikiText-103)
| Context | PPL | Windows averaged |
|---|---|---|
| 4 K | 5.3420 | 4 |
| 16 K | 5.9521 | 4 |
Modest context-length growth consistent with FP16 baselines on this corpus. 32 K PPL not measured here (transformers' default loss path materializes full fp32 logits and OOMs on the long context — a measurement-tooling issue, not a model failure; 32 K throughput and ∞Bench retrieval at 125 K both pass on the same hardware).
Calibration Corpus and Task-Aware Variants
This artifact (v4) uses WikiText-2 as the calibration corpus — chosen
for general-purpose preservation across the benchmarks above. Calibration
corpus matters: it determines which input directions are protected during
compression, and protected directions correlate most strongly with downstream
tasks that share input distribution with the calibration data.
GSM8K-CoT 5-shot is intentionally omitted from the headline table. We
measured v4 GSM8K-CoT 5-shot at 57.67% under the matched harness, vs
FP16 baseline at 81.00%. WikiText calibration biases protected dimensions
toward general-text fidelity, which trades off math chain-of-thought
reasoning. We disclose the gap rather than hide it.
A separate math-corpus-calibrated variant (MBPP calibration, internal codename Phase 1.5) recovers and exceeds FP16 on GSM8K-CoT under the same harness:
| Source | GSM8K-CoT 5-shot (N=300) |
|---|---|
| Qwen3.6-35B-A3B Base FP16 reference | 81.00% |
| v4 (WikiText calibration, this artifact) | 57.67% |
| Phase 1.5 (MBPP calibration) | 91.67% |
Phase 1.5 is releasing as a separate artifact in the coming days
(fraQtl/Qwen3.6-35B-A3B-compressed-math or similar) with its own model
card and full matched-FP16 sweep. We're publishing it as a study on
calibration-corpus / task-aware compression — empirical evidence that
the right calibration corpus matters as much as the quantization
scheme.
Pick the right artifact for your workload:
- General-purpose / code / long-context retrieval → v4 (this artifact)
- Math-CoT-heavy → wait for the task-aware variant
Limitations
- Single-seed sanity grade. Long-context numbers (passkey, KV-retrieval, VRAM) and MMLU are single-seed. Multi-seed re-verification in progress.
- Long-context workload variance. Evaluate on your own long-context workload before production use.
- English MMLU only. Multilingual performance has not been systematically re-measured after compression.
- transformers version. Requires transformers >= 4.51 for the Qwen 3.6 MoE architecture. Older versions don't recognize the model type.
- Platform. Linux x86_64 with CUDA validated. Untested on Apple Silicon / AMD ROCm.
- Fine-tuning. Out of scope. The compressed artifact is inference-only.
- Live KV-cache savings require the fraQtl runtime (early access). Without it, only the weight compression is active; KV cache stays FP16.
- Runtime memory depends on context length, batch size, backend, and hardware.
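Because the KV cache stays FP16 without the runtime layer, its size can be estimated with the standard formula for attention caches. This is a generic sketch: the configuration values in the example are hypothetical and are not this model's real settings (read those from its config.json), and layers that do not use standard key/value attention, if any, do not follow this formula.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Size of an uncompressed KV cache for standard attention layers.

    The factor 2 covers keys and values; bytes_per_elem=2 matches FP16/BF16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration for illustration only, NOT this model's real config.
example = kv_cache_bytes(seq_len=32 * 1024, num_layers=40,
                         num_kv_heads=4, head_dim=128)
print(f"{example / 2**30:.2f} GiB")
```

The cache grows linearly with both context length and batch size, which is why the memory figures in this card are tied to a specific context and a single sequence.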
License
The Qwen3.6-35B-A3B base model is licensed by Alibaba under its own license
terms. fraQtl's compression artifacts and loader code in this repository are
Copyright © 2026 fraQtl AI. The compressed weights are released for
research and evaluation use; see LICENSE for terms. The fraQtl runtime
package and full compression engine are proprietary; contact for commercial
licensing.
Measure Your Own Model
fraQtl maintains an open diagnostic tool for measuring model compressibility before deeper compression workflows:
pip install fraqtl-diagnostic
fraqtl analyze <any-HF-model>
- Source: https://github.com/fraqtl-ai/fraqtl-diagnostic
- PyPI: https://pypi.org/project/fraqtl-diagnostic/
- Apache-2.0 (the diagnostic only — not the compression engine).
Contact
- Site: https://fraqtl.ai
- Contact / pilots: https://fraqtl.ai/contact or contact@fraqtl.ai
- GitHub: https://github.com/fraqtl-ai
- Hugging Face: https://huggingface.co/fraQtl