Qwen3.6-35B-A3B — fraQtl Compressed

A compressed Qwen3.6-35B-A3B artifact. It cuts the on-disk footprint from ~70 GB (FP16) to 25.53 GB and runs 128K-context inference on a single A100-80GB, while staying close to the FP16 reference on the general-purpose benchmarks reported below.

This release is the public weight-compressed artifact. fraQtl runtime KV-cache compression is a separate early-access layer.

For production inference pilots, custom compression workflows, and newer artifacts, see fraqtl.ai/contact or email contact@fraqtl.ai.

⚠️ Required: install fraqtl-runtime first

Without fraqtl-runtime, this model will not load. The wheel ships the compiled loader that decodes the packed compressed weights. There is no Python-level "enable" call to make — install the wheel, then from_pretrained works.

pip install fraqtl-runtime

Then load the model from Python:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "fraQtl/Qwen3.6-35B-A3B-compressed"
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True,
    torch_dtype=torch.bfloat16, device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)

ids = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
# → "The capital of France is Paris, a city renowned for..."

That's the entire setup. Compression is part of how the artifact is stored, not a runtime mode you toggle.

With vs Without fraqtl-runtime

| Mode | What happens |
| --- | --- |
| Without fraqtl-runtime installed | from_pretrained(..., trust_remote_code=True) raises ImportError. Model does not load. |
| With fraqtl-runtime installed | Model loads normally via from_pretrained. Compression is built into the loader, so no manual API calls are needed. You get 2.74× smaller weights (25.53 GB on disk vs ~70 GB FP16) and 128K context on a single A100-80GB. |

Why a wheel: the loader reconstructs INT3-packed MoE expert tensors back into the form transformers expects. That codepath is shipped as a compiled binary (.so) so users never write import fraqtl themselves — trust_remote_code=True pulls a small stub from this repo that calls into the wheel's loader behind the scenes.
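To make "INT3-packed" concrete, here is a minimal pure-Python sketch of packing 3-bit codes into bytes and unpacking them again. It is an illustration only: the actual fraQtl on-disk layout, bit order, and scaling scheme are not documented here, and the function names, the little-endian bit order, and the `scale` / `zero_point` dequantization are all assumptions.

```python
def pack_int3(values):
    """Pack integers in [0, 7] into bytes, 3 bits each (assumed little-endian bit order)."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for v in values:
        if not 0 <= v <= 7:
            raise ValueError("INT3 codes must be in [0, 7]")
        acc |= v << nbits
        nbits += 3
        while nbits >= 8:          # flush every full byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)     # flush the final partial byte
    return bytes(out)


def unpack_int3(packed, count):
    """Recover `count` 3-bit codes from bytes produced by pack_int3."""
    acc = 0
    nbits = 0
    out = []
    stream = iter(packed)
    for _ in range(count):
        while nbits < 3:           # refill the accumulator one byte at a time
            acc |= next(stream) << nbits
            nbits += 8
        out.append(acc & 0b111)
        acc >>= 3
        nbits -= 3
    return out


def dequantize(codes, scale, zero_point):
    """Map integer codes back to floats (hypothetical affine scheme)."""
    return [(c - zero_point) * scale for c in codes]


codes = [1, 2, 3, 4, 5, 6, 7, 0]
packed = pack_int3(codes)          # 8 codes x 3 bits = 24 bits = 3 bytes
restored = unpack_int3(packed, len(codes))
```

Eight 3-bit codes fit in three bytes, compared with sixteen bytes for the same eight values in FP16, which is where the storage saving on the packed tensors comes from.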


Installation

This model requires one extra package before you can load it from transformers:

pip install fraqtl-runtime

Why? The compressed weights need a custom loader that reconstructs the packed MoE expert tensors at load time. That loader ships in the fraqtl-runtime package as a compiled binary. You don't import it yourself — when you call AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True), transformers downloads a tiny stub from this repo that calls into the compiled loader behind the scenes. If the package is missing, you get a clear ImportError pointing you back here.

Free to install — no license token required for loading or inference. A token is only required for the optional runtime KV-cache compression layer (separate paid feature, see section below).

Platform: Linux x86_64 + CUDA validated. macOS / ARM coming.
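A small pre-flight check can catch the two most common setup problems before a long download starts. This is a sketch, not part of the fraQtl tooling: the `preflight` helper is hypothetical, and it assumes the wheel installs a top-level module named `fraqtl` (the name used in the import chain described later in this card). It does not check for CUDA.

```python
import importlib.util
import platform


def preflight(module_name="fraqtl"):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # The card lists Linux x86_64 as the only validated platform.
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        problems.append(
            f"Unvalidated platform: {platform.system()} {platform.machine()} "
            "(only Linux x86_64 + CUDA is validated)"
        )
    # find_spec returns None when a top-level module is not importable.
    if importlib.util.find_spec(module_name) is None:
        problems.append(
            f"Module '{module_name}' not found: run 'pip install fraqtl-runtime'"
        )
    return problems


for problem in preflight():
    print("WARNING:", problem)
```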

Once installed, use the standard transformers API as shown in the Quick Start below.


Headline Results

| Metric | Upstream / Reference | fraQtl Compressed | Notes |
| --- | --- | --- | --- |
| Size on disk | ~70 GB | 25.53 GB | 2.74× smaller |
| Peak VRAM @ 16K context | | 25.6 GB | single A100-80GB |
| Peak VRAM @ 64K context | OOM in tested stock setup | 36.8 GB | stock setup exceeded VRAM |
| Peak VRAM @ 128K context | OOM in tested stock setup | 51.7 GB | compressed fits |
| ∞Bench Passkey @ ~125K tok | | 30/30 = 100% | compressed run |
| ∞Bench KV-retrieval† | matched FP16 reference | matched on n=1/10/100/500/1500 | 80% (4/5) at every pair count, identical to FP16 |
| WikiText-2 PPL | 11.6041 | 11.6930 | +0.77% |
| MMLU (full 14K, 57 subjects) | 82.40% (FP16 reference) | 82.24% | −0.16pp vs FP16 |
| HumanEval pass@1 (N=164) | 61.59% (FP16 reference) | 64.02% | tied within sample variance |
| HumanEval+ pass@1 (N=164) | 54.27% (FP16 reference) | 55.49% | tied within sample variance |
| BigCodeBench (N=200) | 39.50% (FP16 reference) | 39.50% | bit-identical at this N |
| TruthfulQA-MC2 (N=500) | ~0.4553 (FP16 reference) | 0.4493 | tie within ±0.006 |

All measurements on a single NVIDIA A100-80GB. Single-seed sanity grade; multi-seed re-verification in progress.

† ∞Bench KV-retrieval diagnostic: matched FP16 reference behavior on n=1, 10, 100, 500, and 1500 needle settings (~101K-token context at n=1500). Compressed accuracy = 80% (4/5) at every pair count, identical to FP16. The single failing case reproduces at FP16 with the same example seed — it's a property of one specific example, not a depth/compression artifact. Peak VRAM at n=1500: 48.7 GB on a single A100-80GB.

Summary: ~2.74× smaller on disk, 128K context on a single A100-80GB (stock FP16 OOMs at 64K), MMLU within 0.16pp of FP16 and HumanEval within sample variance of FP16 baseline.
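The headline size ratio can be reproduced from the two on-disk figures. The sketch below also derives an average bits-per-parameter number; that part assumes exactly 35e9 parameters (read off the model name) and decimal gigabytes, neither of which is stated in this card.

```python
FP16_GB = 70.0          # approximate upstream size on disk (from the table)
COMPRESSED_GB = 25.53   # compressed size on disk (from the table)
PARAMS = 35e9           # assumption: parameter count taken from the "35B" in the name

ratio = FP16_GB / COMPRESSED_GB
# Average storage cost per parameter, assuming 1 GB = 1e9 bytes.
bits_per_param = COMPRESSED_GB * 1e9 * 8 / PARAMS

print(f"compression ratio: {ratio:.2f}x")          # 2.74x
print(f"average bits/param: {bits_per_param:.2f}")  # 5.84
```

Under these assumptions the average works out to roughly 5.8 bits per parameter. That is higher than 3 bits, which would be consistent with only part of the model (the MoE expert tensors) being INT3-packed, though the card does not give a per-tensor breakdown.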


Quick Start

pip install fraqtl-runtime

Then load the model from Python:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "fraQtl/Qwen3.6-35B-A3B-compressed"

model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# → "The capital of France is Paris, a city renowned for its rich history…"

Why the wheel is required (you don't import fraqtl directly)

The pip install fraqtl-runtime step is necessary even though the example above never writes import fraqtl. Here's the chain:

  1. You call AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, ...).
  2. transformers downloads modeling_qwen36_fraqtl.py from this repo (a small stub).
  3. That stub does from fraqtl.modeling_qwen36_fraqtl import Qwen36CompressedForCausalLM.
  4. The fraqtl-runtime wheel ships the compiled .so that this import resolves to.
  5. The compiled loader reconstructs the packed MoE expert weights and returns the ready-to-use model.
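The import chain above can be sketched as a guarded import. This is not the actual contents of modeling_qwen36_fraqtl.py; the `load_model_class` helper and the error wording are hypothetical, and only the module path and class name are taken from step 3.

```python
import importlib

INSTALL_HINT = "This model needs the compiled loader: pip install fraqtl-runtime"


def load_model_class(
    module_name="fraqtl.modeling_qwen36_fraqtl",
    class_name="Qwen36CompressedForCausalLM",
):
    """Import the model class from the wheel, or fail with an actionable message."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # wheel lands here and is re-raised with the install hint attached.
        raise ImportError(INSTALL_HINT) from exc
    return getattr(module, class_name)
```

Wrapping the import this way is what lets a stub turn a bare "No module named ..." failure into a message that tells the user which package to install.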

No license token is required for loading — the wheel is free to install. A token is only required to enable the optional runtime KV-cache compression layer (see below).

If the wheel is missing, the import fails with a clear error pointing you back at pip install fraqtl-runtime. You never have to write import fraqtl yourself.

Linux x86_64 only for v0.1; macOS / ARM support coming.


What This Release Includes

This repository contains:

  • A public compressed-weight artifact for Qwen3.6-35B-A3B.
  • A lightweight loader for the compressed format.
  • Benchmark results: perplexity, MMLU, HumanEval and HumanEval+, BigCodeBench, TruthfulQA-MC2, ∞Bench passkey, ∞Bench KV-retrieval, qualitative generation, and throughput and VRAM profiles across context lengths.

This repository does not include:

  • The fraQtl calibration pipeline.
  • The full compression engine.
  • Runtime KV-cache compression.
  • Custom model-packing tools.

Those components are distributed through the fraQtl runtime for approved early-access users.


Runtime KV-Cache Compression (Separate Layer)

This artifact is weight-compressed only. fraQtl runtime KV-cache compression — the additional live-cache compression at long context — is distributed separately as the fraQtl runtime for approved early-access users.

It is not required to load or evaluate this model. To request access, contact fraQtl directly.


Benchmarks

Long-Context Retrieval

| Benchmark | Cells | Result |
| --- | --- | --- |
| ∞Bench Passkey @ 125 315 tokens | 30 | 30/30 = 100% |
| ∞Bench KV-retrieval (n=1 / 10 / 100 / 500 / 1500) | 5 | matched FP16 baseline (80% (4/5) at every pair count) |

Stock FP16 baseline OOMs at 64K context on the same GPU.
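For readers unfamiliar with passkey retrieval, the sketch below shows the general shape of such a test: a short "needle" sentence is buried inside a long run of filler text and the model is asked to recall it. It is not the ∞Bench harness; the filler text, prompt wording, and helper names are all made up for illustration.

```python
FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. "


def build_passkey_prompt(passkey, n_filler, depth):
    """Hide one passkey sentence inside n_filler copies of filler text.

    depth in [0, 1] controls how far into the context the needle is placed.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    needle = f"The pass key is {passkey}. Remember it. "
    cut = int(n_filler * depth)  # number of filler copies before the needle
    context = FILLER * cut + needle + FILLER * (n_filler - cut)
    return context + "\nWhat is the pass key?"


def passkey_accuracy(outputs, passkeys):
    """Fraction of model outputs that contain the expected passkey string."""
    hits = sum(key in out for out, key in zip(outputs, passkeys))
    return hits / len(passkeys)


prompt = build_passkey_prompt("71432", n_filler=10, depth=0.5)
```

A 30/30 result in this style of test means the expected key appeared in every one of the 30 model outputs.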

Perplexity

FP16/BF16 baseline : 11.6041
fraQtl compressed  : 11.6930
Delta              : +0.0889 (+0.77%)

WikiText-2 test split, 16 × 256-token windows.
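As a reminder of what these numbers mean, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that definition and reproduces the reported delta from the two perplexity values; the `perplexity` helper is illustrative and is not the evaluation script used for this card.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


baseline_ppl = 11.6041     # FP16/BF16 baseline (from the block above)
compressed_ppl = 11.6930   # fraQtl compressed (from the block above)

delta = compressed_ppl - baseline_ppl
relative_pct = delta / baseline_ppl * 100

print(f"{delta:+.4f} ({relative_pct:+.2f}%)")  # +0.0889 (+0.77%)
```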

MMLU (Full 14 042 questions, 57 subjects)

| Source | Accuracy |
| --- | --- |
| Qwen3.6-35B-A3B FP16 | 82.40% |
| fraQtl compressed | 82.24% |
| Δ | −0.16 pp |

Effectively no accuracy loss on MMLU vs the FP16 baseline. Shot count and subject sampling were matched to the FP16 reference run.

HumanEval pass@1 (full N=164)

| Source | Pass@1 |
| --- | --- |
| Qwen3.6-35B-A3B Base FP16 reference | 61.59% |
| fraQtl compressed | 64.02% |
| Δ | tied within sample variance |

At N=164 and a pass rate near 62%, HumanEval has a binomial standard error of about 3.8 pp, which gives a 95% confidence interval of roughly ±7.4 pp. We frame this as tied rather than a "+2.43 pp gain" because we don't ship claims that could flip on a reviewer rerun. No measurable degradation on code generation relative to the FP16 baseline. Greedy decode, full 164 problems, single seed. Trajectory: the running average over early problems (95% at N=20) converged toward the baseline as harder later problems came in, consistent with HumanEval's known easier-early/harder-late ordering.
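The "tied" framing can be sanity-checked with a simple normal-approximation interval for a binomial proportion. This is a rough sketch under stated assumptions: it treats each problem as an independent pass/fail trial and ignores that both runs share the same 164 problems, so it is not a substitute for a paired test.

```python
import math


def binomial_ci_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


N = 164
fp16_pass = 0.6159        # FP16 reference pass@1 (from the table)
compressed_pass = 0.6402  # fraQtl compressed pass@1 (from the table)

halfwidth = binomial_ci_halfwidth(fp16_pass, N)
gap = compressed_pass - fp16_pass

print(f"95% half-width: {halfwidth * 100:.1f} pp")  # 7.4 pp
print(f"observed gap:   {gap * 100:.2f} pp")        # 2.43 pp
```

Under this approximation the standard error is about 3.8 pp and the 95% half-width about 7.4 pp, so the 2.43 pp gap sits well inside the noise.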

HumanEval+ pass@1 (full N=164)

| Source | Pass@1 |
| --- | --- |
| Qwen3.6-35B-A3B Base FP16 reference | 54.27% |
| fraQtl compressed | 55.49% |
| Δ | tied within sample variance |

Same harness as HumanEval. Same conservative framing — within sample variance at this N, no measurable degradation on extended code-generation prompts.

BigCodeBench (N=200)

| Source | Pass@1 |
| --- | --- |
| Qwen3.6-35B-A3B Base FP16 reference | 39.50% (79/200) |
| fraQtl compressed | 39.50% (79/200) |
| Δ | bit-identical at this N |

Identical pass count on identical problems. Strongest single-axis evidence that the compressed weights preserve the FP16 reasoning trajectory on a non-trivial code benchmark.

TruthfulQA-MC2 (N=500)

| Source | MC2 score |
| --- | --- |
| Qwen3.6-35B-A3B Base FP16 reference | ~0.4553 |
| fraQtl compressed | 0.4493 |
| Δ | tie within ±0.006 |

Tied within measurement noise. The FP16 reference was measured through 450 of the 500 questions; its final score is expected to land within ±0.005 of 0.4553.

Qualitative Generation (Greedy)

| Prompt | Result |
| --- | --- |
| The capital of France is | Coherent factual completion |
| A transformer neural network is | Coherent technical explanation |
| Python lists differ from tuples in that | Correct mutability distinction |
| The French Revolution began in the year | Correct factual completion |

No loops or obvious gibberish observed across the qualitative smoke set.

Throughput (single A100-80GB)

| Context | Tokens / sec | Peak VRAM |
| --- | --- | --- |
| 4 K | 0.88 | 26.4 GB |
| 16 K | 0.85 | 29.2 GB |
| 32 K | 0.82 | 33.0 GB |

100 fresh tokens generated after a real prefill at each context length, measured after one warmup pass.
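A measurement of this shape (one warmup pass, then a timed generation of a fixed number of new tokens) can be sketched as below. It is not the script used for the table: `measure_tokens_per_sec` and `generate_fn` are hypothetical names, and `generate_fn` is assumed to run prefill as needed and to block until all tokens have been produced.

```python
import time


def measure_tokens_per_sec(generate_fn, new_tokens=100, warmup_passes=1):
    """Time one generation of `new_tokens` tokens after `warmup_passes` untimed runs.

    generate_fn(new_tokens) must return only once generation has finished.
    On a GPU that usually means synchronizing the device inside generate_fn.
    """
    for _ in range(warmup_passes):
        generate_fn(new_tokens)  # untimed: absorbs one-off setup costs
    start = time.perf_counter()
    generate_fn(new_tokens)
    elapsed = time.perf_counter() - start
    return new_tokens / elapsed
```

With a real model, `generate_fn` could wrap a call such as `model.generate(**inputs, max_new_tokens=n, do_sample=False)`. Note that timing the whole call includes prefill time unless the wrapper handles prefill separately.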

Honest disclosure. This is not a real-time chat backend. The PackedFusedExperts loader dequantizes only the routed MoE experts per forward pass — that trades persistent VRAM (25 GB resident vs ~70 GB FP16) for per-token compute. At ~0.85 tok/s, the right use cases are long-context analysis, document processing, batch inference, and research workloads, not interactive chat.
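The memory-for-compute trade described above can be illustrated with a toy expert bank that keeps every expert in quantized form and dequantizes only the experts a token is routed to. This is a conceptual sketch, not the PackedFusedExperts implementation: the class, the per-expert scale, and the use of a single weight vector per expert are all simplifying assumptions.

```python
class LazyExpertBank:
    """Toy MoE layer: experts stay quantized and are dequantized on demand."""

    def __init__(self, quantized_experts):
        # {expert_id: (integer_codes, scale)}; this is what stays resident.
        self.quantized = quantized_experts
        self.dequant_calls = 0  # counts the per-token compute being paid

    def _dequantize(self, expert_id):
        codes, scale = self.quantized[expert_id]
        self.dequant_calls += 1
        return [c * scale for c in codes]

    def forward(self, x, routed):
        """x: input vector; routed: list of (expert_id, gate_weight) pairs."""
        out = 0.0
        for expert_id, gate in routed:
            weights = self._dequantize(expert_id)  # only routed experts are touched
            out += gate * sum(w * xi for w, xi in zip(weights, x))
        return out


bank = LazyExpertBank({
    0: ([1, 2], 0.5),
    1: ([3, 4], 1.0),
    2: ([5, 6], 2.0),
})
y = bank.forward([1.0, 1.0], routed=[(0, 0.5), (2, 0.5)])
```

Here only two of the three experts are ever expanded to floats, but that expansion is repeated on every forward pass, which is the source of the low tokens-per-second figure.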

Long-context perplexity (WikiText-103)

| Context | PPL | Windows averaged |
| --- | --- | --- |
| 4 K | 5.3420 | 4 |
| 16 K | 5.9521 | 4 |

Modest context-length growth consistent with FP16 baselines on this corpus. 32 K PPL not measured here (transformers' default loss path materializes full fp32 logits and OOMs on the long context — a measurement-tooling issue, not a model failure; 32 K throughput and ∞Bench retrieval at 125 K both pass on the same hardware).
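The scale of the logits problem is easy to estimate: a full fp32 logits tensor needs sequence length × vocabulary size × 4 bytes. The sketch below uses a placeholder vocabulary size of 150,000, which is an assumption for illustration only; read the real value from the model's config before relying on the numbers.

```python
def logits_bytes(seq_len, vocab_size, bytes_per_value=4):
    """Memory for one full [seq_len, vocab_size] logits tensor (fp32 by default)."""
    return seq_len * vocab_size * bytes_per_value


VOCAB_SIZE = 150_000  # hypothetical placeholder, NOT taken from the model config

for seq_len in (4096, 16384, 32768):
    gb = logits_bytes(seq_len, VOCAB_SIZE) / 1e9
    print(f"{seq_len:>6} tokens: {gb:.1f} GB of fp32 logits")
```

With this placeholder the 32 K case alone comes to about 19.7 GB, before any intermediate copies made during the loss computation, on top of the resident weights and KV cache.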


Calibration Corpus and Task-Aware Variants

This artifact (v4) uses WikiText-2 as the calibration corpus — chosen for general-purpose preservation across the benchmarks above. Calibration corpus matters: it determines which input directions are protected during compression, and protected directions correlate most strongly with downstream tasks that share input distribution with the calibration data.

GSM8K-CoT 5-shot is intentionally omitted from the headline table. We measured v4 GSM8K-CoT 5-shot at 57.67% under the matched harness, vs FP16 baseline at 81.00%. WikiText calibration biases protected dimensions toward general-text fidelity, which trades off math chain-of-thought reasoning. We disclose the gap rather than hide it.

A separate task-aware variant (MBPP calibration, internal codename Phase 1.5) recovers and exceeds FP16 on GSM8K-CoT under the same harness:

| Source | GSM8K-CoT 5-shot (N=300) |
| --- | --- |
| Qwen3.6-35B-A3B Base FP16 reference | 81.00% |
| v4 (WikiText calibration, this artifact) | 57.67% |
| Phase 1.5 (MBPP calibration) | 91.67% |

Phase 1.5 will be released as a separate artifact in the coming days (fraQtl/Qwen3.6-35B-A3B-compressed-math or similar) with its own model card and a full matched-FP16 sweep. We're publishing it as a study of task-aware compression: empirical evidence that the choice of calibration corpus matters as much as the quantization scheme.

Pick the right artifact for your workload:

  • General-purpose / code / long-context retrieval → v4 (this artifact)
  • Math-CoT-heavy → wait for the task-aware variant

Limitations

  • Single-seed sanity grade. Long-context numbers (passkey, KV-retrieval, VRAM) and MMLU are single-seed. Multi-seed re-verification in progress.
  • Long-context workload variance. Evaluate on your own long-context workload before production use.
  • English MMLU only. Multilingual performance has not been systematically re-measured after compression.
  • transformers version. Requires transformers >= 4.51 for the Qwen 3.6 MoE architecture. Older versions don't recognize the model type.
  • Platform. Linux x86_64 with CUDA validated. Untested on Apple Silicon / AMD ROCm.
  • Fine-tuning. Out of scope. The compressed artifact is inference-only.
  • Live KV-cache savings require the fraQtl runtime (early access). Without it, only the weight compression is active; KV cache stays FP16.
  • Runtime memory depends on context length, batch size, backend, and hardware.
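For a rough feel of how memory scales with context, the three peak-VRAM points in the Headline Results table can be turned into a per-token growth rate. This is a back-of-the-envelope sketch: it assumes growth is linear between the measured points and applies only to the hardware and workload used for that table.

```python
# (context length in K tokens, peak VRAM in GB) from the Headline Results table
points = [(16, 25.6), (64, 36.8), (128, 51.7)]


def slope(a, b):
    """GB of additional peak VRAM per 1K tokens between two measurements."""
    return (b[1] - a[1]) / (b[0] - a[0])


low_range = slope(points[0], points[1])    # 16K -> 64K
high_range = slope(points[1], points[2])   # 64K -> 128K

print(f"16K-64K:  {low_range:.3f} GB per 1K tokens")   # 0.233
print(f"64K-128K: {high_range:.3f} GB per 1K tokens")  # 0.233
```

Both segments give about 0.23 GB per 1K tokens, so growth was close to linear in context length over the measured range. Batch size, backend, and hardware will change the absolute numbers.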

License

The Qwen3.6-35B-A3B base model is licensed by Alibaba under its own license terms. fraQtl's compression artifacts and loader code in this repository are Copyright © 2026 fraQtl AI. The compressed weights are released for research and evaluation use; see LICENSE for terms. The fraQtl runtime package and full compression engine are proprietary; contact fraQtl for commercial licensing.


Measure Your Own Model

fraQtl maintains an open diagnostic tool for measuring model compressibility before deeper compression workflows:

pip install fraqtl-diagnostic
fraqtl analyze <any-HF-model>

Contact

See fraqtl.ai/contact or email contact@fraqtl.ai.
