Qwen3-ASR-0.6B – Full ONNX CPU

Run Qwen3-ASR on any CPU. No GPU. No PyTorch.

Self-contained ONNX pipeline for Qwen3-ASR-0.6B. Both the encoder and decoder run on ONNX Runtime, with an INT8-quantized decoder. Long audio is automatically split at silence boundaries – no manual chunking needed.

Highlights

  • Zero GPU, zero PyTorch – onnxruntime + librosa + tokenizers only
  • Real-time on an 8W CPU – RTF 0.71x with VAD chunking on an Intel N100
  • 3x realtime on desktop – RTF 0.32x (INT8)
  • Self-contained – all weights + tokenizer included
  • 30 languages – same multilingual coverage as the original model
  • Long audio support – automatic silence-based splitting with --chunk-sec
  • Bit-exact encoder – cosine similarity 1.000000 vs. the PyTorch reference

Quick Start

git clone https://huggingface.co/Daumee/Qwen3-ASR-0.6B-ONNX-CPU
cd Qwen3-ASR-0.6B-ONNX-CPU

python3 -m venv .venv && source .venv/bin/activate
pip install onnxruntime librosa soundfile tokenizers

# Short audio
python onnx_inference.py test_audio/librispeech_1_1089_1.wav

# Long audio (auto-chunked at silence)
python onnx_inference.py long_meeting.wav --chunk-sec 30

# Specify language + JSON output
python onnx_inference.py audio.wav --language Korean --json

Benchmarks

Intel N100 (4 cores, 8W TDP)

Short audio – 13 LibriSpeech test-clean samples, INT8 decoder:

| Audio | RTF | Encoder | Prefill | Decode | Tokens |
|-------|-----|---------|---------|--------|--------|
| 12.4s | 0.97x | 3.8s | 3.2s | 5.1s | 49 |
| 11.6s | 0.84x | 2.4s | 3.0s | 4.3s | 39 |
| 10.6s | 0.86x | 2.3s | 2.9s | 3.8s | 32 |
| 10.4s | 0.92x | 1.6s | 2.5s | 4.5s | 41 |
| 6.6s | 1.08x | 1.5s | 2.3s | 3.4s | 27 |
| 3.3s | 1.36x | 1.3s | 1.4s | 1.7s | 15 |

Clips longer than 10s run at realtime or faster. Decoder throughput: ~100 ms/token (INT8).
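
RTF here is total processing time divided by audio duration (lower is faster). A minimal sanity check against the second table row:

```python
def rtf(encoder_s: float, prefill_s: float, decode_s: float, audio_s: float) -> float:
    """Real-time factor: total processing time over audio duration (lower is faster)."""
    return (encoder_s + prefill_s + decode_s) / audio_s

# Second table row: 2.4s + 3.0s + 4.3s of work for 11.6s of audio
print(f"{rtf(2.4, 3.0, 4.3, 11.6):.2f}x")  # 0.84x
```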

Long audio – production deployment with VAD chunking (Docker, 2 threads):

| Input | Chunks | Avg RTF | Peak Memory |
|-------|--------|---------|-------------|
| 600s (10 min) | 19 | 0.71x | 5.7 GB |

Without chunking, 10-minute audio consumes 15 GB+ and gets OOM-killed. With 30s chunks, prefill cost is amortized and RTF drops well below 1.0x.
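
A rough way to see why chunking matters: eager all-to-all attention materializes score matrices that grow quadratically with sequence length. The arithmetic below is illustrative only – the encoder frame rate is an assumption based on the 16 kHz, 8x-downsample front end, and this counts only one layer's FP32 attention scores, not total process memory:

```python
def attn_scores_gb(seconds: float, fps: float = 12.5, heads: int = 14,
                   bytes_per: int = 4) -> float:
    """FP32 attention-score memory for one encoder layer at a given audio length.
    fps assumes ~100 mel frames/s reduced 8x by the conv front end."""
    seq = seconds * fps
    return heads * seq * seq * bytes_per / 1e9

print(attn_scores_gb(600))  # full 10 minutes in one pass: ~3.15 GB per layer
print(attn_scores_gb(30))   # one 30s chunk: ~0.008 GB per layer
```

Going from 30s chunks to a single 600s pass multiplies that per-layer matrix by 400x, which is consistent with the OOM behavior above.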

Desktop x86_64:

| Mode | RTF | Notes |
|------|-----|-------|
| ONNX FP32 | 0.63x | No quantization |
| ONNX INT8 | 0.32x | 3x realtime |

Architecture

| Stage | Runtime | Component | Details |
|-------|---------|-----------|---------|
| 1 | librosa | Mel spectrogram | 16 kHz → 128-bin log-mel |
| 2 | ONNX Runtime | encoder_conv.onnx | 3x Conv2D, 8x downsample |
| 3 | ONNX Runtime | encoder_transformer.onnx | 18 Transformer layers + Projector (896→1024) |
| 4 | NumPy | embed_tokens.bin | Fuse audio features into prompt |
| 5 | ONNX Runtime | decoder_init.int8.onnx | Prefill → logits + KV cache |
| 6 | ONNX Runtime | decoder_step.int8.onnx | Autoregressive decode until EOS |

| | Encoder | Decoder |
|---|---------|---------|
| Quantization | FP32 | Dynamic INT8 |
| Format | 2 models (conv + transformer) | 2 models (init + step) |
| KV Cache | – | ONNX I/O |
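
Stage 4 is plain NumPy: positions in the prompt reserved for audio are overwritten with the projected encoder output before prefill. A minimal sketch – the placeholder ID and shapes here are illustrative, not the repo's actual values:

```python
import numpy as np

def fuse_audio(prompt_ids, embed_table, audio_feats, audio_pad_id):
    """Replace audio-placeholder positions in the prompt embedding with encoder features."""
    embeds = embed_table[prompt_ids]        # (seq, d) token embeddings (fancy-index copy)
    mask = prompt_ids == audio_pad_id       # positions reserved for audio
    assert mask.sum() == len(audio_feats), "placeholder count must match audio frames"
    embeds[mask] = audio_feats              # splice in projected encoder output
    return embeds

# Toy example: vocab of 10, d=4, two audio frames
table = np.arange(40, dtype=np.float32).reshape(10, 4)
ids = np.array([1, 9, 9, 2])                # 9 = hypothetical audio placeholder id
feats = np.full((2, 4), -1.0, dtype=np.float32)
fused = fuse_audio(ids, table, feats, audio_pad_id=9)
print(fused.shape)  # (4, 4)
```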

Long Audio

Audio longer than 45s is automatically split at silence boundaries using RMS energy detection. No external VAD model needed.

python onnx_inference.py meeting.wav                  # 30s chunks (default)
python onnx_inference.py meeting.wav --chunk-sec 20   # 20s chunks, less memory

The split range scales with the target chunk length: min = target/2, max = target×1.5. Within that range, the split point is the silence frame nearest to the target length.
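
That selection logic can be sketched as follows. The frame size, hop, and silence threshold below are assumptions for illustration; the repo's RMS detector may use different values:

```python
import numpy as np

def find_split(audio, sr=16000, target_sec=30.0,
               frame=400, hop=160, silence_db=-40.0):
    """Pick a split sample in [target/2, target*1.5]: the silent frame
    (RMS below silence_db) closest to the target length."""
    lo, hi = int(0.5 * target_sec * sr), int(1.5 * target_sec * sr)
    if len(audio) <= hi:
        return len(audio)                      # short enough: no split needed
    seg = audio[lo:hi]
    n = (len(seg) - frame) // hop + 1
    idx = np.arange(n)[:, None] * hop + np.arange(frame)
    rms = np.sqrt((seg[idx] ** 2).mean(axis=1))
    db = 20 * np.log10(rms + 1e-12)
    silent = np.where(db < silence_db)[0]
    if silent.size == 0:
        return int(target_sec * sr)            # no silence found: split at target
    target_frame = (int(target_sec * sr) - lo) // hop
    best = silent[np.argmin(np.abs(silent - target_frame))]
    return lo + best * hop

# 60s tone with a one-second silent gap starting at 32s
sr = 16000
t = np.arange(60 * sr) / sr
audio = np.sin(2 * np.pi * 220 * t).astype(np.float32)
audio[32 * sr : 33 * sr] = 0.0
print(find_split(audio, sr) / sr)  # 32.0
```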

Files

| File | Size | Description |
|------|------|-------------|
| onnx_inference.py | – | Inference CLI (single file, no deps beyond pip) |
| tokenizer.json | 11 MB | Self-contained tokenizer |
| onnx_models/encoder_conv.onnx | 48 MB | Conv block |
| onnx_models/encoder_transformer.onnx | 669 MB | Transformer + Projector |
| onnx_models/decoder_init.int8.onnx | 571 MB | Prefill (INT8) |
| onnx_models/decoder_step.int8.onnx | 571 MB | Decode step (INT8) |
| onnx_models/embed_tokens.bin | 622 MB | Token embeddings |

Total: ~2.5 GB

Model

| Component | Params | Details |
|-----------|--------|---------|
| Audio Encoder | ~310M | d=896, 18 layers, 14 heads |
| Projector | ~1.7M | Linear 896→1024 |
| LLM Decoder | ~470M | d=1024, 28 layers, GQA 16Q/8KV |
| Total | ~782M | |

30 languages: Chinese, English, Cantonese, Japanese, Korean, Arabic, German, French, Spanish, Portuguese, Indonesian, Italian, Russian, Thai, Vietnamese, Turkish, Hindi, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Romanian, Hungarian, Macedonian

Technical Notes

  • Attention: the original cu_seqlens windowed attention only works with flash_attention_2. On CPU, eager (all-to-all) attention is used; the ONNX export matches this.
  • Weight tying: embed_tokens and lm_head share weights in the original model. The ONNX export separates them: embed_tokens.bin provides input embeddings, while lm_head is baked into the decoder.
  • KV cache: shape [num_layers, batch, kv_heads, seq_len, head_dim]; decoder_init outputs it, decoder_step extends it through ONNX I/O.
  • MRoPE: layout [24, 20, 20]. For ASR (no vision input), all three dimensions share the same position IDs.
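
The KV-cache growth pattern can be mimicked in NumPy (shapes follow the note above, with head_dim = 1024/16 = 64; whether decoder_step returns only the new slot or the full updated cache is an implementation detail assumed here):

```python
import numpy as np

L, B, H, D = 28, 1, 8, 64        # num_layers, batch, kv_heads, head_dim
prefill_len = 10

# decoder_init emits the full prefill-length cache in one shot
k_cache = np.zeros((L, B, H, prefill_len, D), dtype=np.float32)

for _ in range(3):               # three autoregressive decode steps
    new_k = np.zeros((L, B, H, 1, D), dtype=np.float32)   # decoder_step's new slot
    k_cache = np.concatenate([k_cache, new_k], axis=3)    # extend along seq_len

print(k_cache.shape)  # (28, 1, 8, 13, 64)
```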

Dependencies

onnxruntime
librosa
soundfile
tokenizers

No PyTorch. No transformers. No CUDA.

Acknowledgements

Decoder ONNX export architecture inspired by andrewleech/qwen3-asr-onnx.

License

Code: Apache 2.0. Model weights: original license.
