FlashSR-MLX-4bit
- speech-swift — Apple SDK
- soniqo.audio — website
- blog — blog
MLX port of FlashSR, a single-step distilled latent-diffusion audio super-resolution model, quantized to INT4 (weight-only) for on-device inference on Apple Silicon. It upsamples mono input at any sample rate to 48 kHz in one diffusion pass, using a specialised BigVGAN-style super-resolution vocoder. FlashSR is the distilled student of AudioSR.
Model
| Property | Value |
|---|---|
| Total parameters | 638 M (VAE 223 M + UNet 258 M + Vocoder 157 M) |
| Diffusion steps | 1 (distilled v-prediction student, cosine schedule, t=999) |
| Quantization | INT4 weight-only, group size 64, mode mlx_affine_flat |
| Format | MLX safetensors (single combined bundle) |
| Sample rate | 48 kHz mono out (any-rate mono in) |
| Frame length | 5.12 s (245 760 samples) per forward |
| Bundle size | 346 MB on disk |
| Source | jakeoneijk/FlashSR_Inference |
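The frame length in the table follows from 5.12 s × 48 000 Hz = 245 760 samples, so input longer than one frame has to be processed in pieces. The sketch below shows one simple way to do that with NumPy: zero-pad to a whole number of frames, reshape, and trim the padding afterwards. The helper names `split_into_frames` and `join_frames` are hypothetical and are not part of this repo; whether the FlashSR runtime already chunks long input internally is not stated here, and hard, non-overlapping frame boundaries may produce audible seams that a production pipeline would smooth with overlap.

```python
import numpy as np

SAMPLE_RATE = 48_000      # working/output rate from the table above
FRAME_SAMPLES = 245_760   # 5.12 s * 48 000 Hz


def split_into_frames(audio: np.ndarray, frame: int = FRAME_SAMPLES) -> np.ndarray:
    """Zero-pad a mono signal to a multiple of `frame`; return shape (n_frames, frame)."""
    if audio.ndim != 1:
        raise ValueError("expected a mono (1-D) signal")
    n_frames = max(1, -(-len(audio) // frame))  # ceiling division, at least one frame
    padded = np.zeros(n_frames * frame, dtype=np.float32)
    padded[: len(audio)] = audio
    return padded.reshape(n_frames, frame)


def join_frames(frames: np.ndarray, original_length: int) -> np.ndarray:
    """Concatenate processed frames and drop the zero padding."""
    return frames.reshape(-1)[:original_length]


# 12 s of audio at 48 kHz does not fit in two frames, so three are needed.
signal = np.random.default_rng(0).standard_normal(12 * SAMPLE_RATE).astype(np.float32)
frames = split_into_frames(signal)
restored = join_frames(frames, len(signal))
```

Each row of `frames` would be passed through the model separately, and `join_frames` would then be applied to the stacked outputs.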
Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 346 MB | INT4-quantized VAE + UNet + SR Vocoder weights |
| config.json | ~70 KB | Sub-model configs + quantization metadata + original shape table for dequant-on-load |
The three sub-models share one safetensors file with vae.*, ldm.*, voc.*
key prefixes. config.quantized_shapes records each tensor's pre-flatten shape
so mx.dequantize can rebuild conv weight tensors at load time.
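To illustrate why a shape table is needed, here is a minimal NumPy sketch of group-wise affine quantization on a flattened tensor: the weight is flattened into groups of 64 values, each group gets its own scale and bias, and the original shape is stored so the dequantized values can be reshaped into a conv weight again. This is an illustration only. It assumes that `mlx_affine_flat` means "flatten, then apply affine group quantization"; it does not reproduce MLX's bit packing or exact rounding, and the function names are hypothetical.

```python
import numpy as np

GROUP_SIZE = 64
BITS = 4


def quantize_flat(weight: np.ndarray):
    """Affine-quantize a tensor of any rank in groups of GROUP_SIZE values."""
    if weight.size % GROUP_SIZE != 0:
        raise ValueError("tensor size must be a multiple of the group size")
    shape = weight.shape                        # what a shape table would record
    groups = weight.astype(np.float32).reshape(-1, GROUP_SIZE)
    w_min = groups.min(axis=1, keepdims=True)   # per-group bias
    w_max = groups.max(axis=1, keepdims=True)
    levels = 2**BITS - 1                        # 15 steps for 4-bit codes
    scales = np.maximum(w_max - w_min, 1e-8) / levels
    codes = np.clip(np.round((groups - w_min) / scales), 0, levels).astype(np.uint8)
    return codes, scales, w_min, shape


def dequantize_flat(codes, scales, biases, shape):
    """Rebuild float weights and restore the recorded pre-flatten shape."""
    return (codes.astype(np.float32) * scales + biases).reshape(shape)


# A made-up 4-D conv weight; its layout is lost once flattened into groups.
rng = np.random.default_rng(0)
conv_weight = rng.standard_normal((128, 32, 3, 3)).astype(np.float32)
codes, scales, biases, shape = quantize_flat(conv_weight)
restored = dequantize_flat(codes, scales, biases, shape)
max_error = float(np.abs(restored - conv_weight).max())
```

Without the stored `shape`, the loader would only see a `(n_groups, 64)` matrix and could not tell which axes were channels and which were kernel taps.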
Performance (Apple Silicon, M-series, 5.12 s @ 48 kHz)
| Metric | Value |
|---|---|
| Real-time factor (wall / audio) | 1.10 |
| Load time | 0.17 s (dequant materialises bf16 weights once) |
| SNR vs FP16 reference | +29.4 dB |
| Cosine similarity vs FP16 | 0.9994 |
| Peak amplitude preservation | 1.000 |
INT4 is the recommended deployment variant: its +29.4 dB SNR against the FP16 reference is well above the music-perceptual threshold, and it is the smallest viable on-device bundle.
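The card does not state how the metrics above were computed, so the following is a sketch under common definitions: SNR as the ratio of reference energy to error energy, cosine similarity on the raw waveforms, peak preservation as a ratio of absolute maxima, and real-time factor as wall time over audio duration. All function names are hypothetical.

```python
import numpy as np


def snr_db(reference: np.ndarray, test: np.ndarray) -> float:
    """Signal-to-noise ratio in dB, treating (reference - test) as the noise."""
    noise = reference - test
    return float(10.0 * np.log10(np.sum(reference**2) / np.sum(noise**2)))


def cosine_similarity(reference: np.ndarray, test: np.ndarray) -> float:
    """Cosine of the angle between the two waveforms viewed as vectors."""
    return float(np.dot(reference, test)
                 / (np.linalg.norm(reference) * np.linalg.norm(test)))


def peak_ratio(reference: np.ndarray, test: np.ndarray) -> float:
    """Ratio of absolute peaks; 1.0 means the peak amplitude is preserved."""
    return float(np.max(np.abs(test)) / np.max(np.abs(reference)))


def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Values above 1.0 mean processing takes longer than the audio lasts."""
    return wall_seconds / audio_seconds


# Synthetic check: a 440 Hz tone and a copy attenuated by 10 %.
t = np.arange(48_000) / 48_000.0
reference = np.sin(2.0 * np.pi * 440.0 * t)
test = 0.9 * reference

snr = snr_db(reference, test)             # error is 0.1 * reference, i.e. 20 dB
cos = cosine_similarity(reference, test)  # pure scaling keeps the direction
peak = peak_ratio(reference, test)
```

Under these definitions the two fidelity figures in the table are consistent with each other: if the error is uncorrelated with the reference, the cosine similarity equals 1/sqrt(1 + 10^(-SNR/10)), which for 29.4 dB is about 0.9994.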
Usage
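import mlx.core as mx
import numpy as np
import scipy.io.wavfile as wf
from huggingface_hub import snapshot_download
from scipy.signal import resample_poly
# Download the model bundle (or reuse the locally cached copy).
bundle = snapshot_download("aufklarer/FlashSR-MLX-4bit")
# See https://github.com/soniqo/speech-swift for production usage.
# Toy Python demo (requires the matching MLX FlashSR runtime):
from flashsr import FlashSR # from this repo's export/
# Read the low-resolution input; the scaling below assumes 16-bit PCM.
sr, audio = wf.read("lr.wav")
audio = audio.astype(np.float32) / 32767.0
if audio.ndim == 2: # the model expects mono, so downmix multi-channel input
    audio = audio.mean(axis=1)
# Resample to the model's 48 kHz working rate.
audio_48 = resample_poly(audio, 48000, sr).astype(np.float32)
model = FlashSR(bundle)
hr = model(mx.array(audio_48), seed=42)
mx.eval(hr) # MLX is lazy; force the computation to run
# Clip to [-1, 1] and write 16-bit PCM at 48 kHz.
wf.write("hr.wav", 48000, (np.clip(np.array(hr), -1, 1) * 32767).astype(np.int16))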
Source
- Upstream: jakeoneijk/FlashSR_Inference
- Paper: FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation
- Teacher model: AudioSR
License
CC-BY-NC 4.0 — inherited from upstream FlashSR weights. Non-commercial use only.