Instruct-MusicGen-MLX-4bit
- speech-swift: Apple SDK
- soniqo.audio: website
- blog
MLX port of Instruct-MusicGen for text-instructed music editing. Built on MusicGen-large (3.3B params, 48-layer autoregressive transformer over EnCodec 32 kHz tokens) with cross-attention base weights from the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer CPTransformer adapter that injects the input audio's per-layer Q/K/V via prefix-attention into every self-attention block.
Inputs / Outputs
- Input: text instruction (e.g. "Music piece. Instruct: Only Drums.") + input audio (mono float32 @ 32 kHz, ≤ 10 s window)
- Output: edited audio (mono float32 @ 32 kHz, matches input length)
Performance (Apple Silicon, INT4)
| Metric | Value |
|---|---|
| Bundle size | ~2.2 GB on disk |
| RTF (wall / audio) | ~1.21 (for 5 s output @ 250 AR steps) |
| Peak RSS | ~3 GB |
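The RTF row can be turned into a rough wall-clock estimate. The sketch below assumes a 50 Hz token rate (which is what 250 AR steps for 5 s of output implies); the helper names are illustrative and not part of any loader.

```python
# Assumption: one AR step per EnCodec frame at 50 frames per second,
# consistent with "5 s output @ 250 AR steps" in the table above.
FRAME_RATE_HZ = 50


def ar_steps(seconds: float, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of autoregressive steps needed for `seconds` of audio."""
    return round(seconds * frame_rate_hz)


def wall_time(seconds: float, rtf: float) -> float:
    """Estimated wall-clock seconds, given RTF = wall time / audio duration."""
    return seconds * rtf


steps = ar_steps(5.0)
wall = wall_time(5.0, 1.21)
print(f"{steps} steps, about {wall:.2f} s of wall time")
```

An RTF above 1 means generation is slightly slower than real time on the measured hardware.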
Quality (CLAP score vs instruction)
Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a MusicGen-generated input clip, scoring the output audio against the instruction text:
| Variant | mean CLAP | "Only Drums" | "Only Piano" | "Remove Drums" | "Only Bass" |
|---|---|---|---|---|---|
| FP16 | +0.352 | +0.40 | +0.36 | +0.42 | +0.22 |
| INT4 (this bundle) | +0.311 | +0.45 | +0.17 | +0.40 | +0.21 |
| INT8 | +0.311 | +0.44 | +0.20 | +0.39 | +0.21 |
INT4 ≈ INT8 in CLAP, both within ~12 % of FP16. The "Only X" instructions generally produce a positive Δ vs the input clip's CLAP score, i.e. the edit moves the audio toward the instruction. "Only Bass" remains the hardest case.
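As a quick check of the "~12 %" figure, the snippet below recomputes the row means and the relative gap to FP16 from the numbers in the table. Only the table values are used; the per-instruction scores are rounded to two decimals, so the recomputed means can differ from the reported three-decimal means by a few thousandths.

```python
# Per-instruction CLAP scores copied from the table above, in column order:
# "Only Drums", "Only Piano", "Remove Drums", "Only Bass".
scores = {
    "FP16": [0.40, 0.36, 0.42, 0.22],
    "INT4": [0.45, 0.17, 0.40, 0.21],
    "INT8": [0.44, 0.20, 0.39, 0.21],
}
# Means as reported in the "mean CLAP" column.
reported_mean = {"FP16": 0.352, "INT4": 0.311, "INT8": 0.311}

# Means recomputed from the rounded per-instruction entries.
recomputed = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Relative drop of each variant's reported mean versus FP16.
rel_gap = {
    name: (reported_mean["FP16"] - mean) / reported_mean["FP16"]
    for name, mean in reported_mean.items()
}

for name in scores:
    print(f"{name}: recomputed {recomputed[name]:.4f}, gap {rel_gap[name]:.1%}")
```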
Usage (sketch)
from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")
# Production loader: https://github.com/soniqo/speech-swift
# Minimal MLX sketch:
# 1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
# 2. Construct InstructMusicGen MLX class
# 3. Replay quantization on linear projections (mlx.nn.quantize, bits=4)
# 4. Load weights from bundle/model.safetensors
# 5. audio = model.generate(text, input_audio, max_steps=250)
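Step 5 takes `input_audio`, which the Inputs / Outputs section specifies as mono float32 at 32 kHz within a 10 s window. A minimal NumPy sketch of that preparation follows. The function name is hypothetical, the input is assumed to be at 32 kHz already (no resampling is done), and cropping longer clips is an assumption: the card does not say whether the production loader crops or rejects them.

```python
import numpy as np

SAMPLE_RATE = 32_000  # Hz, per the Inputs / Outputs section
MAX_SECONDS = 10      # maximum input window, per the same section


def prepare_input_audio(audio: np.ndarray) -> np.ndarray:
    """Convert (samples,) or (channels, samples) audio to mono float32, at most 10 s.

    Hypothetical helper: assumes the audio is already sampled at 32 kHz.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=0)  # downmix channels to mono
    elif audio.ndim != 1:
        raise ValueError("expected shape (samples,) or (channels, samples)")
    # Assumption: keep only the first 10 s of longer clips.
    return audio[: SAMPLE_RATE * MAX_SECONDS]


# Example: 12 s of stereo silence becomes 10 s of mono float32.
stereo = np.zeros((2, SAMPLE_RATE * 12), dtype=np.float64)
mono = prepare_input_audio(stereo)
print(mono.shape, mono.dtype)
```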
Architecture details
text instruction ── T5-base ── [LoRA-merged] cross-attn ──────┐
                                                              │
input audio → EnCodec encode → CPTransformer → prefix Q/K/V ──┤
                                                              │
                             ┌────────────────────────────────┘
                             ▼
      MusicGen-large LM (48 AR layers, delay pattern)
                             │
                             ▼
                EnCodec decoder → 32 kHz wav
- CPTransformer: shares the base LM's transformer blocks (norm/self-attn/FFN) but adds learned `pos_emb(49, 501, 2048)`, `merge_linear[i]` per layer (2048 → 2048), and a zero-init `gate[i]` scalar.
- Prefix injection (per self-attn): second SDPA over the input audio's K/V, with `dt_q = prefix_q[step] + main_q`, gated add before `out_proj`: `attn = main_attn + dt_attn × gate[i]`.
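The gated prefix injection can be illustrated with a small NumPy sketch. This is a simplified reading of the bullet above, not the bundle's implementation: it covers a single head and a single decoding step, omits `merge_linear` and `out_proj`, and uses toy shapes. All function names are illustrative.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def sdpa(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention for one head: q is (Tq, d), k and v are (Tk, d)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def gated_prefix_step(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate, step):
    """One decoding step: main self-attention plus a gated second SDPA over the prefix."""
    main_attn = sdpa(main_q, main_k, main_v)   # attention over the LM's cached K/V
    dt_q = prefix_q[step] + main_q             # dt_q = prefix_q[step] + main_q
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)   # second SDPA over the input audio's K/V
    return main_attn + dt_attn * gate          # gated add (before out_proj)


rng = np.random.default_rng(0)
d, t, p = 8, 5, 7  # toy head dim, cached LM steps, prefix length
main_q = rng.standard_normal((1, d))
main_k = rng.standard_normal((t, d))
main_v = rng.standard_normal((t, d))
prefix_q, prefix_k, prefix_v = rng.standard_normal((3, p, d))

out_zero = gated_prefix_step(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate=0.0, step=2)
out_open = gated_prefix_step(main_q, main_k, main_v, prefix_q, prefix_k, prefix_v, gate=1.0, step=2)
```

With `gate = 0.0` the output equals the plain self-attention result, which is why a zero-initialized gate leaves the base LM's behavior unchanged until the gate is trained away from zero.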
Files
- `model.safetensors`: quantized LM (INT4 affine, group size 64) + adapter weights
- `config.json`: architecture + quantization + instruct metadata
- `compression_state_dict.bin`: passthrough of upstream EnCodec for offline init
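To make "INT4 affine, group size 64" concrete, here is a pure-Python sketch of the general affine scheme: each group of 64 weights stores 4-bit codes plus one scale and one bias. It illustrates the idea only and is not MLX's kernel or packing format. The storage estimate assumes one 16-bit scale and one 16-bit bias per group, which the card does not state.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group: w is approximated by code * scale + bias."""
    levels = (1 << bits) - 1                # 15 steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0       # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes, scale and bias."""
    return [c * scale + bias for c in codes]


def effective_bits(bits=4, group_size=64, meta_bits=16):
    """Bits per weight including one scale and one bias per group (assumed 16-bit each)."""
    return bits + 2 * meta_bits / group_size


group = [i / 63 - 0.5 for i in range(64)]   # toy group of 64 weights in [-0.5, 0.5]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

# Rough size if all 3.3B parameters were stored this way (an upper-level estimate).
approx_gb = 3.3e9 * effective_bits() / 8 / 1e9
print(f"{effective_bits()} bits/weight, about {approx_gb:.2f} GB, max error {max_err:.4f}")
```

Under these assumptions the quantized weights come to roughly 1.86 GB, which sits below the ~2.2 GB bundle size reported above.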
Source
- Upstream: ldzhangyx/instruct-MusicGen (CC-BY-NC, re-trained on public datasets)
- Paper: Instruct-MusicGen (arXiv:2405.18386)
- Base: facebook/musicgen-large
License
CC-BY-NC 4.0, inherited from MusicGen + the upstream checkpoint. Non-commercial use only.