Instruct-MusicGen-MLX-4bit

MLX port of Instruct-MusicGen for text-instructed music editing. Built on MusicGen-large (3.3B params, 48-layer autoregressive transformer over EnCodec 32 kHz tokens) with cross-attention base weights from the upstream checkpoint, LoRA-merged on Q/V (α/r = 2.0), plus a 48-layer CPTransformer adapter that injects the input audio's per-layer Q/K/V via prefix-attention into every self-attention block.

Inputs / Outputs

  • Input: text instruction (e.g. "Music piece. Instruct: Only Drums.") + input audio (mono float32 @ 32 kHz, ≤ 10 s window)
  • Output: edited audio (mono float32 @ 32 kHz, matches input length)
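
The input contract above (mono, float32, 32 kHz, at most a 10 s window) can be checked with a small helper before calling the model. This is a minimal NumPy sketch, not part of the bundle: `prepare_input` is a hypothetical name, and it assumes the audio is already at 32 kHz and shaped either (samples,) or (channels, samples).

```python
import numpy as np

SAMPLE_RATE = 32_000  # Hz, as stated in the model card
MAX_SECONDS = 10      # maximum input window, as stated in the model card


def prepare_input(audio, sample_rate):
    """Return mono float32 audio cropped to the 10 s window.

    Hypothetical helper: resampling is out of scope, so any rate other
    than 32 kHz is rejected instead of converted.
    """
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz input, got {sample_rate} Hz")
    audio = np.asarray(audio)
    if audio.ndim == 2:
        audio = audio.mean(axis=0)  # downmix (channels, samples) to mono
    elif audio.ndim != 1:
        raise ValueError("audio must be 1-D or 2-D (channels, samples)")
    audio = audio.astype(np.float32)
    return audio[: SAMPLE_RATE * MAX_SECONDS]  # keep at most 10 s
```

Cropping rather than raising on long inputs is a design choice of this sketch; a caller that needs the full clip would have to process it in windows.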

Performance (Apple Silicon, INT4)

Metric              Value
Bundle size         ~2.2 GB on disk
RTF (wall / audio)  ~1.21 (for 5 s output @ 250 AR steps)
Peak RSS            ~3 GB
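
To make the RTF row concrete, the arithmetic can be sketched as follows. The 50 steps-per-second rate is inferred from the table (250 AR steps for 5 s of audio) and ignores any extra steps the delay pattern may add; the function names are illustrative only.

```python
FRAME_RATE_HZ = 50  # inferred from the table: 250 AR steps / 5 s of audio


def audio_seconds_for_steps(ar_steps, frame_rate_hz=FRAME_RATE_HZ):
    """Seconds of audio produced by a given number of AR steps."""
    return ar_steps / frame_rate_hz


def real_time_factor(wall_seconds, audio_seconds):
    """RTF as defined in the table: wall-clock time divided by audio duration."""
    return wall_seconds / audio_seconds


audio_s = audio_seconds_for_steps(250)  # 5 s of audio
wall_s = 1.21 * audio_s                 # roughly 6 s of wall-clock time at RTF 1.21
```

An RTF above 1 means generation is slower than real time: at ~1.21, a 5 s edit takes about 6 s to produce.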

Quality (CLAP score vs instruction)

Mean CLAP score (laion/clap-htsat-unfused) across 4 edit instructions on a MusicGen-generated input clip, measured between the output audio and the instruction text:

Variant             Mean CLAP  "Only Drums"  "Only Piano"  "Remove Drums"  "Only Bass"
FP16                +0.352     +0.40         +0.36         +0.42           +0.22
INT4 (this bundle)  +0.311     +0.45         +0.17         +0.40           +0.21
INT8                +0.311     +0.44         +0.20         +0.39           +0.21

INT4 ≈ INT8 on mean CLAP (both +0.311), about 12 % below FP16 (+0.352). The "Only X" instructions generally produce a positive Δ vs the input clip's CLAP score, i.e. the edit moves the audio toward the instruction. "Only Bass" remains the hardest case.
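
The "~12 %" figure can be reproduced from the reported means. The sketch below only re-does the arithmetic on the numbers in the table; it does not compute CLAP scores, and the variable and function names are illustrative.

```python
# Per-instruction scores copied from the table (rounded to two decimals there).
scores = {
    "FP16": [0.40, 0.36, 0.42, 0.22],
    "INT4": [0.45, 0.17, 0.40, 0.21],
    "INT8": [0.44, 0.20, 0.39, 0.21],
}
# Mean CLAP column as reported in the table.
reported_mean = {"FP16": 0.352, "INT4": 0.311, "INT8": 0.311}


def relative_drop(variant, baseline="FP16"):
    """Fractional drop of a variant's reported mean relative to the baseline."""
    return 1.0 - reported_mean[variant] / reported_mean[baseline]


def mean_from_columns(variant):
    """Mean recomputed from the rounded per-instruction columns."""
    values = scores[variant]
    return sum(values) / len(values)


for name in ("INT4", "INT8"):
    print(f"{name}: {relative_drop(name):.1%} below FP16")
```

Means recomputed from the rounded columns differ slightly from the reported means, which is expected given the two-decimal rounding.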

Usage (sketch)

from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Instruct-MusicGen-MLX-4bit")

# Production loader: https://github.com/soniqo/speech-swift
# Minimal MLX sketch:
#   1. Read bundle/config.json (HF MusicGen config + cp_transformer metadata)
#   2. Construct InstructMusicGen MLX class
#   3. Replay quantization on linear projections (mlx.nn.quantize, bits=4)
#   4. Load weights from bundle/model.safetensors
#   5. audio = model.generate(text, input_audio, max_steps=250)
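
Step 1 of the sketch above needs only the standard library. A minimal example, assuming the bundle directory contains a `config.json` as listed under Files; `load_bundle_config` is a hypothetical helper name, and the keys inside the config are not documented here, so the code does not assume any.

```python
import json
from pathlib import Path


def load_bundle_config(bundle_dir):
    """Read config.json from a downloaded bundle directory and return it as a dict."""
    config_path = Path(bundle_dir) / "config.json"
    with config_path.open("r", encoding="utf-8") as f:
        return json.load(f)


# Usage, with `bundle` being the path returned by snapshot_download:
#   config = load_bundle_config(bundle)
#   print(sorted(config))  # inspect the top-level keys before building the model
```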

Architecture details

text instruction ── T5-base ── [LoRA-merged] cross-attn ──┐
                                                          │
input audio ─ EnCodec encode ─ CPTransformer ─ prefix Q/K/V ──┐
                                                              │
                                       ┌──────────────────────┘
                                       ▼
              MusicGen-large LM (48 AR layers, delay pattern)
                                       │
                                       ▼
                              EnCodec decoder → 32 kHz wav
  • CPTransformer: shares the base LM's transformer blocks (norm/self-attn/FFN) but adds learned pos_emb (49, 501, 2048), merge_linear[i] per layer (2048 → 2048), and a zero-init gate[i] scalar.
  • Prefix injection (per self-attn): second SDPA over the input audio's K/V, with dt_q = prefix_q[step] + main_q, gated add before out_proj: attn = main_attn + dt_attn × gate[i].
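
The gated prefix injection described above can be sketched in NumPy. This is a single-head, single-step illustration under stated assumptions, not the bundle's implementation: there is no causal masking, `prefix_q` stands for the already-selected slice `prefix_q[step]`, and all function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sdpa(q, k, v):
    """Scaled dot-product attention for (tokens, dim) arrays, without masking."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def gated_prefix_attention(main_q, main_k, main_v,
                           prefix_q, prefix_k, prefix_v, gate):
    """attn = main_attn + dt_attn * gate, as in the bullet above."""
    main_attn = sdpa(main_q, main_k, main_v)   # regular self-attention
    dt_q = prefix_q + main_q                   # query for the second SDPA
    dt_attn = sdpa(dt_q, prefix_k, prefix_v)   # attention over the input audio's K/V
    return main_attn + dt_attn * gate          # gated add, applied before out_proj
```

Because the gate is zero-initialized, the adapter starts out as a no-op: with `gate = 0.0` the output equals plain self-attention, and the input audio only influences the result as the gate moves away from zero.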

Files

  • model.safetensors: quantized LM (INT4 affine, group size 64) + adapter weights
  • config.json: architecture + quantization + instruct metadata
  • compression_state_dict.bin: passthrough of upstream EnCodec for offline init
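
For readers unfamiliar with "INT4 affine, group size 64", the idea can be illustrated in NumPy: each group of 64 weights gets its own scale and bias, and each weight is stored as a 4-bit integer. This is a generic sketch of affine group quantization with hypothetical function names; the exact rounding and bit-packing used by MLX may differ.

```python
import numpy as np


def quantize_affine(w, bits=4, group_size=64):
    """Quantize a 1-D float array per group so that w ≈ scale * q + bias."""
    groups = np.asarray(w, dtype=np.float32).reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1                      # 15 steps for 4 bits
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)    # guard against constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min                      # bias is the per-group minimum


def dequantize_affine(q, scale, bias):
    """Reconstruct approximate float weights from integer codes."""
    return (q.astype(np.float32) * scale + bias).reshape(-1)
```

With this scheme the reconstruction error of each weight is bounded by half of its group's scale, which is why smaller groups give better accuracy at the cost of storing more scales and biases.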

Source

License

CC-BY-NC 4.0, inherited from MusicGen and the upstream checkpoint. Non-commercial use only.
