# S2 Pro GGUF
**ALPHA / EXPERIMENTAL:** The inference engine (s2.cpp) is an early-stage, community-built project. Expect rough edges and breaking changes. Not production-ready.
GGUF-quantized weights of Fish Audio S2 Pro, a high-quality multilingual text-to-speech model with voice cloning support, packaged for local inference with s2.cpp, a pure C++/GGML engine with no Python dependency.
License: Fish Audio Research License. Free for research and non-commercial use; commercial use requires a separate license from Fish Audio. See LICENSE.md and fish.audio.
## Files
| File | Size |
|---|---|
| s2-pro-f16.gguf | 9.3 GB |
| s2-pro-q8_0.gguf | 5.3 GB |
| s2-pro-q6_k.gguf | 4.3 GB |
| s2-pro-q5_k_m.gguf | 3.8 GB |
| s2-pro-q4_k_m.gguf | 3.4 GB |
| s2-pro-q3_k.gguf | 2.9 GB |
| s2-pro-q2_k.gguf | 2.4 GB |
| tokenizer.json | 12 MB |
All GGUF files contain both the transformer weights and the audio codec in a single file.
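Because each quant ships as a single large file, the most common failure mode is a truncated or corrupted download. One quick sanity check is the file's magic number: every valid GGUF file begins with the 4 ASCII bytes `GGUF`. A minimal sketch (not part of s2.cpp; the filename is just an example):

```python
# Sketch: check the GGUF magic before trying to load a file.
# All GGUF files start with the 4-byte ASCII magic "GGUF".
def looks_like_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example: looks_like_gguf("s2-pro-q6_k.gguf")
```

This only catches gross corruption at the start of the file; for a full integrity check, compare the file's checksum against the one shown on the Hugging Face file page.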
## Requirements
- GPU with Vulkan support (AMD/NVIDIA/Intel), or a CPU with enough RAM to hold the chosen quant
- s2.cpp built from source (C++17 + CMake)
### VRAM guide
| VRAM | Recommended |
|---|---|
| ≥ 8 GB | q8_0 |
| 6–8 GB | q6_k |
| 4–6 GB | q5_k_m |
| 3–4 GB | q4_k_m |
| < 3 GB | q3_k / q2_k (quality degrades) |
| CPU only | q4_k_m or lower (slow) |
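The table above boils down to a simple threshold lookup. An illustrative helper that encodes the same recommendations (not part of s2.cpp):

```python
def pick_quant(vram_gb: float) -> str:
    """Return the quant recommended by the VRAM guide above (illustrative only)."""
    if vram_gb >= 8:
        return "q8_0"
    if vram_gb >= 6:
        return "q6_k"
    if vram_gb >= 4:
        return "q5_k_m"
    if vram_gb >= 3:
        return "q4_k_m"
    return "q3_k"  # or q2_k below 3 GB; quality degrades
```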
## Quick start

```shell
# Clone and build s2.cpp
git clone --recurse-submodules https://github.com/rodrigomatta/s2.cpp.git
cd s2.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_VULKAN=ON
cmake --build build --parallel $(nproc)

# Download model files (example with huggingface-cli)
huggingface-cli download rodrigomt/s2-pro-gguf s2-pro-q6_k.gguf tokenizer.json --local-dir .

# Synthesize
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Hello, this is a test." \
  -v 0 \
  -o output.wav
```
## Voice cloning

```shell
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -pa reference.wav \
  -pt "Transcript of the reference audio." \
  -text "Text to synthesize in that voice." \
  -v 0 \
  -o output.wav
```

Reference audio: 5–30 seconds, clean recording, WAV or MP3.
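For WAV references, the 5–30 second window can be verified before synthesis with Python's standard-library `wave` module. A hypothetical pre-flight helper (not part of s2.cpp; MP3 files would need a different reader):

```python
import wave

def reference_duration_ok(path: str, min_s: float = 5.0, max_s: float = 30.0) -> bool:
    """True if the WAV file's duration falls inside the recommended window."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return min_s <= duration <= max_s
```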
## Quantization
The GGUF files in this repository were re-quantized using a modified llama-quantize tool with improved quantization routines.
## Model architecture

S2 Pro uses a Dual-AR architecture (~4.56B parameters total):
- Slow-AR: 36-layer Qwen3 transformer (4.13B params), GQA (32 heads / 8 KV heads), RoPE 1M base, persistent KV cache
- Fast-AR: 4-layer transformer (0.42B params) generating 10 acoustic codebook tokens per semantic step
- Audio codec: convolutional RVQ encoder/decoder (10 codebooks × 4096 entries)
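A quick sanity check of the figures above: 10 codebooks of 4096 entries means each semantic step carries 10 × log2(4096) = 120 bits of acoustic code, and the two transformers sum to 4.55B parameters (the small gap to the quoted ~4.56B total presumably sits in the codec):

```python
import math

# Codec: 10 RVQ codebooks x 4096 entries -> bits of acoustic code per semantic step
codebooks, entries = 10, 4096
bits_per_step = codebooks * math.log2(entries)  # 10 * 12 = 120 bits

# Parameter counts (billions) from the list above
slow_ar_b, fast_ar_b = 4.13, 0.42
total_b = slow_ar_b + fast_ar_b  # 4.55B of the quoted ~4.56B total
```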
## License
The model weights are licensed under the Fish Audio Research License.
- Research and non-commercial use: free under this license
- Commercial use: requires a separate written license from Fish Audio
Attribution: "This model is licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved."
Full terms: LICENSE.md · Commercial licensing: fish.audio · business@fish.audio
Base model: fishaudio/s2-pro