S2 Pro β€” GGUF

ALPHA β€” EXPERIMENTAL The inference engine (s2.cpp) is an early-stage, community-built project. Expect rough edges and breaking changes. Not production-ready.

GGUF-quantized weights of Fish Audio S2 Pro, a high-quality multilingual text-to-speech model with voice cloning support, packaged for local inference with s2.cpp β€” a pure C++/GGML engine with no Python dependency.

License: Fish Audio Research License β€” free for research and non-commercial use. Commercial use requires a separate license from Fish Audio. See LICENSE.md and fish.audio.


Files

File                 Size
s2-pro-f16.gguf      9.3 GB
s2-pro-q8_0.gguf     5.3 GB
s2-pro-q6_k.gguf     4.3 GB
s2-pro-q5_k_m.gguf   3.8 GB
s2-pro-q4_k_m.gguf   3.4 GB
s2-pro-q3_k.gguf     2.9 GB
s2-pro-q2_k.gguf     2.4 GB
tokenizer.json       12 MB

All GGUF files contain both the transformer weights and the audio codec in a single file.
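Since everything ships in a single file, a quick way to sanity-check a download is to parse the fixed GGUF header (magic, version, tensor count, metadata count). The layout below follows the public GGUF specification, not anything specific to s2.cpp:

```python
import struct

def read_gguf_header(path):
    """Parse the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))     # little-endian uint32
        n_tensors, = struct.unpack("<Q", f.read(8))   # uint64
        n_kv, = struct.unpack("<Q", f.read(8))        # uint64
    return {"version": version, "tensors": n_tensors, "kv_pairs": n_kv}

# Demo on a synthetic header (real files begin the same way):
with open("demo.gguf", "wb") as f:
    f.write(b"GGUF" + struct.pack("<IQQ", 3, 2, 5))
print(read_gguf_header("demo.gguf"))   # {'version': 3, 'tensors': 2, 'kv_pairs': 5}
```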


Requirements

  • GPU with Vulkan support (AMD/NVIDIA/Intel), or CPU-only with enough RAM to hold the chosen quantization
  • s2.cpp built from source (C++17 + CMake)

VRAM guide

VRAM        Recommended
β‰₯ 8 GB      q8_0
6–8 GB      q6_k
4–6 GB      q5_k_m
3–4 GB      q4_k_m
< 3 GB      q3_k / q2_k (quality degrades)
CPU only    q4_k_m or lower (slow)
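The table above is easy to encode as a small helper if you script your setup. The thresholds are copied directly from the guide; the function name is ours, not part of s2.cpp:

```python
def recommend_quant(vram_gb=None):
    """Map available VRAM (GB) to the quant suggested in the table; None means CPU only."""
    if vram_gb is None:
        return "q4_k_m"     # CPU only: q4_k_m or lower (slow)
    if vram_gb >= 8:
        return "q8_0"
    if vram_gb >= 6:
        return "q6_k"
    if vram_gb >= 4:
        return "q5_k_m"
    if vram_gb >= 3:
        return "q4_k_m"
    return "q3_k"           # or q2_k; quality degrades below 3 GB

print(recommend_quant(8))   # q8_0
print(recommend_quant(5))   # q5_k_m
```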

Quick start

# Clone and build s2.cpp
git clone --recurse-submodules https://github.com/rodrigomatta/s2.cpp.git
cd s2.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_VULKAN=ON
cmake --build build --parallel $(nproc)

# Download model files (example with huggingface-cli)
huggingface-cli download rodrigomt/s2-pro-gguf s2-pro-q6_k.gguf tokenizer.json --local-dir .

# Synthesize
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Hello, this is a test." \
  -v 0 \
  -o output.wav

Voice cloning

./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -pa reference.wav \
  -pt "Transcript of the reference audio." \
  -text "Text to synthesize in that voice." \
  -v 0 \
  -o output.wav
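If you drive the CLI from a script, the two invocations above can be assembled like this. The flag names are copied verbatim from the examples in this README; whether they stay stable across s2.cpp versions is up to that project:

```python
def s2_args(model, tokenizer, text, voice=0, out="output.wav",
            ref_audio=None, ref_text=None):
    """Assemble the ./build/s2 argument list used in the examples above."""
    args = ["./build/s2", "-m", model, "-t", tokenizer]
    if ref_audio is not None:                     # voice-cloning variant
        args += ["-pa", ref_audio, "-pt", ref_text or ""]
    args += ["-text", text, "-v", str(voice), "-o", out]
    return args

cmd = s2_args("s2-pro-q6_k.gguf", "tokenizer.json",
              "Text to synthesize in that voice.",
              ref_audio="reference.wav",
              ref_text="Transcript of the reference audio.")
print(" ".join(cmd))
```

Once s2.cpp is built, run it with `subprocess.run(cmd, check=True)`.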

Reference audio: 5–30 seconds, clean recording, WAV or MP3.
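A quick check that a reference clip meets the 5–30 second guideline can be done with the Python standard library (WAV only; MP3 duration would need an external tool such as ffprobe):

```python
import wave

def reference_ok(path, lo=5.0, hi=30.0):
    """Return (duration_seconds, within_range) for a WAV reference clip."""
    with wave.open(path, "rb") as w:
        dur = w.getnframes() / w.getframerate()
    return dur, lo <= dur <= hi

# Demo: write 10 seconds of 16 kHz mono silence and check it.
with wave.open("ref_demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 10)
print(reference_ok("ref_demo.wav"))   # (10.0, True)
```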


Quantization

The GGUF files in this repository were produced with a modified build of llama.cpp's llama-quantize tool that includes improved quantization routines.


Model architecture

S2 Pro uses a Dual-AR architecture (~4.56B parameters total):

  • Slow-AR β€” 36-layer Qwen3 transformer (4.13B params), GQA (32 heads / 8 KV heads), RoPE 1M base, persistent KV cache
  • Fast-AR β€” 4-layer transformer (0.42B params) generating 10 acoustic codebook tokens per semantic step
  • Audio codec β€” convolutional RVQ encoder/decoder (10 codebooks Γ— 4096 entries)
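The published numbers roughly add up, which is easy to verify. Note that attributing the remaining ~0.01B parameters to the codec and embeddings is our inference, not an official breakdown:

```python
slow_ar = 4.13e9      # 36-layer Qwen3 transformer
fast_ar = 0.42e9      # 4-layer acoustic transformer
total   = 4.56e9      # figure quoted for the whole model
print(total - (slow_ar + fast_ar))   # roughly 1e7 params left for codec etc.

# Each semantic step produces 10 acoustic tokens, one per codebook,
# each drawn from 4096 entries:
entries = 10 * 4096
print(entries)        # 40960
```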

License

The model weights are licensed under the Fish Audio Research License.

  • Research and non-commercial use: free under this license
  • Commercial use: requires a separate written license from Fish Audio

Attribution: "This model is licensed under the Fish Audio Research License, Copyright Β© 39 AI, INC. All Rights Reserved."

Full terms: LICENSE.md Β· Commercial: fish.audio Β· business@fish.audio
