Sori-4B-Base

A Korean speech-to-text model that combines Qwen3-Omni's Audio Transformer (AuT) with the Qwen3-4B LLM.

GitHub: SeungyounShin/Sori

Architecture

Audio (16kHz) -> Mel Spectrogram (128 bins) -> Audio Encoder (647M, AuT) -> audio_proj MLP -> Qwen3-4B LLM -> Text

| Component | Params | Source |
|---|---|---|
| Audio Encoder | 647M | Qwen3-Omni AuT (pretrained on 7M+ hours) |
| audio_proj | 12M | 2-layer MLP (2048 -> 2560 -> 2560), trained from scratch |
| LLM | 4B | Qwen3-4B-Instruct (frozen in Stage 1) |
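
The 12M figure for audio_proj can be sanity-checked from the layer sizes listed above. The sketch below assumes two plain linear layers that each carry a bias vector; the bias and activation settings are not stated here, so treat them as assumptions. `linear_params` is a hypothetical helper, not part of the Sori code.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


# Layer sizes from the component listing: 2048 -> 2560 -> 2560.
# Assumption: both layers have a bias vector.
audio_proj_params = linear_params(2048, 2560) + linear_params(2560, 2560)

print(f"audio_proj: {audio_proj_params:,} params")  # 11,801,600, i.e. about 12M
```

Dropping the bias terms changes the total by only 5,120 parameters, so the rounded 12M figure holds either way.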

Quick Start

git clone https://github.com/SeungyounShin/Sori.git
cd Sori
pip install torch torchaudio transformers peft accelerate safetensors

Transcription

from modeling_sori_speech import SoriSpeechForConditionalGeneration
from processing_sori_speech import SoriSpeechProcessor
from sori_speech_utils import process_mm_info
import torch

model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B-Base", torch_dtype=torch.bfloat16, device_map="cuda",
    trust_remote_code=True,
)
model.eval()
processor = SoriSpeechProcessor.from_pretrained("Seungyoun/Sori-4B")

conversation = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe the audio."}
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True)
# Move tensors to the model's device; cast float32 features to the model dtype (bfloat16 here).
inputs = {
    k: v.to(model.device).to(model.dtype) if v.dtype == torch.float32 else v.to(model.device)
    for k, v in inputs.items()
    if isinstance(v, torch.Tensor)
}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

print(processor.decode(output[0], skip_special_tokens=True).split("assistant")[-1].strip())

Voice-Driven Tool Calling

Since the backbone LLM (Qwen3-4B-Instruct) natively supports tool use, Sori can understand spoken Korean and trigger tool calls -- even at Stage 1 with a frozen LLM.

SYSTEM_PROMPT = """You are a helpful voice assistant that can understand Korean speech and respond helpfully.
When the user asks a question, answer it directly. If you need external information, use the available tools.

# Tools

## get_weather
Get current weather information for a city.
Parameters:
- city (string, required): The city name (e.g. "서울", "부산")

## search_web
Search the web for information.
Parameters:
- query (string, required): The search query

To use a tool, respond with:
<tool_call>
{"name": "tool_name", "arguments": {"param": "value"}}
</tool_call>
"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "audio", "audio": "weather.mp3"},  # "혹시 지금 서울 날씨가 어떻게돼?" ("What's the weather like in Seoul right now?")
    ]},
]

# ... same inference code as above ...

Output:

You seem to be asking about the current weather in Seoul. Let me check that for you.

<tool_call>
{"name": "get_weather", "arguments": {"city": "서울"}}
</tool_call>

The model correctly understands the spoken Korean question about Seoul's weather and generates the appropriate get_weather tool call. See inference_if.py for the full example.
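
Downstream code still has to pull the call out of the generated text and route it to a handler. The following is a minimal sketch of that step, assuming the `<tool_call>` format from the system prompt above; `extract_tool_calls`, the `TOOLS` registry, and the `get_weather` stub are hypothetical names for illustration and are not taken from inference_if.py.

```python
import json
import re

# Matches the payload between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return every well-formed JSON tool call found in the model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
    return calls


def get_weather(city: str) -> str:
    # Hypothetical stub; a real handler would query a weather service.
    return f"(weather lookup for {city} goes here)"


TOOLS = {"get_weather": get_weather}

reply = """You seem to be asking about the current weather in Seoul. Let me check that for you.

<tool_call>
{"name": "get_weather", "arguments": {"city": "서울"}}
</tool_call>"""

for call in extract_tool_calls(reply):
    handler = TOOLS.get(call.get("name"))
    if handler is not None:
        print(handler(**call.get("arguments", {})))
```

Malformed payloads are skipped rather than raised, which seems a reasonable default for a Stage 1 checkpoint whose output format is not guaranteed.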

Sample Results (Stage 1, Step 6000)

| Expected | Predicted |
|---|---|
| 먼저 코로나 확진자 현황부터 짚어보죠. | 먼저 코로나 확진자 현황부터 집히 보죠. |
| 네 안녕하세요. | 네 안녕하세요? |
| 문의드릴 게 있어서 전화 드렸어요. | 이렇게 들을게 있어서 전화 드렸어요. |
| 네 고객님 혹시 개명 전 이름과 전화번호 말씀 부탁드립니다 | 네, 고객님 혹시 개명 전 이름과 전화번호 말씀 부탁드립니다. |

This is a Stage 1 (alignment-only) checkpoint: only the 12M-parameter audio_proj MLP was trained, and the LLM stayed frozen. Stage 2 (LoRA fine-tuning) is expected to improve accuracy.
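
One way to quantify the gap between the Expected and Predicted columns is character error rate (CER), a common choice for Korean ASR. The metric used to evaluate this checkpoint is not stated here, so the sketch below is only an illustration; `edit_distance` and `cer` are hypothetical helpers, and spaces and punctuation are counted as characters.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Second sample row: only the final punctuation mark differs.
print(cer("네 안녕하세요.", "네 안녕하세요?"))  # 0.125 (1 substitution over 8 characters)
```

Evaluation scripts often strip punctuation and normalize whitespace before scoring; under that convention this row would count as an exact match.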

Training

Two-Stage Approach

Following LLaVA and Qwen3-Omni's methodology:

Stage 1 - Alignment (this release): Train only audio_proj to map audio features into the LLM's embedding space.

| Setting | Value |
|---|---|
| Trainable params | audio_proj only (12M / 4.7B = 0.25%) |
| Audio Encoder | Frozen |
| LLM | Frozen |
| Learning rate | 1e-4 |
| Effective batch size | 1024 (8 x 8 GPUs x 16 accum) |
| Loss | Cross-entropy with label masking |
| Steps | 6,000 |
| Hardware | 8x H100 80GB |
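
The "cross-entropy with label masking" entry can be illustrated in plain Python. This sketch assumes the usual convention of supervising only the transcript tokens and marking prompt positions (system text, audio placeholders) with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. Which positions Sori actually masks is not spelled out here, and `build_labels` and `masked_cross_entropy` are hypothetical names.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch's default ignore_index)


def build_labels(input_ids, prompt_len):
    """Copy input_ids, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over unmasked positions.

    logits[t] is assumed to be already aligned with labels[t]; the
    shift-by-one of causal language modeling is left to the caller.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


input_ids = [3, 1, 3, 2, 0]                      # toy ids: 3 prompt tokens, 2 transcript tokens
labels = build_labels(input_ids, prompt_len=3)   # [-100, -100, -100, 2, 0]
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * len(labels)
print(labels, masked_cross_entropy(uniform_logits, labels))  # loss equals log(4) for uniform logits
```

Because masked positions contribute nothing to the mean, the gradient reaching audio_proj comes only from how well the transcript tokens are predicted.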

Stage 2 - Fine-tuning (planned): Unfreeze LLM with LoRA (r=16, alpha=32) + continue training audio_proj.
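
For a sense of what r=16 and alpha=32 imply: in the standard LoRA formulation a frozen weight W is used as W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r. The sketch below only does the arithmetic. The 2560 x 2560 layer shape is a hypothetical example based on the 2560-dimensional embedding space, and which LLM modules Stage 2 will adapt is not specified here.

```python
def lora_scale(r: int, alpha: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / r


def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one d_out x d_in weight (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


scale = lora_scale(r=16, alpha=32)            # 2.0
extra = lora_extra_params(2560, 2560, r=16)   # 81,920, versus 6,553,600 in the full matrix
print(scale, extra)
```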

Dataset

4.1M Korean speech samples:

| Dataset | Samples | Ratio |
|---|---|---|
| Zeroth-STT-Korean | 102K | 2.5% |
| AIHub 012 - Counseling Speech | 831K | 20.0% |
| AIHub 71592 - Job Interview | 76K | 1.8% |
| AIHub 71481 - In-depth Interview | 802K | 19.3% |
| AIHub 464 - Meeting Speech | 2.3M | 56.3% |

Loss Curve

[Figure: Training Loss]

Two distinct phases:

  • Steps 0-2500: Loss plateaus around 3.0 as audio_proj learns initial mapping
  • Steps 2500+: Sharp drop to ~1.0 as alignment clicks into place; transcription quality jumps dramatically

Key Technical Detail

The mel spectrogram must match Qwen3-Omni's WhisperFeatureExtractor exactly (Slaney mel scale + log10 + normalization). A torchaudio-default pipeline (HTK mel scale + natural log) produces features that the pretrained AuT encoder cannot interpret; this mismatch was the root cause of the initial training failure.
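
The size of the mismatch is easy to see with scalar formulas. The sketch below uses the standard HTK and Slaney Hz-to-mel definitions and the log10, clamp, and rescale steps of Whisper-style preprocessing. It is a simplified illustration with hypothetical helper names, not the feature extractor itself, which also builds the filter bank and runs the STFT.

```python
import math


def hz_to_mel_htk(f: float) -> float:
    """HTK mel scale: logarithmic over the whole frequency range."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def hz_to_mel_slaney(f: float) -> float:
    """Slaney mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0                # Hz per mel in the linear region
    min_log_hz = 1000.0
    min_log_mel = min_log_hz / f_sp   # about 15 mel at 1 kHz
    logstep = math.log(6.4) / 27.0
    if f < min_log_hz:
        return f / f_sp
    return min_log_mel + math.log(f / min_log_hz) / logstep


def whisper_log_norm(mel_power):
    """log10, clamp to within 8 of the maximum, then rescale with (x + 4) / 4."""
    log_spec = [math.log10(max(v, 1e-10)) for v in mel_power]
    floor = max(log_spec) - 8.0
    return [(max(v, floor) + 4.0) / 4.0 for v in log_spec]


# The two scales assign very different mel values to the same frequency.
for hz in (500.0, 1000.0, 4000.0):
    print(hz, round(hz_to_mel_htk(hz), 1), round(hz_to_mel_slaney(hz), 1))

print(whisper_log_norm([1.0, 1e-10]))  # [1.0, -1.0]
```

Since the scales disagree on where every filter sits, and a natural log differs from log10 by a factor of about 2.3 before any normalization, an encoder pretrained on one recipe receives out-of-distribution inputs from the other.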

License

Apache 2.0
