Whisper-large-v3 Slovenian

Fine-tuned OpenAI Whisper-large-v3 for Slovenian automatic speech recognition, augmented with TTS-generated synthetic data.

This model accompanies the paper "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). Paper coming soon.

Note: Fine-tuning Whisper on CommonVoice data degrades FLEURS performance for Slovenian. Zero-shot Whisper (37.02% raw / 19.18% normalized) outperforms all fine-tuned variants on FLEURS. This is because Whisper's pre-trained model handles FLEURS-style punctuation and casing, which fine-tuning on CV-style data erodes. On the in-domain CV17 test set, this model provides substantial improvements over both zero-shot and CV-only fine-tuning.

Model Description

  • Architecture: Encoder-decoder Transformer (Whisper)
  • Parameters: 1.55B
  • Tokenizer: 51,865-token byte-level BPE
  • Base model: openai/whisper-large-v3
  • Fine-tuning data: CommonVoice 17.0 Slovenian + ~5,850 synthetic sentences (paraphrase and domain-expansion categories; morphological category excluded)
  • Training config: CV + Synth No Morph (synthetic corpus excluding morphological augmentation)

Evaluation Results

Raw WER/CER (no text normalization)

| Test Set            | WER   | CER   |
|---------------------|-------|-------|
| CommonVoice 17 Test | 15.65 | 3.88  |
| CommonVoice 17 Val  | 15.08 | 3.60  |
| FLEURS Test         | 40.46 | 11.98 |

Normalized WER/CER (lowercase + punctuation removal)

| Test Set            | WER   | CER  |
|---------------------|-------|------|
| CommonVoice 17 Test | 13.24 | 3.48 |
| CommonVoice 17 Val  | 12.98 | 3.24 |
| FLEURS Test         | 24.52 | 8.32 |
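The normalization used for the second table (lowercasing plus punctuation removal) can be sketched as below. This is an illustrative implementation, not the exact normalizer used to produce these scores, and the Slovenian sentences are toy examples:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

ref, hyp = "Danes je lepo vreme.", "danes je lepo vreme"
print(wer(ref, hyp))                          # 0.5 -- case and punctuation count as errors
print(wer(normalize(ref), normalize(hyp)))    # 0.0 -- identical after normalization
```

This is why raw and normalized WER diverge so sharply on FLEURS: FLEURS references carry punctuation and casing that inflate raw error counts.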

Improvement over baselines

| Comparison               | CV17 Test (WER) | FLEURS Test (WER) |
|--------------------------|-----------------|-------------------|
| vs. Zero-shot            | -5.55 pp        | +3.44 pp*         |
| vs. CV-only fine-tuning  | -3.66 pp        | -6.33 pp          |

*FLEURS degrades vs. zero-shot due to domain mismatch (see note above), but improves substantially vs. CV-only fine-tuning.

All CV17 improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
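A paired bootstrap of this kind resamples utterances with replacement and compares corpus-level WER between the two systems on each resample. The sketch below shows the idea with toy per-utterance error counts (the real test used the actual test-set utterances and 100,000 resamples):

```python
import random

def paired_bootstrap_p(errors_a, errors_b, words, n_resamples=100_000, seed=42):
    """One-sided paired bootstrap: fraction of resamples where system A
    is NOT better than system B (an estimate of the p-value).

    errors_a / errors_b: per-utterance word-error counts for each system.
    words: per-utterance reference word counts (shared, since the test is paired).
    """
    rng = random.Random(seed)
    n = len(words)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterance indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        wer_a = sum(errors_a[i] for i in idx) / sum(words[i] for i in idx)
        wer_b = sum(errors_b[i] for i in idx) / sum(words[i] for i in idx)
        if wer_a >= wer_b:
            not_better += 1
    return not_better / n_resamples

# Toy counts: system A (fine-tuned) vs. system B (baseline).
errors_a = [0, 1, 0, 2, 1, 0, 0, 1, 0, 1]
errors_b = [1, 2, 1, 3, 2, 1, 0, 2, 1, 2]
words    = [8, 10, 7, 12, 9, 8, 6, 11, 9, 10]
print(paired_bootstrap_p(errors_a, errors_b, words, n_resamples=10_000))
```

Pairing matters: both systems are scored on the same resampled utterances, so per-utterance difficulty cancels out.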

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-slovenian")
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-slovenian")

# Load audio (16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)

# Transcribe
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="sl", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Training Details

  • Optimizer: AdamW fused (lr=5e-5)
  • Schedule: Linear decay with 10% warmup
  • Effective batch size: 128 (batch size 64 × 2 gradient-accumulation steps)
  • Epochs: 5
  • Best model: selected by eval_loss
  • Precision: bf16
  • Seed: 42

Synthetic Data Augmentation

The synthetic training data was generated using a three-stage pipeline:

  1. Text generation: GPT-5-mini generates diverse sentences across paraphrase, domain expansion, and morphological categories
  2. LLM-as-judge validation: Each sentence validated for grammaticality, naturalness, and language purity
  3. Speech synthesis: OpenAI gpt-4o-mini-tts with 11-voice rotation

Dataset: yuriyvnv/synthetic_asr_et_sl
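The 11-voice rotation in stage 3 amounts to a round-robin assignment of voices to validated sentences. A minimal sketch, with placeholder voice names (the actual 11 voices are not listed here) and no real TTS call:

```python
from itertools import cycle

# Placeholder identifiers; the actual 11 voices used are not listed in the card.
VOICES = [f"voice_{i}" for i in range(11)]

def assign_voices(sentences):
    """Pair each validated sentence with the next voice in a round-robin cycle,
    so every voice is used roughly equally across the corpus."""
    voice_iter = cycle(VOICES)
    return [(sentence, next(voice_iter)) for sentence in sentences]

sentences = [f"sentence {i}" for i in range(23)]
pairs = assign_voices(sentences)
print(pairs[0])   # ('sentence 0', 'voice_0')
print(pairs[11])  # ('sentence 11', 'voice_0') -- cycle wraps after 11 voices
print(pairs[12])  # ('sentence 12', 'voice_1')
```

Rotating voices spreads speaker variability evenly, so the synthetic audio does not overfit the ASR model to a single TTS voice.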

Acknowledgments
