# Whisper-large-v3 Slovenian
Fine-tuned OpenAI Whisper-large-v3 for Slovenian automatic speech recognition, augmented with TTS-generated synthetic data.
This model is part of the paper: "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). Paper coming soon.
Note: Fine-tuning Whisper on CommonVoice data degrades FLEURS performance for Slovenian. Zero-shot Whisper (37.02% raw / 19.18% normalized WER) outperforms all fine-tuned variants on FLEURS, because the pre-trained model already produces FLEURS-style punctuation and casing, which fine-tuning on CV-style transcripts erodes. On the in-domain CV17 test set, however, this model improves substantially over both zero-shot and CV-only fine-tuning.
## Model Description
- Architecture: Encoder-decoder Transformer (Whisper)
- Parameters: 1.55B
- Tokenizer: 51,865-token byte-level BPE
- Base model: openai/whisper-large-v3
- Fine-tuning data: CommonVoice 17.0 Slovenian + ~5,850 synthetic sentences (paraphrase + domain expansion, without the morphological category)
- Training config: CV + Synth No Morph (synthetic corpus excluding morphological augmentation)
## Evaluation Results

### Raw WER/CER (no text normalization)
| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 15.65 | 3.88 |
| CommonVoice 17 Val | 15.08 | 3.60 |
| FLEURS Test | 40.46 | 11.98 |
### Normalized WER/CER (lowercase + punctuation removal)
| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 13.24 | 3.48 |
| CommonVoice 17 Val | 12.98 | 3.24 |
| FLEURS Test | 24.52 | 8.32 |
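The normalization behind the second table is lowercasing plus punctuation removal. A minimal sketch of such a normalizer is below; the paper's exact implementation (e.g. Whisper's own text normalizer) may differ, so treat this as illustrative:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace
    before computing WER/CER (illustrative, not the paper's exact code)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Dober dan, Ljubljana!"))  # → dober dan ljubljana
```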
### Improvement over baselines
| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -5.55 pp | +3.44 pp* |
| vs. CV-only fine-tuning | -3.66 pp | -6.33 pp |
*FLEURS degrades vs. zero-shot due to domain mismatch (see note above), but improves substantially vs. CV-only fine-tuning.
All CV17 improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
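The significance test above can be sketched as a paired bootstrap over per-utterance error counts. The function below is an illustrative one-sided variant with placeholder names and a smaller default resample count (the paper uses n = 100,000):

```python
import random

def paired_bootstrap_p(errors_a, errors_b, n_resamples=10_000, seed=42):
    """One-sided paired bootstrap: fraction of resamples in which
    system A does NOT make fewer total errors than system B.
    errors_a/errors_b are per-utterance error counts on the same test set."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) >= 0:
            not_better += 1
    return not_better / n_resamples

# Toy example: system A makes fewer errors on every utterance
p = paired_bootstrap_p([1] * 50, [3] * 50)
print(p)  # → 0.0
```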
## Usage
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-slovenian")
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-slovenian")

# Load audio (16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)

# Transcribe
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="sl", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Training Details
- Optimizer: AdamW fused (lr=5e-5)
- Schedule: Linear decay with 10% warmup
- Effective batch size: 128 (64 x 2 gradient accumulation)
- Epochs: 5
- Best model: selected by eval_loss
- Precision: bf16
- Seed: 42
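As a rough sketch, the hyperparameters above map onto Hugging Face `Seq2SeqTrainingArguments`-style keys as follows. The key names and values are an assumed reconstruction, not the authors' actual config:

```python
# Hypothetical mapping of the reported hyperparameters to
# Hugging Face TrainingArguments-style keys (assumed, not from the paper).
training_config = {
    "optim": "adamw_torch_fused",
    "learning_rate": 5e-5,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.10,
    "per_device_train_batch_size": 64,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 5,
    "metric_for_best_model": "eval_loss",
    "bf16": True,
    "seed": 42,
}

# Effective batch size = per-device batch × gradient accumulation steps
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # → 128
```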
## Synthetic Data Augmentation
The synthetic training data was generated using a three-stage pipeline:
1. Text generation: GPT-5-mini generates diverse sentences across paraphrase, domain-expansion, and morphological categories
2. LLM-as-judge validation: each sentence is validated for grammaticality, naturalness, and language purity
3. Speech synthesis: OpenAI gpt-4o-mini-tts with an 11-voice rotation
Dataset: yuriyvnv/synthetic_asr_et_sl
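The 11-voice rotation in stage 3 can be sketched as a simple cycle over the voice pool. Voice names below are placeholders (the card only states that 11 voices are rotated), and the actual TTS calls are omitted:

```python
from itertools import cycle

# Placeholder voice names; the card does not list the actual 11 voices.
voices = [f"voice_{i:02d}" for i in range(11)]
rotation = cycle(voices)

# Assign each synthetic sentence the next voice in the rotation
sentences = [f"sentence {i}" for i in range(25)]
assignments = [(s, next(rotation)) for s in sentences]

# Each voice recurs every 11 sentences
print(assignments[0][1], assignments[11][1])  # → voice_00 voice_00
```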
## Acknowledgments
- Base model: OpenAI Whisper-large-v3
- Training data: Mozilla Common Voice 17.0
- Evaluation: Google FLEURS