whisper-small-arabic-cv18-lora

Fine-tuned OpenAI Whisper-small for Arabic ASR using LoRA, trained on Common Voice 18 Arabic under a strict 16 GB VRAM constraint.

MBZUAI — Institute of Foundation Models · Technical Test · April 2026 Author: Nassima OULD OUALI

Model description

This model is a parameter-efficient fine-tune of openai/whisper-small (244M parameters) using LoRA (Low-Rank Adaptation). Only 0.73% of parameters are trainable (≈1.77M out of 244M), enabling training within a 12.3 GB peak VRAM footprint.

The model targets Arabic ASR across Modern Standard Arabic (MSA), with awareness of the specific challenges of the Arabic language: diglossia, morphological richness, diacritics, and data scarcity.

Training data

Split	Samples	Duration
Train (used)	5,000	5.72 h
Full train available	28,410	≈ 35.5 h
Validation	10,471	≈ 13.1 h
Test	10,471	≈ 13.1 h

Source: MohamedRashad/common-voice-18-arabic — unofficial CC-0 Arabic extraction of Mozilla Common Voice 18.0, in Parquet format.

Evaluation results

All metrics computed on normalised undiacritised Arabic after applying a consistent normalize_arabic() pipeline to both predictions and references. These figures are not directly comparable to results reported on raw orthography.

Model	Condition	WER (%)	CER (%)
Whisper-small	Zero-shot Arabic	47.55	21.11
Whisper-small + LoRA	Fine-tuned (this model)	37.39	12.14
Whisper-small	Full fine-tune, FLEURS Arabic†	23.76	—

† Different benchmark (FLEURS), different normalisation — not directly comparable.

Gain over zero-shot baseline:

WER: −10.16 pp absolute (−21.4% relative)
CER: −9.0 pp absolute (−42.5% relative)
Perfect transcriptions: 16.9% → 20.1%

Training dynamics

Epoch	Val WER (%)	Val CER (%)	Train Loss
1	34.51	10.88	1.11
2	32.68	9.41	0.43
3	31.05	8.89	0.38
4	30.81	8.88	0.35
5	30.39	8.76	0.33

Monotonic WER decrease across all 5 epochs — no overfitting observed.

Usage

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load base model + LoRA adapter
processor = WhisperProcessor.from_pretrained(
    "nassimaODL/whisper-small-arabic-cv18-lora",
    language="arabic",
    task="transcribe"
)
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(
    base_model,
    "nassimaODL/whisper-small-arabic-cv18-lora"
)
model = model.merge_and_unload()  # merge adapters for faster inference
model.eval()

# Transcribe
import numpy as np

def transcribe(audio_array, sampling_rate=16000):
    forced_ids = processor.get_decoder_prompt_ids(language="arabic", task="transcribe")
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    )
    with torch.no_grad():
        ids = model.generate(
            inputs.input_features,
            forced_decoder_ids=forced_ids,
            no_repeat_ngram_size=3,   # prevents repetition loops on short audio
        )
    return processor.decode(ids[0], skip_special_tokens=True)

Note: set no_repeat_ngram_size=3 to suppress the repetition loop pathology that affects ≈1% of short or low-energy audio samples (known Whisper failure mode).

LoRA configuration

Parameter	Value
Base model	openai/whisper-small
Method	LoRA
Rank r	16
Alpha α	32
Dropout	0.05
Target modules	q_proj, v_proj (all attention layers, encoder + decoder)
Trainable params	≈1.77M / 244M (0.73%)
Peak GPU memory	12.3 GB

Why q_proj + v_proj in encoder and decoder? Including v_proj alongside q_proj consistently outperforms q_proj alone on low-resource languages (LoRA-Whisper, Interspeech 2024; Pashto ASR, arXiv:2604.06507). The encoder is included because CV-18 Arabic's acoustic properties (Quranic prosody, diacritised read speech) differ substantially from Whisper's pre-training distribution.

Training configuration

Hyperparameter	Value
Learning rate	1e-4
Effective batch size	32 (batch=16 × grad_accum=2)
LR scheduler	Linear warmup (100 steps) + decay
Epochs	5
Early stopping	Patience = 3 on val WER
Mixed precision	fp16
Optimizer	AdamW
Training time	≈ 30 min (NVIDIA A100)
Primary metric	Normalised WER (not loss)

Limitations

MSA only. Trained on read MSA + Quranic text. Generalises poorly to dialectal Arabic (Darija, Egyptian, Gulf). This is the most critical limitation for real-world deployment.
Read speech bias. Common Voice is scripted read speech — not conversational Arabic.
Undiacritised output. Cannot reliably produce diacritised transcriptions, which limits compatibility with downstream TTS pipelines.
Repetition loops. Affects ≈1% of short/low-energy audio. Mitigated with no_repeat_ngram_size=3 at inference time.
Short utterance regression. Occasional over-generation on single-word inputs.

Intended use

This model is intended for research and experimentation with Arabic ASR, particularly in the context of:

MSA speech transcription
Parameter-efficient fine-tuning methodology on low-resource languages
Baseline for dialectal Arabic adaptation

It is not recommended for production deployment without further evaluation on dialectal speech and spontaneous conversation data.

References

Radford et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML. arXiv:2212.04356
Hu et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR. arXiv:2106.09685
Talafha et al. (2023). N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition. Interspeech. arXiv:2306.02902
Song et al. (2024). LoRA-Whisper: Parameter-Efficient and Generalizable Multilingual ASR. Interspeech. arXiv:2406.07947
Hanif et al. (2025). Fine-Tuning Whisper for Pashto ASR. arXiv:2604.06507

Downloads last month: 35

Model tree for nassimaODL/whisper-small-arabic-cv18-lora

Base model

openai/whisper-small

Adapter

(218)

this model

Dataset used to train nassimaODL/whisper-small-arabic-cv18-lora

Papers for nassimaODL/whisper-small-arabic-cv18-lora

Fine-tuning Whisper for Pashto ASR: strategies and scale

Paper • 2604.06507 • Published 12 days ago

Inverse scattering problem for third order differential operators on the whole axis

Paper • 2406.07947 • Published Jun 12, 2024

Evaluation results

WER on Common Voice 18 Arabic
test set self-reported

37.390
CER on Common Voice 18 Arabic
test set self-reported

12.140