---
language:
# ISO 639-1 (official)
- aa
- ab
- ae
- af
- ak
- am
- an
- ar
- as
- av
- ay
- az
- ba
- be
- bg
- bh
- bi
- bm
- bn
- bo
- br
- bs
- ca
- ce
- ch
- co
- cr
- cs
- cu
- cv
- cy
- da
- de
- dv
- dz
- ee
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fj
- fo
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- gv
- ha
- he
- hi
- ho
- hr
- ht
- hu
- hy
- hz
- ia
- id
- ie
- ig
- ii
- ik
- io
- is
- it
- iu
- ja
- jv
- ka
- kg
- ki
- kj
- kk
- kl
- km
- kn
- ko
- kr
- ks
- ku
- kv
- kw
- ky
- la
- lb
- lg
- li
- ln
- lo
- lt
- lu
- lv
- mg
- mh
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- na
- nb
- nd
- ne
- ng
- nl
- nn
- no
- nr
- nv
- ny
- oc
- oj
- om
- or
- os
- pa
- pi
- pl
- ps
- pt
- qu
- rm
- rn
- ro
- ru
- rw
- sa
- sc
- sd
- se
- sg
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- ss
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tk
- tl
- tn
- to
- tr
- ts
- tt
- tw
- ty
- ug
- uk
- ur
- uz
- ve
- vi
- vo
- wa
- wo
- xh
- yi
- yo
- za
- zh
- zu
- fil  # Filipino
- cmn  # Mandarin Chinese
- yue  # Cantonese
- ars  # Najdi Arabic
- ary  # Moroccan Arabic
- arz  # Egyptian Arabic
- prs  # Dari
- pes  # Iranian Persian
- bho  # Bhojpuri
- mai  # Maithili
- hif  # Fiji Hindi
- tzm  # Central Atlas Tamazight
- kab  # Kabyle
- ber  # Berber (macro)
- srd  # Sardinian
- ast  # Asturian
- lad  # Ladino
- lmo  # Lombard
- nap  # Neapolitan
- ckb  # Central Kurdish (Sorani)
library_name: transformers
tags:
- speech
- audio
- automatic-speech-recognition
- asr
- multi-lingual
- transformers
- heep
- heep-universal
- entropy-based-curation
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Cross-Architecture Validation with HEEP-Indic

### 🔗 Resources

* **Reproducibility (Universal Model):** [https://huggingface.co/bc7ec356/heep-universal](https://huggingface.co/bc7ec356/heep-universal)
* **Cross-Architecture Model (Indic):** [https://huggingface.co/bc7ec356/heep-indic](https://huggingface.co/bc7ec356/heep-indic)

## Cross-Architecture Generalization

To directly address concerns about
generalization beyond Whisper V3 Turbo, we trained **Qwen3-ASR (1.7B)**, an architecturally distinct audio-language model, on HEEP-curated data spanning **46 Indian languages** (~4.78M utterances). The curation pipeline is identical to the one described in the paper, with no architecture-specific tuning.

## Hindi Benchmark Comparison (7 Benchmarks)

All values are WER (%); lower is better. A dash (–) marks a benchmark with no reported result.

| Model                      | Kathbath | Kathbath Noisy | CommonVoice |  FLEURS   | IndicTTS |  RESPIN   | Gramvaani | **Avg**  |
| :------------------------- | :------: | :------------: | :---------: | :-------: | :------: | :-------: | :-------: | :------: |
| Google STT                 |   14.3   |      16.7      |    20.8     |   19.4    |   18.3   |     –     |   59.9    |   24.9   |
| IndicWav2Vec               |   12.2   |      16.2      |    20.2     |   18.3    |   15.0   |     –     |   42.1    |   20.7   |
| Azure STT                  |   13.6   |      15.1      |    14.6     |   24.3    |   15.2   |     –     |   42.3    |   20.8   |
| Nvidia Conformer-CTC Large |   12.7   |      14.2      |    21.2     |   15.7    |   12.2   |     –     |   42.6    |   19.8   |
| IndicWhisper               |   10.3   |      12.0      |    15.0     |   11.4    |   7.6    |     –     |   26.8    |   13.8   |
| **HEEP-Indic**             | **8.53** |    **8.97**    |  **9.96**   | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |

**HEEP-Indic achieves 11.9% average Hindi WER vs.
13.8% for IndicWhisper (14% relative improvement).** The baseline averages cover six benchmarks, because no baseline results are reported for RESPIN, while the HEEP-Indic average covers all seven. Restricted to the same six benchmarks, HEEP-Indic averages about 11.8%.

## Multilingual Results (16 Languages)

HEEP-Indic WER (%) by benchmark and language; lower is better. A dash (–) marks a language that the benchmark does not cover.

| Dataset       |   Ben    |   Bho    |   Chh    |   Guj    |   Hin    |   Kan    |   Mag    |   Mai    |   Mal    |   Mar    |   Odi    |   Pun    |   San    |   Tam    |   Tel    |   Urd    | **Avg**  |
| :------------ | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| Kathbath      |   14.6   |    –     |    –     |   17.4   |   8.5    |   23.0   |    –     |    –     |   39.3   |   19.2   |   25.4   |   15.8   |   41.4   |   30.3   |   29.0   |   12.1   |   23.0   |
| Kathbath Hard |   15.7   |    –     |    –     |   18.5   |   9.0    |   25.1   |    –     |    –     |   41.2   |   20.4   |   27.7   |   16.6   |   43.6   |   32.6   |   30.3   |   11.9   |   24.4   |
| CommonVoice   |   21.0   |    –     |    –     |    –     |   10.0   |    –     |    –     |    –     |   46.0   |   21.5   |   34.6   |   17.5   |    –     |   34.0   |    –     |   20.6   |   25.7   |
| FLEURS        |   22.4   |    –     |    –     |   23.3   |   11.0   |   23.1   |    –     |    –     |   34.4   |   25.5   |   33.3   |   25.0   |    –     |   35.1   |   31.9   |   22.4   |   26.1   |
| IndicTTS      |   15.8   |    –     |    –     |   16.9   |   6.6    |   19.6   |    –     |    –     |   26.4   |   14.5   |   14.8   |    –     |    –     |   22.6   |   31.3   |    –     |   18.7   |
| Gramvaani     |    –     |    –     |    –     |    –     |   26.0   |    –     |    –     |    –     |    –     |    –     |    –     |    –     |    –     |    –     |    –     |    –     |   26.0   |
| RESPIN        |   32.5   |   21.3   |   21.6   |    –     |   12.1   |   45.6   |   27.7   |   41.1   |    –     |   32.7   |    –     |    –     |    –     |    –     |   37.5   |    –     |   30.2   |
| **Avg**       | **20.4** | **21.3** | **21.6** | **19.0** | **11.9** | **27.3** | **27.7** | **41.1** | **37.5** | **22.3** | **27.2** | **18.7** | **42.5** | **30.9** | **32.0** | **16.7** | **24.6** |

## Key Takeaways

1. **Cross-architecture generalization confirmed.** The same HEEP pipeline improves two distinct backbones: Whisper V3 Turbo (0.8B, encoder-decoder) and Qwen3-ASR (1.7B, audio-language model), without modification.
2. **Controlled multilingual evaluation.** Results span 16 languages across Indo-Aryan, Dravidian, and Classical families on standardized benchmarks with consistent evaluation protocols.
3. **Model-independent scoring.** Entropy scoring operates on MFCCs, G2P phonemes, and token distributions, not model internals. The same curated dataset was used for both backbones.
4. **Reproducibility.** Model weights, curation code, and training scripts for both backbones are available in the anonymous repository.

---

## Model Overview

HEEP Universal supports transcription across **204 languages**, including a wide range of Indic and global languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription, capturing spoken content word-for-word.

**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

## HEEP Methodology

HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.
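
To make the idea concrete, here is a minimal sketch, not the released curation code. It scores toy transcripts by the Shannon entropy of their token distribution, as a stand-in for HEEP's combined acoustic, phonetic, linguistic, contextual, and mutual-information score, and keeps the samples above a threshold that is multiplied by a growth factor each round. The function names, the toy data, and the parameter values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def token_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the whitespace-token distribution of `text`."""
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def progressive_filter(samples, tau0, growth, max_rounds, min_samples=1):
    """Keep samples whose score exceeds a geometrically growing threshold.

    Hypothetical simplification: the score is token entropy only, and no
    training or error-aware re-scoring happens between rounds.
    """
    kept = list(samples)
    tau = tau0
    for _ in range(max_rounds):
        if len(kept) <= min_samples:
            break
        kept = [s for s in kept if token_entropy(s) > tau]
        tau *= growth  # tau_{k+1} = tau_k * growth
    return kept


# Toy transcripts (illustrative only), ordered from redundant to diverse.
toy = [
    "yes yes yes yes",
    "the cat sat on the mat",
    "quarterly revenue grew nine percent despite weaker demand",
]
curated = progressive_filter(toy, tau0=1.0, growth=1.6, max_rounds=3)
print(curated)
```

With these toy values the thresholds applied are 1.0, 1.6, and 2.56, so the fully repetitive transcript is dropped in the first round and only the transcript with eight distinct tokens survives the third. A fuller version would replace `token_entropy` with the weighted score S(x) and re-score samples between rounds.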
### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:

- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to the dataset
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)

#### Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```

#### Selection Criterion

Samples are selected when their score exceeds a threshold:

```
D' = {x ∈ D : S(x) > τ}
```

#### Progressive Filtering (Equation 8)

With `growth_factor > 1`, the threshold increases exponentially across rounds, so that `τ_k = τ₀ · growth_factor^k`:

```
τ_{k+1} = τ_k · growth_factor
```

#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on model errors:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g (growth_factor above),
       stopping limits min_samples and max_rounds
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
   b. If error_patterns available:
      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
      Otherwise: S'(x) = S(x)
   c. D* ← {x ∈ D* : S'(x) > τₖ}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τₖ₊₁ ← τₖ · g
   g. k ← k + 1
6. Return D*
```

### Key Benefits

- Training on **10-20% of data** while matching or exceeding full-dataset performance
- Efficient multilingual model development with cross-lingual transfer
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time

## Performance Benchmarks

### OpenASR Leaderboard Results

| Dataset                | WER (%) | RTFx   |
| ---------------------- | ------- | ------ |
| AMI Test               | 4.19    | 70.22  |
| Earnings22 Test        | 5.83    | 101.52 |
| GigaSpeech Test        | 4.99    | 131.09 |
| LibriSpeech Test Clean | 0.71    | 158.74 |
| LibriSpeech Test Other | 2.17    | 142.40 |
| SPGISpeech Test        | 1.10    | 170.85 |
| TedLium Test           | 1.43    | 153.34 |
| VoxPopuli Test         | 4.34    | 179.28 |

### Composite Results

- **Overall WER**: 3.10%
- **Average RTFx**: 146.23

*RTFx (inverse real-time factor) is the ratio of audio duration to processing time. Higher values mean faster processing.*

## Model Details

- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription
- **Languages**: 204 languages supported
- **Format**: Transformers compatible (safetensors)
- **Sampling Rate**: 16 kHz
- **Precision**: FP16/FP32 supported
- **Optimization**: Real-time inference capable with GPU acceleration

## Key Features

- **Exceptional Accuracy**: Achieves 3.10% WER across diverse English test sets
- **Real-Time Performance**: Average RTFx of 146.23 enables real-time applications
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read speech
- **Multilingual Support**: 204 languages with cross-lingual transfer learning
- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density

## Usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

# Use the GPU with half precision when available; otherwise fall back to CPU and FP32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model weights (safetensors) and move them to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bc7ec356/heep-universal",
    torch_dtype=torch_dtype,
    use_safetensors=True,
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained("bc7ec356/heep-universal")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file and print the recognized text.
result = pipe("audio.wav")
print(result["text"])
```

## Use Cases

HEEP Universal excels in various speech recognition scenarios:

- **Meeting Transcription**: High accuracy on conversational speech (AMI: 4.19% WER)
- **Financial Communications**: Specialized performance on earnings calls (Earnings22: 5.83% WER)
- **Broadcast Media**: Excellent results on news, podcasts, and media content
- **Educational Content**: Optimized for lectures and presentations
- **Customer Support**: Accurate transcription of support calls
- **Legal Documentation**: Professional-grade accuracy for legal proceedings
- **Medical Transcription**: High-quality transcription for medical consultations

## Performance Optimization Tips

- **GPU Acceleration**: Use `device="cuda"` for significantly faster inference
- **Precision**: Set `torch_dtype=torch.float16` for optimal speed on modern GPUs
- **Language Specification**: Specify the language code when known to improve accuracy and speed
- **Beam Size**: Use `num_beams=5` (the Transformers generation argument for beam width, passed through `generate_kwargs`) for best accuracy; reduce it for faster inference
- **Batch Processing**: Process multiple files with a single model instance for efficiency

## Acknowledgments

HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.
## Citation

If you use this model in your research, please cite:

```bibtex
@article{anonymous2026heep,
  title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}
```