---
language:
  # ISO 639-1 (official)
  - aa
  - ab
  - ae
  - af
  - ak
  - am
  - an
  - ar
  - as
  - av
  - ay
  - az
  - ba
  - be
  - bg
  - bh
  - bi
  - bm
  - bn
  - bo
  - br
  - bs
  - ca
  - ce
  - ch
  - co
  - cr
  - cs
  - cu
  - cv
  - cy
  - da
  - de
  - dv
  - dz
  - ee
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fj
  - fo
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - gv
  - ha
  - he
  - hi
  - ho
  - hr
  - ht
  - hu
  - hy
  - hz
  - ia
  - id
  - ie
  - ig
  - ii
  - ik
  - io
  - is
  - it
  - iu
  - ja
  - jv
  - ka
  - kg
  - ki
  - kj
  - kk
  - kl
  - km
  - kn
  - ko
  - kr
  - ks
  - ku
  - kv
  - kw
  - ky
  - la
  - lb
  - lg
  - li
  - ln
  - lo
  - lt
  - lu
  - lv
  - mg
  - mh
  - mi
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - na
  - nb
  - nd
  - ne
  - ng
  - nl
  - nn
  - no
  - nr
  - nv
  - ny
  - oc
  - oj
  - om
  - or
  - os
  - pa
  - pi
  - pl
  - ps
  - pt
  - qu
  - rm
  - rn
  - ro
  - ru
  - rw
  - sa
  - sc
  - sd
  - se
  - sg
  - si
  - sk
  - sl
  - sm
  - sn
  - so
  - sq
  - sr
  - ss
  - st
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - ti
  - tk
  - tl
  - tn
  - to
  - tr
  - ts
  - tt
  - tw
  - ty
  - ug
  - uk
  - ur
  - uz
  - ve
  - vi
  - vo
  - wa
  - wo
  - xh
  - yi
  - yo
  - za
  - zh
  - zu
  - fil   # Filipino
  - cmn   # Mandarin Chinese
  - yue   # Cantonese
  - ars   # Najdi Arabic
  - ary   # Moroccan Arabic
  - arz   # Egyptian Arabic
  - prs   # Dari
  - pes   # Iranian Persian
  - bho   # Bhojpuri
  - mai   # Maithili
  - hif   # Fiji Hindi
  - tzm   # Central Atlas Tamazight
  - kab   # Kabyle
  - ber   # Berber (macro)
  - srd   # Sardinian
  - ast   # Asturian
  - lad   # Ladino
  - lmo   # Lombard
  - nap   # Neapolitan
  - ckb   # Central Kurdish (Sorani)

library_name: transformers
tags:
- speech
- audio
- automatic-speech-recognition
- asr
- multi-lingual
- transformers
- heep
- heep-universal
- entropy-based-curation
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---


# Cross-Architecture Validation with HEEP-Indic

### 🔗 Resources

* **Reproducibility (Universal Model):** [https://huggingface.co/bc7ec356/heep-universal](https://huggingface.co/bc7ec356/heep-universal)
* **Cross-Architecture Model (Indic):** [https://huggingface.co/bc7ec356/heep-indic](https://huggingface.co/bc7ec356/heep-indic)


## Cross-Architecture Generalization

To directly address concerns about generalization beyond Whisper V3 Turbo, we trained **Qwen3-ASR (1.7B)**, an architecturally distinct audio-language model, on HEEP-curated data spanning **46 Indian languages** (~4.78M utterances). The curation pipeline is identical to the one described in the paper with no architecture-specific tuning.

## Hindi Benchmark Comparison (7 Benchmarks)

| Model                      | Kathbath |  Kathbath Hard | CommonVoice |   FLEURS  | IndicTTS |   RESPIN  | Gramvaani |  **Avg** |
| :------------------------- | :------: | :------------: | :---------: | :-------: | :------: | :-------: | :-------: | :------: |
| Google STT                 |   14.3   |      16.7      |     20.8    |    19.4   |   18.3   |     –     |    59.9   |   24.9   |
| IndicWav2Vec               |   12.2   |      16.2      |     20.2    |    18.3   |   15.0   |     –     |    42.1   |   20.7   |
| Azure STT                  |   13.6   |      15.1      |     14.6    |    24.3   |   15.2   |     –     |    42.3   |   20.8   |
| Nvidia Conformer-CTC Large |   12.7   |      14.2      |     21.2    |    15.7   |   12.2   |     –     |    42.6   |   19.8   |
| IndicWhisper               |   10.3   |      12.0      |     15.0    |    11.4   |    7.6   |     –     |    26.8   |   13.8   |
| **HEEP-Indic**             | **8.53** |    **8.97**    |   **9.96**  | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |

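**HEEP-Indic achieves an 11.9% average Hindi WER vs. 13.8% for IndicWhisper (a 14% relative improvement).** Baseline averages cover the six benchmarks with reported baseline numbers; no baseline results are available for RESPIN.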
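
The averages and the relative improvement can be re-derived from the table. The snippet below is illustrative arithmetic only. It assumes the IndicWhisper average is taken over the six benchmarks that have an IndicWhisper entry, which is consistent with the reported 13.8.

```python
# Hindi WER (%) copied from the table above; RESPIN has no IndicWhisper entry.
heep_indic = [8.53, 8.97, 9.96, 11.04, 6.59, 12.05, 25.98]
indic_whisper = [10.3, 12.0, 15.0, 11.4, 7.6, 26.8]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


heep_avg = mean(heep_indic)
whisper_avg = mean(indic_whisper)

# Relative WER reduction, computed from the rounded averages in the table.
relative_reduction = (13.8 - 11.9) / 13.8

print(f"HEEP-Indic average WER:   {heep_avg:.2f}")
print(f"IndicWhisper average WER: {whisper_avg:.2f}")
print(f"Relative reduction:       {relative_reduction:.1%}")
```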

## Multilingual Results (16 Languages)

| Dataset       |    Ben   |    Bho   |    Chh   |    Guj   |    Hin   |    Kan   |    Mag   |    Mai   |    Mal   |    Mar   |    Odi   |    Pun   |    San   |    Tam   |    Tel   |    Urd   |  **Avg** |
| :------------ | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| Kathbath      |   14.6   |     –    |     –    |   17.4   |    8.5   |   23.0   |     –    |     –    |   39.3   |   19.2   |   25.4   |   15.8   |   41.4   |   30.3   |   29.0   |   12.1   |   23.0   |
| Kathbath Hard |   15.7   |     –    |     –    |   18.5   |    9.0   |   25.1   |     –    |     –    |   41.2   |   20.4   |   27.7   |   16.6   |   43.6   |   32.6   |   30.3   |   11.9   |   24.4   |
| CommonVoice   |   21.0   |     –    |     –    |     –    |   10.0   |     –    |     –    |     –    |   46.0   |   21.5   |   34.6   |   17.5   |     –    |   34.0   |     –    |   20.6   |   25.7   |
| FLEURS        |   22.4   |     –    |     –    |   23.3   |   11.0   |   23.1   |     –    |     –    |   34.4   |   25.5   |   33.3   |   25.0   |     –    |   35.1   |   31.9   |   22.4   |   26.1   |
| IndicTTS      |   15.8   |     –    |     –    |   16.9   |    6.6   |   19.6   |     –    |     –    |   26.4   |   14.5   |   14.8   |     –    |     –    |   22.6   |   31.3   |     –    |   18.7   |
| Gramvaani     |     –    |     –    |     –    |     –    |   26.0   |     –    |     –    |     –    |     –    |     –    |     –    |     –    |     –    |     –    |     –    |     –    |   26.0   |
| RESPIN        |   32.5   |   21.3   |   21.6   |     –    |   12.1   |   45.6   |   27.7   |   41.1   |     –    |   32.7   |     –    |     –    |     –    |     –    |   37.5   |     –    |   30.2   |
| **Avg**       | **20.4** | **21.3** | **21.6** | **19.0** | **11.9** | **27.3** | **27.7** | **41.1** | **37.5** | **22.3** | **27.2** | **18.7** | **42.5** | **30.9** | **32.0** | **16.7** | **24.6** |

## Key Takeaways

1. **Cross-architecture generalization confirmed.** The same HEEP pipeline improves two distinct backbones: Whisper V3 Turbo (0.8B, encoder-decoder) and Qwen3-ASR (1.7B, audio-language model), without modification.

2. **Controlled multilingual evaluation.** Results span 16 languages across the Indo-Aryan and Dravidian families, including classical Sanskrit, on standardized benchmarks with consistent evaluation protocols.

3. **Model-independent scoring.** Entropy scoring operates on MFCCs, G2P phonemes, and token distributions, not model internals. The same curated dataset was used for both backbones.

4. **Reproducibility.** Model weights, curation code, and training scripts for both backbones are available in the anonymous repository (see Resources above).

---

## Model Overview

HEEP Universal supports transcription across **204 languages**, including a wide range of Indic and global languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription, capturing spoken content word for word.

**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

## HEEP Methodology

HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:
- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to dataset
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
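
As a rough illustration of Equation 1, the sketch below computes a Shannon entropy from a toy phoneme sequence and combines five component values with the default weights. The helper names (`shannon_entropy`, `sample_score`) and all component values are hypothetical, and the sketch assumes each component has already been normalized to a comparable scale; it is not the released scorer.

```python
import math
from collections import Counter

# Default weights quoted above: (α₁, α₂, α₃, α₄, β).
DEFAULT_WEIGHTS = {
    "acoustic": 0.25,
    "phonetic": 0.20,
    "linguistic": 0.25,
    "contextual": 0.15,
    "mi": 0.15,
}


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sample_score(components, weights=None):
    """Weighted sum S(x) over the five components of Equation 1."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical sample: the phonetic term comes from a toy phoneme sequence,
# the other four values are made-up placeholders.
components = {
    "acoustic": 0.8,
    "phonetic": shannon_entropy(["k", "a", "t", "a"]),
    "linguistic": 0.6,
    "contextual": 0.4,
    "mi": 0.3,
}
score = sample_score(components)
print(f"S(x) = {score:.3f}")
```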

#### Mutual Information (Equation 2)

The mutual information between acoustic features and the transcription, where `f_j` ranges over acoustic feature values and `y_ℓ` over transcription symbols:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```
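
A minimal sketch of Equation 2, assuming the joint distribution `p(f_j, y_ℓ)` is already available as a table of probabilities. How the acoustic features are discretized and how the joint is estimated are not specified here, so the function name and the toy tables are illustrative assumptions.

```python
import math


def mutual_information(joint):
    """I(F; Y) in nats from a joint table with joint[j][k] = p(f_j, y_k)."""
    p_f = [sum(row) for row in joint]        # marginal over acoustic features
    p_y = [sum(col) for col in zip(*joint)]  # marginal over transcription symbols
    mi = 0.0
    for j, row in enumerate(joint):
        for k, p in enumerate(row):
            if p > 0.0:  # the 0 * log 0 terms are taken as 0
                mi += p * math.log(p / (p_f[j] * p_y[k]))
    return mi


# Toy joint distributions (made up): independent vs. perfectly dependent.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(dependent))
```

Independent variables give zero mutual information, while the perfectly dependent table gives `log 2` nats.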

#### Selection Criterion

Samples are selected based on a threshold:

```
D' = {x ∈ D : S(x) > τ}
```

#### Progressive Filtering (Equation 8)

The threshold increases exponentially across rounds (for `growth_factor` > 1):

```
τ_{k+1} = τ_k · growth_factor
```
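
Unrolling the recurrence gives `τ_k = τ₀ · growth_factor^k`. The sketch below applies the selection criterion with this schedule to a handful of scores; the scores and hyperparameters are hypothetical and chosen only to make the pruning visible.

```python
def threshold_schedule(tau0, growth_factor, rounds):
    """Return [τ_0, ..., τ_{rounds-1}] with τ_{k+1} = τ_k · growth_factor."""
    taus = []
    tau = tau0
    for _ in range(rounds):
        taus.append(tau)
        tau *= growth_factor
    return taus


def select(scores, tau):
    """Keep the ids whose score is strictly above τ: D' = {x : S(x) > τ}."""
    return {sample_id for sample_id, s in scores.items() if s > tau}


# Hypothetical scores and hyperparameters, for illustration only.
scores = {"a": 0.30, "b": 0.55, "c": 0.72, "d": 0.90}
taus = threshold_schedule(tau0=0.2, growth_factor=2.0, rounds=3)
kept_per_round = [select(scores, tau) for tau in taus]

for tau, kept in zip(taus, kept_per_round):
    print(f"tau={tau:.1f} keeps {sorted(kept)}")
```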

#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on model errors:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```
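
The adjustment is a weighted sum of the base score and two correction terms. The sketch below shows the arithmetic only: the definition of `error_relevance` (share of a sample's tokens that the model recently misrecognized), the token lists, and the λ values are all assumptions made for illustration, since the exact definitions of `ErrorRelevance` and `CrossLingualOverlap` are not given here.

```python
def error_relevance(sample_tokens, error_tokens):
    """Hypothetical proxy: fraction of a sample's tokens found in recent errors."""
    if not sample_tokens:
        return 0.0
    hits = sum(1 for tok in sample_tokens if tok in error_tokens)
    return hits / len(sample_tokens)


def adjusted_score(base, err_rel, cross_overlap, lambda_err, lambda_cross):
    """S'(x) = S(x) + λ_err·ErrorRelevance + λ_cross·CrossLingualOverlap."""
    return base + lambda_err * err_rel + lambda_cross * cross_overlap


# Made-up example: two of the four tokens were misrecognized in round k.
errors_k = {"nahin", "kyon"}
rel = error_relevance(["main", "nahin", "jaanta", "kyon"], errors_k)
boosted = adjusted_score(0.50, rel, cross_overlap=0.2,
                         lambda_err=0.2, lambda_cross=0.1)
print(f"ErrorRelevance={rel:.2f}, S'(x)={boosted:.2f}")
```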

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g, min_samples, max_rounds,
       optional train_callback and eval_callback
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
    a. For each x in D*:
        Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
    b. If error_patterns available:
        Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
       Otherwise: S'(x) = S(x)
    c. D* ← {x ∈ D* : S'(x) > τₖ}
    d. If train_callback: Train model on D*
    e. If eval_callback: Analyze errors, update error_patterns
    f. τₖ₊₁ ← τₖ · g
    g. k ← k + 1
6. Return D*
```
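
The loop above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the released implementation: base scores are precomputed and passed in as a dictionary, training and evaluation are omitted, and the `adjust` hook is a hypothetical stand-in for the error-aware adjustment in step 5b.

```python
from typing import Callable, Dict, Hashable, Optional


def heep_curate(
    scores: Dict[Hashable, float],
    tau0: float,
    growth: float,
    min_samples: int = 1,
    max_rounds: int = 10,
    adjust: Optional[Callable[[Hashable, float, int], float]] = None,
) -> Dict[Hashable, float]:
    """Progressively prune `scores` (sample id -> S(x)) with a growing threshold.

    `adjust(sample_id, score, round_index)` is a hypothetical hook; when it is
    None, the adjusted score S'(x) equals S(x).
    """
    curated = dict(scores)
    tau = tau0
    k = 0
    while len(curated) > min_samples and k < max_rounds:
        # Steps 5a/5b: base score, optionally adjusted for the current round.
        adjusted = {
            sid: (adjust(sid, s, k) if adjust else s)
            for sid, s in curated.items()
        }
        # Step 5c: keep samples whose adjusted score exceeds the threshold.
        curated = {sid: curated[sid] for sid, s in adjusted.items() if s > tau}
        # Steps 5d/5e (training and error analysis) would run here.
        tau *= growth  # step 5f
        k += 1         # step 5g
    return curated


# Hypothetical scores, for illustration only.
toy_scores = {"a": 0.30, "b": 0.55, "c": 0.72, "d": 0.90}
print(sorted(heep_curate(toy_scores, tau0=0.2, growth=2.0)))
```

As in the pseudocode, the size check happens before each filtering step, so the final round can leave fewer than `min_samples` samples.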

### Key Benefits

- Training on **10-20% of data** while matching or exceeding full-dataset performance
- Efficient multilingual model development with cross-lingual transfer
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time

## Performance Benchmarks

### OpenASR Leaderboard Results

| Dataset                | WER (%) | RTFx   |
| ---------------------- | ------- | ------ |
| AMI Test               | 4.19    | 70.22  |
| Earnings22 Test        | 5.83    | 101.52 |
| GigaSpeech Test        | 4.99    | 131.09 |
| LibriSpeech Test Clean | 0.71    | 158.74 |
| LibriSpeech Test Other | 2.17    | 142.40 |
| SPGISpeech Test        | 1.10    | 170.85 |
| TedLium Test           | 1.43    | 153.34 |
| VoxPopuli Test         | 4.34    | 179.28 |

### Composite Results
- **Overall WER**: 3.10%
- **Average RTFx**: 146.23

*RTFx (inverse real-time factor) is the ratio of audio duration to processing time. Higher values mean faster processing.*
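
For orientation, the snippet below recomputes the unweighted mean of the eight per-dataset WERs (about 3.10%) and shows how an RTFx value is obtained from timings. The timing numbers are hypothetical and the `rtfx` helper is an illustrative name, not part of any benchmark tooling.

```python
# WER (%) per test set, copied from the table above.
wer = {
    "AMI": 4.19,
    "Earnings22": 5.83,
    "GigaSpeech": 4.99,
    "LibriSpeech Clean": 0.71,
    "LibriSpeech Other": 2.17,
    "SPGISpeech": 1.10,
    "TedLium": 1.43,
    "VoxPopuli": 4.34,
}
mean_wer = sum(wer.values()) / len(wer)


def rtfx(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of processing time."""
    return audio_seconds / processing_seconds


# Hypothetical timing: one hour of audio transcribed in 25 seconds.
example_rtfx = rtfx(3600.0, 25.0)

print(f"Mean WER over {len(wer)} test sets: {mean_wer:.3f}%")
print(f"Example RTFx: {example_rtfx:.1f}")
```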

## Model Details

- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription
- **Languages**: 204 languages supported
- **Format**: Transformers compatible (safetensors)
- **Sampling Rate**: 16 kHz
- **Precision**: FP16/FP32 supported
- **Optimization**: Real-time inference capable with GPU acceleration

## Key Features

- **Exceptional Accuracy**: Achieves 3.10% WER across diverse English test sets
- **Real-Time Performance**: Average RTFx of 146.23 enables real-time applications
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read speech
- **Multilingual Support**: 204 languages with cross-lingual transfer learning
- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density


## Usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

# Use the GPU with half precision when available; fall back to CPU with FP32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model weights from the Hub in safetensors format.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bc7ec356/heep-universal",
    torch_dtype=torch_dtype,
    use_safetensors=True,
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained("bc7ec356/heep-universal")

# Wrap model and processor in an ASR pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file ("audio.wav" is a placeholder path).
result = pipe("audio.wav")
print(result["text"])
```

## Use Cases

HEEP Universal excels in various speech recognition scenarios:

- **Meeting Transcription**: High accuracy on conversational speech (AMI: 4.19% WER)
- **Financial Communications**: Specialized performance on earnings calls (Earnings22: 5.83% WER)
- **Broadcast Media**: Excellent results on news, podcasts, and media content
- **Educational Content**: Optimized for lectures and presentations
- **Customer Support**: Accurate transcription of support calls
- **Legal Documentation**: Professional-grade accuracy for legal proceedings
- **Medical Transcription**: High-quality transcription for medical consultations

## Performance Optimization Tips

- **GPU Acceleration**: Use `device="cuda"` for significantly faster inference
- **Precision**: Set `torch_dtype=torch.float16` for optimal speed on modern GPUs
- **Language Specification**: Specify language code when known to improve accuracy and speed
- **Beam Size**: Pass `generate_kwargs={"num_beams": 5}` to the pipeline call for best accuracy, and reduce it for faster inference
- **Batch Processing**: Process multiple files with a single model instance for efficiency
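
The tips above map onto call-time arguments of the Transformers ASR pipeline. The helper below is a hypothetical convenience function that only assembles those arguments. It assumes the checkpoint accepts a Whisper-style `language` hint through `generate()`, which should be verified for this model.

```python
def build_asr_call_kwargs(language=None, num_beams=5, batch_size=1):
    """Assemble keyword arguments for a Transformers ASR pipeline call."""
    generate_kwargs = {"num_beams": num_beams}
    if language is not None:
        # Assumption: the model accepts a Whisper-style language hint.
        generate_kwargs["language"] = language
    return {"generate_kwargs": generate_kwargs, "batch_size": batch_size}


kwargs = build_asr_call_kwargs(language="hi", num_beams=5, batch_size=8)
print(kwargs)

# With the `pipe` object from the Usage section (not created here):
# results = pipe(["first.wav", "second.wav"], **kwargs)
```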

## Acknowledgments

HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{anonymous2026heep,
  title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}
```