MMS-300M-1130 Forced Aligner (ONNX)
Maintainer: Sancho Pança | Architect: Dom Quixote
🚀 Overview
ONNX export of MahmoudAshraf/mms-300m-1130-forced-aligner, a Wav2Vec2-based CTC model fine-tuned for forced alignment across 1130+ languages.
This repository provides an inference path through ONNX Runtime that avoids the PyTorch dependency, which lowers memory use and simplifies deployment.
💎 Key Features
- Precision Timestamps: Generates word-level start/end boundaries at the model's 20 ms frame resolution.
- Massive Language Support: 1130+ languages supported via romanized text input.
- Low-Latency Inference: Optimized via ONNX Runtime for both CPU and NVIDIA GPU (CUDA).
- Memory Efficiency: Significantly lower RAM footprint compared to the standard torchaudio forced alignment API.
- Quantized Performance: Includes a q8 (INT8) variant (~340 MB) for ultra-fast CPU processing.
🏗️ Technical Architecture
The model utilizes the Wav2Vec2ForCTC architecture:
- Feature Extractor: 7-layer strided CNN.
- Total Stride: 320 samples (equivalent to 20ms per frame at 16 kHz).
- Encoder: 24-layer Transformer (Hidden Size: 1024, 16 Attention Heads).
- CTC Vocabulary: 31 tokens (a–z, ', <blank>, <pad>, </s>, <unk>).
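The 20 ms frame rate follows from the convolutional strides. As a rough sketch, the number of output frames T for N input samples can be estimated as below. The kernel sizes and strides are those of the standard Wav2Vec2 feature extractor and are an assumption here; they are not read from this repository's config.json.

```python
# Kernel sizes and strides of the standard Wav2Vec2 feature extractor
# (assumed here; compare with conv_kernel / conv_stride in config.json).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def total_stride(strides=CONV_STRIDES):
    """Product of all layer strides, in samples per output frame."""
    product = 1
    for s in strides:
        product *= s
    return product


def num_frames(n_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Number of encoder frames T produced for n_samples of audio."""
    length = n_samples
    for k, s in zip(kernels, strides):
        # Unpadded 1-D convolution: floor((L - k) / s) + 1
        length = (length - k) // s + 1
    return length


print(total_stride())     # 320
print(num_frames(16000))  # 49
```

One second of audio yields 49 frames rather than 50 because the convolutions are unpadded, so the trailing partial window is dropped.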
📦 Model Components
| File | Description | Footprint |
|---|---|---|
| model.q8.onnx | Recommended - INT8 quantized for CPU optimization. | ~340 MB |
| model.fp32.onnx | Full precision float32 for maximum numerical accuracy. | ~1.2 GB |
| vocab.json | Character-level mapping (31 tokens). | - |
| config.json | Model configuration and logic ratios. | - |
| preprocessor_config.json | Feature extractor normalization settings. | - |
🛠️ Usage Specs
Audio Requirements
- Sample Rate: 16,000 Hz (Required)
- Channels: Mono
- Normalization: Zero-mean, unit-variance.
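A minimal sketch of these requirements using NumPy is shown below. The function name prepare_waveform and the epsilon value are illustrative assumptions. Resampling is deliberately left out, so audio at other sample rates is rejected rather than converted.

```python
import numpy as np

TARGET_SAMPLE_RATE = 16_000


def prepare_waveform(samples, sample_rate):
    """Downmix to mono and normalize to zero mean and unit variance.

    `samples` is shaped [n_samples] or [n_samples, n_channels].
    Resampling is out of scope here, so other sample rates are rejected.
    """
    if sample_rate != TARGET_SAMPLE_RATE:
        raise ValueError(f"expected {TARGET_SAMPLE_RATE} Hz, got {sample_rate} Hz")
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:
        x = x.mean(axis=1)  # average the channels to get mono
    # Small epsilon avoids division by zero on silent input (assumed value).
    x = (x - x.mean()) / np.sqrt(x.var() + 1e-7)
    return x.astype(np.float32)
```

The settings in preprocessor_config.json remain the authoritative source for how normalization should be applied.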
Input Preprocessing
The model vocabulary is restricted to Latin characters. For non-Latin scripts (e.g., Arabic, Cyrillic, CJK), text must be romanized before alignment using the uroman tool.
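After romanization, the transcript still has to be reduced to the characters in the vocabulary. The sketch below shows one plausible way to do that. The function name normalize_for_alignment is hypothetical, and the romanization step itself (for example with uroman) is assumed to have already run.

```python
import re


def normalize_for_alignment(romanized_text):
    """Lowercase romanized text and keep only a-z, apostrophes, and spaces.

    Returns the list of words to align. Romanization is assumed to have
    happened before this function is called.
    """
    text = romanized_text.lower()
    text = re.sub(r"[^a-z' ]", " ", text)  # replace out-of-vocabulary characters
    return text.split()


print(normalize_for_alignment("Don't tilt at Windmills, Sancho!"))
# ["don't", 'tilt', 'at', 'windmills', 'sancho']
```

Because removed characters become spaces, hyphenated words are split into separate tokens. A caller that needs the original spelling in the output should keep its own mapping from normalized words back to the source text.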
I/O Schema
| Input Name | Type | Shape | Description |
|---|---|---|---|
| input_values | float32 | [1, N] | Raw mono waveform samples. |
| attention_mask | int64 | [1, N] | Constant 1 for valid audio signal. |
Output:
| Output Name | Type | Shape | Description |
|---|---|---|---|
| logits | float32 | [1, T, 31] | CTC emission logits; apply log-softmax over the last axis if log-probabilities are needed. |
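The schema above can be exercised with ONNX Runtime roughly as follows. This is a sketch, not the companion script: the helper names are illustrative, and the waveform is assumed to be already mono, 16 kHz, and normalized. A log-softmax is applied to the output, which has no effect if the export already emits normalized log-probabilities.

```python
import numpy as np


def build_inputs(waveform):
    """Wrap a 1-D normalized waveform into the two named model inputs."""
    x = np.asarray(waveform, dtype=np.float32)[np.newaxis, :]  # [1, N]
    mask = np.ones(x.shape, dtype=np.int64)                    # [1, N], all valid
    return {"input_values": x, "attention_mask": mask}


def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def run_model(model_path, waveform, providers=("CPUExecutionProvider",)):
    """Run the exported model and return log-probabilities shaped [T, 31]."""
    import onnxruntime as ort  # imported lazily so the helpers work without it

    session = ort.InferenceSession(model_path, providers=list(providers))
    (logits,) = session.run(["logits"], build_inputs(waveform))
    return log_softmax(logits[0])
```

A call such as run_model("model.q8.onnx", waveform) uses the CPU. Passing ("CUDAExecutionProvider", "CPUExecutionProvider") as providers selects the GPU when the onnxruntime-gpu package is installed.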
📊 Output Example (JSON)
When using the companion inference script, the output follows this structure:
[
  {
    "word": "Knight",
    "start": 1.24,
    "end": 1.68,
    "score": 0.985
  },
  {
    "word": "Errant",
    "start": 1.72,
    "end": 2.15,
    "score": 0.962
  }
]
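To illustrate how frame-level alignment results could map onto this structure, the sketch below converts word spans measured in frames into records with times in seconds, using the 320-sample stride at 16 kHz. The function name, the tuple layout, and the exclusive end frame are assumptions of this example. The companion script's exact conversion is not documented here, so its values may differ slightly.

```python
import json

FRAME_SECONDS = 320 / 16_000  # 0.02 s per frame


def spans_to_records(spans):
    """Convert (word, start_frame, end_frame, score) tuples to output records.

    end_frame is treated as exclusive (an assumption of this sketch).
    """
    return [
        {
            "word": word,
            "start": round(start * FRAME_SECONDS, 2),
            "end": round(end * FRAME_SECONDS, 2),
            "score": round(score, 3),
        }
        for word, start, end, score in spans
    ]


# Hypothetical frame spans, not taken from a real alignment run.
spans = [("Knight", 62, 84, 0.985), ("Errant", 86, 108, 0.962)]
print(json.dumps(spans_to_records(spans), indent=2))
```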
Model tree for romara-labs/mms-300m-1130-forced-aligner-ONNX
Base model: MahmoudAshraf/mms-300m-1130-forced-aligner