MMS-300M-1130 Forced Aligner (ONNX)

Maintainer: Sancho Pança | Architect: Dom Quixote

🚀 Overview

High-performance ONNX export of the MahmoudAshraf/mms-300m-1130-forced-aligner. This model is a Wav2Vec2-based CTC engine fine-tuned specifically for Forced Alignment across 1130+ languages.

This repository provides an optimized inference path that removes the PyTorch dependency entirely, offering significant memory savings and deployment flexibility.

💎 Key Features

  • Precision Timestamps: Generates word-level start/end boundaries at 20 ms frame resolution.
  • Massive Language Support: 1130+ languages supported via romanized text input.
  • Low-Latency Inference: Optimized via ONNX Runtime for both CPU and NVIDIA GPU (CUDA).
  • Memory Efficiency: Significantly lower RAM footprint compared to the standard torchaudio forced alignment API.
  • Quantized Performance: Includes a q8 (INT8) variant (~340MB) for ultra-fast CPU processing.

🏗️ Technical Architecture

The model utilizes the Wav2Vec2ForCTC architecture:

  • Feature Extractor: 7-layer strided CNN.
  • Total Stride: 320 samples (equivalent to 20ms per frame at 16 kHz).
  • Encoder: 24-layer Transformer (Hidden Size: 1024, 16 Attention Heads).
  • CTC Vocabulary: 31 tokens (a–z, ', <blank>, <pad>, </s>, <unk>).
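Given the 320-sample total stride, frame indices map to timestamps with simple arithmetic. A minimal sketch (the exact frame count also depends on the CNN's per-layer kernel sizes, so the integer division below is an approximation for long inputs):

```python
# Sketch: mapping between raw 16 kHz samples and CTC output frames using the
# 320-sample total stride (20 ms per frame). The exact frame count also
# depends on the CNN's kernel sizes, so the division is an approximation.

SAMPLE_RATE = 16_000   # Hz, required input rate
TOTAL_STRIDE = 320     # input samples consumed per output frame

def approx_num_frames(num_samples: int) -> int:
    """Approximate number of CTC frames T for num_samples of audio."""
    return num_samples // TOTAL_STRIDE

def frame_to_seconds(frame_index: int) -> float:
    """Timestamp (seconds) of a CTC frame: 320 / 16000 = 0.02 s per frame."""
    return frame_index * TOTAL_STRIDE / SAMPLE_RATE

# 10 s of audio (160,000 samples) yields roughly 500 frames;
# frame 62 corresponds to the 1.24 s mark.
```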

📦 Model Components

| File | Description | Footprint |
| --- | --- | --- |
| `model.q8.onnx` | Recommended: INT8 quantized for fast CPU inference. | ~340 MB |
| `model.fp32.onnx` | Full-precision float32 for maximum numerical accuracy. | ~1.2 GB |
| `vocab.json` | Character-level token mapping (31 tokens). | |
| `config.json` | Model architecture configuration. | |
| `preprocessor_config.json` | Feature extractor normalization settings. | |

🛠️ Usage Specs

Audio Requirements

  • Sample Rate: 16,000 Hz (Required)
  • Channels: Mono
  • Normalization: Zero-mean, unit-variance.
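The requirements above can be applied with a short NumPy helper. A sketch assuming the audio is already sampled at 16 kHz (resampling is out of scope here) and that multi-channel input arrives as a `(channels, samples)` array:

```python
import numpy as np

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Prepare raw audio for the aligner: downmix to mono and apply
    zero-mean, unit-variance normalization. Assumes 16 kHz input and a
    (channels, samples) layout for stereo audio."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 2:          # (channels, samples) -> mono downmix
        waveform = waveform.mean(axis=0)
    # Zero-mean, unit-variance; the small epsilon guards against silence.
    waveform = (waveform - waveform.mean()) / np.sqrt(waveform.var() + 1e-7)
    return waveform[np.newaxis, :]  # shape [1, N], as the model expects
```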

Input Preprocessing

The model vocabulary is restricted to Latin characters. For non-Latin scripts (e.g., Arabic, Cyrillic, CJK), text must be romanized before alignment using the uroman tool.
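Before alignment, it can help to verify that a transcript actually fits the Latin-only vocabulary. A sketch; the character set below is assumed from the vocabulary listed above (a–z plus apostrophe), and whitespace is ignored since words are aligned separately:

```python
# Sketch: flag characters a transcript contains that fall outside the
# model's CTC vocabulary (a-z and apostrophe, per the vocab listed above).
# Non-Latin scripts should be romanized first, e.g. with uroman.

ALIGNER_CHARS = set("abcdefghijklmnopqrstuvwxyz'")

def out_of_vocab(text: str) -> set:
    """Return characters (ignoring whitespace) not covered by the vocabulary."""
    return {ch for ch in text.lower() if not ch.isspace()} - ALIGNER_CHARS

# out_of_vocab("knight errant") -> empty set: safe to align directly
# out_of_vocab("дон кихот")     -> Cyrillic characters: romanize first
```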

I/O Schema

Input:

| Input Name | Type | Shape | Description |
| --- | --- | --- | --- |
| `input_values` | float32 | [1, N] | Raw mono waveform samples. |
| `attention_mask` | int64 | [1, N] | All ones (every sample is valid audio). |

Output:

| Output Name | Type | Shape | Description |
| --- | --- | --- | --- |
| `logits` | float32 | [1, T, 31] | Raw CTC emission logits (apply log-softmax for log-probabilities). |
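The schema above can be exercised with ONNX Runtime. A sketch, assuming `model.q8.onnx` has been downloaded locally; the input/output names follow the tables above, and the provider choice is illustrative:

```python
import numpy as np

def build_inputs(waveform: np.ndarray) -> dict:
    """Build the ONNX input feed: [1, N] float32 samples plus an all-ones
    int64 attention mask of the same shape."""
    input_values = np.asarray(waveform, dtype=np.float32).reshape(1, -1)
    return {
        "input_values": input_values,
        "attention_mask": np.ones_like(input_values, dtype=np.int64),
    }

def run_aligner(waveform: np.ndarray, model_path: str = "model.q8.onnx") -> np.ndarray:
    """Return CTC logits of shape [1, T, 31] for a 16 kHz mono waveform."""
    import onnxruntime as ort  # pip install onnxruntime (or onnxruntime-gpu)
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (logits,) = session.run(["logits"], build_inputs(waveform))
    return logits
```

Swap in `"CUDAExecutionProvider"` (with `onnxruntime-gpu` installed) for NVIDIA GPU inference.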

📊 Output Example (JSON)

When using the companion inference script, the output follows this structure:

```json
[
  {
    "word": "Knight",
    "start": 1.24,
    "end": 1.68,
    "score": 0.985
  },
  {
    "word": "Errant",
    "start": 1.72,
    "end": 2.15,
    "score": 0.962
  }
]
```
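The structure above is plain JSON, so downstream code can consume it directly. A small sketch deriving per-word durations from the example:

```python
import json

# The example alignment output from above, parsed as plain JSON.
alignment = json.loads("""[
  {"word": "Knight", "start": 1.24, "end": 1.68, "score": 0.985},
  {"word": "Errant", "start": 1.72, "end": 2.15, "score": 0.962}
]""")

# Per-word durations in seconds, rounded to the 10 ms level.
durations = {w["word"]: round(w["end"] - w["start"], 2) for w in alignment}
# -> {'Knight': 0.44, 'Errant': 0.43}
```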