MMS-300M-1130 Forced Aligner (ONNX)

Maintainer: Sancho Pança | Architect: Dom Quixote

🚀 Overview

High-performance ONNX export of the MahmoudAshraf/mms-300m-1130-forced-aligner. This model is a Wav2Vec2-based CTC engine fine-tuned specifically for Forced Alignment across 1130+ languages.

This repository provides an optimized inference path that removes the PyTorch dependency entirely, offering significant memory savings and deployment flexibility.

💎 Key Features

  • Precision Timestamps: Generates word-level start/end boundaries at 20 ms frame resolution.
  • Massive Language Support: 1130+ languages supported via romanized text input.
  • Low-Latency Inference: Optimized via ONNX Runtime for both CPU and NVIDIA GPU (CUDA).
  • Memory Efficiency: Significantly lower RAM footprint compared to the standard torchaudio forced alignment API.
  • Quantized Performance: Includes a q8 (INT8) variant (~340MB) for ultra-fast CPU processing.

🏗️ Technical Architecture

The model utilizes the Wav2Vec2ForCTC architecture:

  • Feature Extractor: 7-layer strided CNN.
  • Total Stride: 320 samples (equivalent to 20ms per frame at 16 kHz).
  • Encoder: 24-layer Transformer (Hidden Size: 1024, 16 Attention Heads).
  • CTC Vocabulary: 31 tokens (a–z, ', <blank>, <pad>, </s>, <unk>).
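Given the 320-sample total stride, frame indices map to timestamps with simple arithmetic. A minimal sketch (the exact frame count also depends on the CNN's per-layer kernel sizes, so the integer division below is an approximation for long inputs):

```python
# Sketch: mapping between raw 16 kHz samples and CTC output frames using the
# 320-sample total stride (20 ms per frame). The exact frame count also
# depends on the CNN's kernel sizes, so the division is an approximation.

SAMPLE_RATE = 16_000   # Hz, required input rate
TOTAL_STRIDE = 320     # input samples consumed per output frame

def approx_num_frames(num_samples: int) -> int:
    """Approximate number of CTC frames T for num_samples of audio."""
    return num_samples // TOTAL_STRIDE

def frame_to_seconds(frame_index: int) -> float:
    """Timestamp (seconds) of a CTC frame: 320 / 16000 = 0.02 s per frame."""
    return frame_index * TOTAL_STRIDE / SAMPLE_RATE

# 10 s of audio (160,000 samples) yields roughly 500 frames;
# frame 62 corresponds to the 1.24 s mark.
```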

📦 Model Components

| File | Description | Footprint |
| --- | --- | --- |
| `model.q8.onnx` | Recommended: INT8 quantized for fast CPU inference. | ~340 MB |
| `model.fp32.onnx` | Full-precision float32 for maximum numerical accuracy. | ~1.2 GB |
| `vocab.json` | Character-level token mapping (31 tokens). | |
| `config.json` | Model architecture configuration. | |
| `preprocessor_config.json` | Feature extractor normalization settings. | |

🛠️ Usage Specs

Audio Requirements

  • Sample Rate: 16,000 Hz (Required)
  • Channels: Mono
  • Normalization: Zero-mean, unit-variance.
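The requirements above can be applied with a short NumPy helper. A sketch assuming the audio is already sampled at 16 kHz (resampling is out of scope here) and that multi-channel input arrives as a `(channels, samples)` array:

```python
import numpy as np

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Prepare raw audio for the aligner: downmix to mono and apply
    zero-mean, unit-variance normalization. Assumes 16 kHz input and a
    (channels, samples) layout for stereo audio."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 2:          # (channels, samples) -> mono downmix
        waveform = waveform.mean(axis=0)
    # Zero-mean, unit-variance; the small epsilon guards against silence.
    waveform = (waveform - waveform.mean()) / np.sqrt(waveform.var() + 1e-7)
    return waveform[np.newaxis, :]  # shape [1, N], as the model expects
```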

Input Preprocessing

The model vocabulary is restricted to Latin characters. For non-Latin scripts (e.g., Arabic, Cyrillic, CJK), text must be romanized before alignment using the uroman tool.
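Before alignment, it can help to verify that a transcript actually fits the Latin-only vocabulary. A sketch; the character set below is assumed from the vocabulary listed above (a–z plus apostrophe), and whitespace is ignored since words are aligned separately:

```python
# Sketch: flag characters a transcript contains that fall outside the
# model's CTC vocabulary (a-z and apostrophe, per the vocab listed above).
# Non-Latin scripts should be romanized first, e.g. with uroman.

ALIGNER_CHARS = set("abcdefghijklmnopqrstuvwxyz'")

def out_of_vocab(text: str) -> set:
    """Return characters (ignoring whitespace) not covered by the vocabulary."""
    return {ch for ch in text.lower() if not ch.isspace()} - ALIGNER_CHARS

# out_of_vocab("knight errant") -> empty set: safe to align directly
# out_of_vocab("дон кихот")     -> Cyrillic characters: romanize first
```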

I/O Schema

Input:

| Input Name | Type | Shape | Description |
| --- | --- | --- | --- |
| `input_values` | float32 | [1, N] | Raw mono waveform samples. |
| `attention_mask` | int64 | [1, N] | All ones (every sample is valid audio). |

Output:

| Output Name | Type | Shape | Description |
| --- | --- | --- | --- |
| `logits` | float32 | [1, T, 31] | Raw CTC emission logits (apply log-softmax for log-probabilities). |
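The schema above can be exercised with ONNX Runtime. A sketch, assuming `model.q8.onnx` has been downloaded locally; the input/output names follow the tables above, and the provider choice is illustrative:

```python
import numpy as np

def build_inputs(waveform: np.ndarray) -> dict:
    """Build the ONNX input feed: [1, N] float32 samples plus an all-ones
    int64 attention mask of the same shape."""
    input_values = np.asarray(waveform, dtype=np.float32).reshape(1, -1)
    return {
        "input_values": input_values,
        "attention_mask": np.ones_like(input_values, dtype=np.int64),
    }

def run_aligner(waveform: np.ndarray, model_path: str = "model.q8.onnx") -> np.ndarray:
    """Return CTC logits of shape [1, T, 31] for a 16 kHz mono waveform."""
    import onnxruntime as ort  # pip install onnxruntime (or onnxruntime-gpu)
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (logits,) = session.run(["logits"], build_inputs(waveform))
    return logits
```

Swap in `"CUDAExecutionProvider"` (with `onnxruntime-gpu` installed) for NVIDIA GPU inference.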

📊 Output Example (JSON)

When using the companion inference script, the output follows this structure:

```json
[
  {
    "word": "Knight",
    "start": 1.24,
    "end": 1.68,
    "score": 0.985
  },
  {
    "word": "Errant",
    "start": 1.72,
    "end": 2.15,
    "score": 0.962
  }
]
```
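The structure above is plain JSON, so downstream code can consume it directly. A small sketch deriving per-word durations from the example:

```python
import json

# The example alignment output from above, parsed as plain JSON.
alignment = json.loads("""[
  {"word": "Knight", "start": 1.24, "end": 1.68, "score": 0.985},
  {"word": "Errant", "start": 1.72, "end": 2.15, "score": 0.962}
]""")

# Per-word durations in seconds, rounded to the 10 ms level.
durations = {w["word"]: round(w["end"] - w["start"], 2) for w in alignment}
# -> {'Knight': 0.44, 'Errant': 0.43}
```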