Vogent-Turn-80M (ONNX, re-exported)

This is a clean re-export of vogent/Vogent-Turn-80M to ONNX, produced from the upstream PyTorch weights with proper dynamic_axes annotations and the input dtypes the underlying PyTorch model uses natively.

Weights are byte-equivalent to the upstream model; only the ONNX graph signature differs.

Why this re-export exists

The official onnx-fp16/whisper-smol-lm-smaller-fp16.onnx shipped on the upstream repo was traced with a dummy input where text_len = 1 (a single-token prompt). That trace baked the size-1 dimension into many intermediate value_info entries, and ORT's CUDA execution provider then emitted a warning on every inference call:

Shape mismatch attempting to re-use buffer. {1,401} != {1,1}.
Validate usage of dim_value (values should be > 0) and dim_param
(all values with the same string should equate to the same size)
in shapes in the model.

(The 401 is attention_mask's runtime length: AUDIO_TOKENS (400) + text_len. The 1 is the dummy-trace residue.)
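The two shapes in the warning follow directly from that arithmetic. A minimal sketch, assuming only the constant stated above; the function name is hypothetical and is not part of the upstream package:

```python
AUDIO_TOKENS = 400  # fixed audio-token count stated in the note above


def attention_mask_len(text_len: int, audio_tokens: int = AUDIO_TOKENS) -> int:
    """Runtime attention_mask length: audio tokens plus text tokens."""
    if text_len < 1:
        raise ValueError("text_len must be at least 1")
    return audio_tokens + text_len


# Shape ORT sees at run time for a single-token prompt.
runtime_shape = (1, attention_mask_len(1))
# Size-1 residue that the dummy trace baked into the upstream graph.
traced_shape = (1, 1)
print(runtime_shape, traced_shape)
```

Any longer prompt moves the runtime shape further from the traced one, which is why the mismatch shows up on every call rather than only for unusual inputs.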

The original export also used int32 for input_ids and attention_mask, despite the torch model using int64 natively. Most loaders work around that, but it's an extra cast on every call.
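For callers whose tokenizer pipeline still produces int32 arrays (as the upstream graph required), a small conversion step is enough to feed this re-export. The helper below is a hypothetical sketch, not part of any shipped script; it assumes 2-D [batch, length] arrays and the input names used in this README:

```python
import numpy as np


def as_int64_inputs(input_ids, attention_mask):
    """Return a feed dict with both text inputs as contiguous int64 arrays."""
    ids = np.ascontiguousarray(input_ids, dtype=np.int64)
    mask = np.ascontiguousarray(attention_mask, dtype=np.int64)
    if ids.ndim != 2 or mask.ndim != 2:
        raise ValueError("expected 2-D [batch, length] arrays")
    return {"input_ids": ids, "attention_mask": mask}


# int32 arrays, as a loader written for the upstream graph might produce them.
feeds = as_int64_inputs(
    np.array([[1]], dtype=np.int32),
    np.ones((1, 401), dtype=np.int32),
)
print(feeds["input_ids"].dtype, feeds["attention_mask"].shape)
```

Arrays that are already int64 and contiguous pass through without a copy, so the helper costs nothing once the pipeline produces int64 directly.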

What changed

                                    | Upstream onnx-fp16/...                         | This re-export
------------------------------------+------------------------------------------------+---------------------------------------------------------
input_ids dtype                     | int32                                          | int64 (matches torch reference)
attention_mask dtype                | int32                                          | int64
audio_features dtype                | fp16 (fp16 graph) / fp32 (fp32 graph)          | fp32 for both; graph casts internally for the fp16 variant
dynamic_axes annotations            | dummy-traced (size-1 baked into intermediates) | symbolic sequence_length, num_frames everywhere variable
Per-call ORT Shape mismatch warning | yes (2× per inference call under CUDA EP)      | none
Logits output dtype                 | fp16 (fp16 graph) / fp32 (fp32 graph)          | unchanged
Weights                             | byte-identical                                 | byte-identical
Graph topology / accuracy           | reference                                      | identical

Files

onnx-fp32/whisper-smol-lm-smaller.onnx      # full-precision graph
onnx-fp16/whisper-smol-lm-smaller-fp16.onnx # half-precision graph (audio_features still fp32-in)
config.json                                  # WhisperSmolLMClassifierConfig (unchanged)
export/export_vogent_turn_onnx.py            # re-export script
export/fetch_vogent_turn_bundle.py           # original-bundle fetch script
LICENSE                                      # upstream Vogent license (carried forward)

Quick usage (Python + onnxruntime)

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx-fp16/whisper-smol-lm-smaller-fp16.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Whisper-tiny mel: 80 mels × 800 frames (16 kHz audio, hop=160, 8 s of audio).
audio_features = np.random.randn(1, 80, 800).astype(np.float32)
input_ids      = np.array([[1]], dtype=np.int64)        # SmolLM <|im_start|> fallback
attention_mask = np.ones((1, 400 + 1), dtype=np.int64)  # 400 audio tokens + text_len

logits = sess.run(None, {
    "input_ids":     input_ids,
    "attention_mask": attention_mask,
    "audio_features": audio_features,
})[0]
# logits is fp16 [1, 2]; softmax to get (p_continue, p_endpoint).
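
The softmax step mentioned in the last comment can be sketched as follows. The function name is hypothetical, and the class order (p_continue, p_endpoint) is taken from the comment above. The logits are upcast to fp32 first so the exponentials are not computed in half precision:

```python
import numpy as np


def turn_probabilities(logits):
    """Numerically stable softmax over the last axis, computed in fp32."""
    x = np.asarray(logits, dtype=np.float32)
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max before exp
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


# Stand-in for the fp16 [1, 2] logits returned by sess.run(...).
example_logits = np.array([[0.5, 2.0]], dtype=np.float16)
probs = turn_probabilities(example_logits)
p_continue, p_endpoint = probs[0]
print(float(p_continue), float(p_endpoint))
```

The threshold on p_endpoint is left to the application; this README does not specify one.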

For the chat-template / preprocessing logic, see the upstream vogent_turn repo: https://github.com/vogent/vogent-turn.

Reproducing

# 1. Fetch the original upstream bundle (gated, needs HF_TOKEN).
python export/fetch_vogent_turn_bundle.py --out models/vogent-turn-80m

# 2. Clone the official inference package (provides the model class).
git clone --depth 1 https://github.com/vogent/vogent-turn /tmp/vogent-turn-src

# 3. Re-export. Replaces onnx-fp32/* and onnx-fp16/* (.bak kept).
python export/export_vogent_turn_onnx.py \
    --model-dir models/vogent-turn-80m \
    --vogent-turn-src /tmp/vogent-turn-src

Tested with torch==2.12, transformers==4.57, onnx==1.20, onnxruntime==1.20+cuda12. The export patches transformers.masking_utils.{sdpa_mask,eager_mask} to vmap-free equivalents because the legacy torch.onnx exporter can't trace through transformers' 4-deep torch.func.vmap causal-mask builder.
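
To illustrate what "vmap-free" means here, the sketch below builds a causal mask with plain index broadcasting, so there are no nested function transforms for a tracer to follow. It is an illustration in NumPy under assumed conventions (boolean mask, True means "may attend", shape [batch, 1, q_len, kv_len]); it is not the patch used by export_vogent_turn_onnx.py, and it does not reproduce the signatures of the transformers functions being patched:

```python
import numpy as np


def causal_mask(q_len, kv_len, padding_mask=None):
    """Boolean attention mask of shape [batch, 1, q_len, kv_len].

    Query i may attend key j when j <= i + (kv_len - q_len). An optional
    [batch, kv_len] padding mask (1 = keep, 0 = pad) is AND-ed in.
    """
    q_idx = np.arange(q_len)[:, None] + (kv_len - q_len)  # [q_len, 1]
    kv_idx = np.arange(kv_len)[None, :]                   # [1, kv_len]
    mask = (kv_idx <= q_idx)[None, None, :, :]            # [1, 1, q, kv]
    if padding_mask is not None:
        keep = np.asarray(padding_mask).astype(bool)[:, None, None, :]
        mask = mask & keep                                # broadcast over batch
    return mask


full = causal_mask(3, 3)
padded = causal_mask(3, 3, padding_mask=np.array([[1, 1, 0]]))
print(full[0, 0].astype(int))
print(padded[0, 0].astype(int))
```

The same comparison of two arange tensors carries over to torch, where it exports as ordinary elementwise ops.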

License & gating

The upstream license (modified Apache-2.0) carries through. Per Vogent's gating terms, if you use this model as part of a horizontal voice-agent platform, you must not set Vogent-Turn-80M as the default turn-detector option, and users must explicitly select "Vogent Turn Detector" to enable it. See LICENSE.

Credits
