Contextual VAD

contextual-vad is a small scikit-learn classifier that predicts higher-level voice-agent events from derived VAD, STT, TTS, and dialogue-state features.

It is designed to sit on top of a streaming STT + LLM + TTS pipeline:

audio VAD + streaming STT partials + assistant/TTS state + dialogue context
  -> contextual-vad
  -> event probabilities
  -> deterministic turn-taking policy

It produces probabilities for:

  • listening
  • speech_started
  • endpoint_candidate
  • turn_committed
  • user_resumed
  • interruption_started
  • interruption_confirmed
  • backchannel_detected
  • false_alarm
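
As a rough illustration of the last pipeline stage, the sketch below maps a dictionary of event probabilities to a single action. The action names (`stop_tts`, `keep_speaking`, `send_to_llm`, `wait`) and every threshold value are hypothetical; they are not shipped with this model and would need tuning on real call logs.

```python
from typing import Dict

# Hypothetical thresholds, not part of the released artifact.
THRESHOLDS = {
    "interruption_confirmed": 0.7,
    "backchannel_detected": 0.6,
    "turn_committed": 0.8,
}


def decide_action(probs: Dict[str, float]) -> str:
    """Map event probabilities to one hypothetical policy action.

    Rules are checked in a fixed priority order so the same input
    always yields the same action.
    """
    def p(event: str) -> float:
        # Missing events are treated as probability 0.0.
        return probs.get(event, 0.0)

    # A confirmed barge-in wins over everything else.
    if p("interruption_confirmed") >= THRESHOLDS["interruption_confirmed"]:
        return "stop_tts"
    # A backchannel ("mm-hm") should not end the assistant's turn.
    if p("backchannel_detected") >= THRESHOLDS["backchannel_detected"]:
        return "keep_speaking"
    # The user has finished: hand the transcript to the LLM.
    if p("turn_committed") >= THRESHOLDS["turn_committed"]:
        return "send_to_llm"
    # Otherwise keep listening and wait for the next frame.
    return "wait"


if __name__ == "__main__":
    print(decide_action({"turn_committed": 0.85, "listening": 0.1}))
```

Keeping the policy as ordered threshold checks makes its behavior easy to audit and to change without retraining the classifier.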

Training source: synthetic_bootstrap:n=12000:seed=7

Validation accuracy: 0.9938

Intended Use

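Use this as a bootstrap policy model: a starting point for wiring up and testing turn-taking logic. The released artifact was trained on synthetic bootstrap data, so retrain it on real call-frame logs before production use.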

The model is meant to complement, not replace, acoustic VAD:

  • acoustic VAD provides fast speech-start/speech-stop signals
  • STT partials provide semantic hints about whether a thought is complete
  • TTS/assistant state helps distinguish real user barge-in from echo or backchannels
  • this model estimates the event probabilities that a state machine can turn into product behavior

Features

Input features include:

  • VAD probability, speech duration, silence duration, and energy
  • STT confidence, stable transcript length, partial transcript length, and word counts
  • semantic hints such as continuation endings and whether required slots are filled
  • assistant/TTS state, including whether the assistant is speaking and estimated echo risk
  • dialogue context such as expected answer type
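
The sketch below shows one way a per-frame feature row could be assembled from these signal groups. All feature names, defaults, and the continuation-word list are assumptions made for illustration; the actual names and defaults are defined in feature_schema.json.

```python
from typing import Any, Dict

# Hypothetical feature names and defaults (see feature_schema.json for the real ones).
FEATURE_DEFAULTS: Dict[str, Any] = {
    "vad_probability": 0.0,
    "speech_duration_ms": 0.0,
    "silence_duration_ms": 0.0,
    "energy": 0.0,
    "stt_confidence": 0.0,
    "stable_transcript_chars": 0,
    "partial_transcript_chars": 0,
    "word_count": 0,
    "ends_with_continuation": 0,
    "required_slots_filled": 0,
    "assistant_speaking": 0,
    "echo_risk": 0.0,
}

# Hypothetical list of words that suggest the speaker is not finished.
CONTINUATION_WORDS = {"and", "but", "so", "because", "um", "uh"}


def ends_with_continuation(partial: str) -> int:
    """Return 1 if the partial transcript ends on a continuation word."""
    words = partial.lower().strip(" .,!?").split()
    return int(bool(words) and words[-1] in CONTINUATION_WORDS)


def build_feature_row(frame: Dict[str, Any]) -> Dict[str, Any]:
    """Fill missing features with defaults and reject unknown names."""
    unknown = set(frame) - set(FEATURE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    row = dict(FEATURE_DEFAULTS)
    row.update(frame)
    return row


if __name__ == "__main__":
    partial = "I want to book a flight and"
    print(build_feature_row({
        "vad_probability": 0.2,
        "silence_duration_ms": 350.0,
        "word_count": len(partial.split()),
        "ends_with_continuation": ends_with_continuation(partial),
    }))
```

Rejecting unknown feature names early helps catch drift between the logging code and the schema the model was trained on.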

Research Reference

This model is inspired by Voice Activity Projection (VAP):

Erik Ekstedt and Gabriel Skantze, "Voice Activity Projection: Self-supervised Learning of Turn-taking Events", Interspeech 2022.

VAP's key idea is to predict future voice activity and turn-taking events instead of relying only on current speech/non-speech detection. contextual-vad applies that design direction to a practical STT + LLM + TTS wrapper using lightweight tabular features.

Files

  • turn_event_model.joblib: scikit-learn pipeline.
  • training_data.csv: training rows used for this artifact.
  • feature_schema.json: feature names and defaults.
  • metrics.json: validation metrics.
  • example_payload.json: one valid inference payload.
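
A minimal sketch of scoring the example payload with these files follows. It assumes that example_payload.json is a flat JSON object keyed by feature name and that the pipeline accepts a one-row pandas DataFrame; neither is confirmed by this card, so check feature_schema.json before relying on it. The helper names `label_probabilities` and `score_example` are illustrative.

```python
import json
from pathlib import Path
from typing import Dict, Sequence


def label_probabilities(classes: Sequence[str], probs: Sequence[float]) -> Dict[str, float]:
    """Pair each class label with its predicted probability."""
    if len(classes) != len(probs):
        raise ValueError("classes and probs must have the same length")
    return {str(c): float(p) for c, p in zip(classes, probs)}


def score_example(artifact_dir: str) -> Dict[str, float]:
    """Load the pipeline and score example_payload.json.

    Assumption: the payload is a flat {feature_name: value} object.
    """
    # Imported here so the helper above stays usable without these packages.
    import joblib
    import pandas as pd

    root = Path(artifact_dir)
    # joblib files are pickle-based: only load artifacts you trust.
    model = joblib.load(root / "turn_event_model.joblib")
    payload = json.loads((root / "example_payload.json").read_text())

    frame = pd.DataFrame([payload])  # one row, columns named by feature
    probs = model.predict_proba(frame)[0]
    return label_probabilities(model.classes_, probs)
```

Calling `score_example(".")` from the directory that holds the downloaded files should return one probability per event label, provided the assumptions above hold.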