Comparing Approaches for End-of-Speech Detection in Voice AI


While building RapidaAI, one of the biggest challenges we are navigating is figuring out when a user has actually finished speaking — especially across different languages, accents, and natural speaking styles.

Get it wrong, and the conversation feels off: interrupt too soon, and the user has to repeat themselves; wait too long, and the chat feels slow and clunky. In a real-time voice system, even a 300 ms delay can make the interaction feel broken. For enterprise-grade Voice AI orchestration, this timing precision directly affects conversion rates, call durations, and overall experience.

For example:

  • A user giving a phone number in Hindi or Indonesian: “Mera number hai 965-22-4660…” (“My number is 965-22-4660…”)
  • A user giving an email: “You can reach me at prash...ant[at]...rapidaai.. uhh.. rapida.ai”
  • A user giving flight details in multiple parts: “I’m... flying from San Francisco… arriving in New York on Tuesday…”

At Rapida, we’re actively tackling these situations, and it becomes even trickier in multilingual or code-switched conversations. Getting End-of-Speech (EOS) detection right isn’t just a UX nicety; it’s foundational infrastructure for natural, low-latency conversations across phone, web, and embedded voice interfaces.

In this post, we’ll look at the main approaches we’ve tried, what we’ve learned, and how we’re improving RapidaAI’s multimodal EOS engine to make conversations smoother and more human across environments.

Voice Activity Detection (VAD)

Voice Activity Detection (VAD) is one of the oldest ways to detect when someone stops speaking. It listens for silence and decides if the turn has ended—usually after a fixed gap like 500–700 ms.

We tried both static and dynamic time frames, where the system waits longer if a user tends to pause more often. But in real phone-call scenarios, every millisecond matters. A short delay makes the system feel slow; reacting too early cuts off the user mid-sentence.
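
To make the mechanics concrete, here’s a minimal sketch of that silence-timeout logic, assuming a frame-level VAD such as the open-source webrtcvad package and 20 ms frames of 16 kHz, 16-bit mono PCM; the dynamic-window heuristic is illustrative, not our production logic.

```python
# Minimal sketch of silence-timeout EOS detection with a frame-level VAD.
# Assumes the `webrtcvad` package and 20 ms frames of 16 kHz, 16-bit mono PCM.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20

class SilenceEOSDetector:
    def __init__(self, base_silence_ms=600, dynamic=False):
        self.vad = webrtcvad.Vad(2)          # aggressiveness 0..3
        self.base_silence_ms = base_silence_ms
        self.dynamic = dynamic
        self.silence_ms = 0                  # consecutive silence so far
        self.pause_history = []              # observed intra-turn pauses (ms)

    def _threshold(self):
        if self.dynamic and self.pause_history:
            # Wait a bit longer for users who pause a lot mid-sentence.
            avg_pause = sum(self.pause_history) / len(self.pause_history)
            return max(self.base_silence_ms, 1.5 * avg_pause)
        return self.base_silence_ms

    def push_frame(self, frame_bytes) -> bool:
        """Feed one 20 ms PCM frame; returns True when the turn looks finished."""
        if self.vad.is_speech(frame_bytes, SAMPLE_RATE):
            if self.silence_ms:
                self.pause_history.append(self.silence_ms)
            self.silence_ms = 0
            return False
        self.silence_ms += FRAME_MS
        return self.silence_ms >= self._threshold()
```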

| Scenario | Language | Static / Dynamic VAD Result |
| --- | --- | --- |
| Phone number (“Mera number hai 987-654…”) | Hindi | Cut off mid-speech |
| Email spelling (“a-l-i-c-e @ gmail…”) | Indonesian | Early trigger |
| Short query (“What’s the time?”) | English | Works fine |
| Background noise | Hindi | Occasional false triggers |


VAD works for clean, short utterances—but in live calls where timing precision drives user experience, it simply isn’t reliable enough.

Transformer-Based End-of-Speech Models

Text-only transformer EOS models, like LiveKit Turn Detector, work well for standard English sentences, but fail in real-world phone calls when users provide structured or mixed-language information, or when transcription quality drops.

For our benchmarks, we kept only Deepgram transcript tokens with confidence ≥ 0.5, so only words with sufficient STT confidence reach the EOS model. This reflects realistic production scenarios where low-confidence transcriptions are ignored or flagged.

LiveKit combines VAD signals with a small language model (e.g., SmolLM2-135M for English or Qwen2.5-0.5B for multilingual).
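
The snippet below sketches how the confidence gate and a text-only classifier fit together. The `eou_probability` callback is a hypothetical stand-in for whichever end-of-utterance model you run (LiveKit’s detector, or your own), and the word-level structure mirrors Deepgram-style STT output; none of this is the real API of either product.

```python
# Sketch: gate STT output on confidence before a text-only EOS model.
CONFIDENCE_FLOOR = 0.5
EOU_THRESHOLD = 0.8

def filtered_transcript(words):
    """Keep only words the STT engine is reasonably sure about."""
    return " ".join(w["word"] for w in words if w["confidence"] >= CONFIDENCE_FLOOR)

def is_end_of_turn(chat_history, words, eou_probability):
    text = filtered_transcript(words)
    if not text:
        return False  # nothing trustworthy transcribed yet
    # The classifier sees prior turns plus the current partial transcript.
    prob = eou_probability(chat_history + [{"role": "user", "content": text}])
    return prob >= EOU_THRESHOLD
```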

Benchmark: Common Failure Scenarios

| Scenario | Language | Turn Length (s) | Context Window (tokens) | Accuracy | Avg. Latency | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Short English sentence | English | 2.8 | 300 | 98.7% | 18 ms | High STT confidence |
| Multi-clause reasoning | English | 4.3 | 500 | 95.8% | 33 ms | Minor hesitation handled |
| Phone number (“987-654-3210”) | Hindi-English | 5.0 | 550 | 72.4% | 47 ms | Misinterpreted digits, early cut-off |
| Email (“prashant@rapidaai.com”) | Indonesian-English | 6.2 | 600 | 69.1% | 58 ms | Symbols misrecognized, code-switching |
| Credit card / account numbers | Hindi-English | 7.0 | 600 | 65.3% | 60 ms | Numeric sequences split into multiple turns |
| Spelled words / alphanumeric codes | English + local | 6.5 | 550 | 68.7% | 55 ms | Letters interpreted as separate tokens |
| Long enumerations (“I have apples, bananas, mangoes…”) | English | 5.8 | 500 | 71.2% | 52 ms | Pauses between items misread as EOS |
| Background noise + mixed language | Hindi-English | 6.0 | 600 | 64.9% | 57 ms | STT errors cause early or late triggers |

Text-only EOS models like the LiveKit Turn Detector perform reliably for English, but fail with structured information or code-switched speech.

Multimodal Models (Audio + Text)

The recently released VoTurn-80M combines the most recent audio (~8 s) with past text context to detect end of speech. Audio and text embeddings are fed into an ablated small language model (the first 12 layers of SmolLM2-135M, ~80M parameters) with a linear classification head.

Example:

  • Previous line: “What is your phone number”
  • Current line (audio transcript, punctuation omitted): “987 654 3210”
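
To illustrate the fusion idea, here is a rough PyTorch sketch of an audio + text EOS head in the same spirit. The layer sizes, the audio projection, and the Hugging Face-style `inputs_embeds` interface are assumptions for illustration, not VoTurn-80M’s actual implementation.

```python
# Rough sketch of fusing an audio embedding with text context for EOS detection.
# Dimensions and the underlying text LM are placeholders, not VoTurn-80M itself.
import torch
import torch.nn as nn

class MultimodalEOSHead(nn.Module):
    def __init__(self, text_lm, audio_dim=512, hidden_dim=576):
        super().__init__()
        self.text_lm = text_lm                               # truncated small LM
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)   # map audio into LM space
        self.classifier = nn.Linear(hidden_dim, 1)           # linear end-of-speech head

    def forward(self, text_embeds, audio_embed):
        # Prepend the projected ~8 s audio embedding as an extra "token"
        # in front of the past-text context, then classify the final state.
        audio_token = self.audio_proj(audio_embed).unsqueeze(1)   # (B, 1, H)
        fused = torch.cat([audio_token, text_embeds], dim=1)      # (B, 1+T, H)
        hidden = self.text_lm(inputs_embeds=fused).last_hidden_state
        return torch.sigmoid(self.classifier(hidden[:, -1]))      # P(end of speech)
```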

Benchmark (phone-call scenarios, Deepgram confidence ≥0.5)

| Scenario | Language | Turn Length | Accuracy | Latency |
| --- | --- | --- | --- | --- |
| Phone number | Hindi + English | 5 s | 90% | 28 ms |
| Email | Indonesian + English | 6 s | 88% | 32 ms |
| Code-switched reasoning | Hindi + English | 5.8 s | 92% | 30 ms |
| Non-English (pure Hindi/Indonesian) | Local | 5 s | 65% | 35 ms |
| Short English query | English | 1.7 s | 97% | 22 ms |

  • Multimodal EOS fixes many of the English + structured-information failures (numbers, emails).
  • Performance drops sharply for pure local languages or low-resource code-switched turns.
  • Latency is higher than text-only, but still good enough for real-time calls.

End-of-Speech Detection Benchmark

| Approach | Scenario | Language | Turn Length | Context (tokens / audio) | Accuracy | Avg. Latency |
| --- | --- | --- | --- | --- | --- | --- |
| VAD (time-frame based) | Short English sentence | English | 2.8 s | None | 85% | 5 ms |
| VAD | Multi-clause reasoning | English | 4.3 s | None | 70% | 5 ms |
| VAD | Phone number | Hindi | 5 s | None | 50% | 5 ms |
| VAD | Phone number | Indonesian | 5 s | None | 52% | 5 ms |
| VAD | Email | Hindi | 6 s | None | 48% | 5 ms |
| VAD | Email | Indonesian | 6 s | None | 50% | 5 ms |
| Transformer-only / LiveKit | Short English sentence | English | 2.8 s | 300 tokens | 98.7% | 18 ms |
| Transformer-only | Multi-clause reasoning | English | 4.3 s | 500 tokens | 95.8% | 33 ms |
| Transformer-only | Phone number | Hindi | 5 s | 550 tokens | 72% | 47 ms |
| Transformer-only | Phone number | Indonesian | 5 s | 550 tokens | 70% | 48 ms |
| Transformer-only | Email | Hindi | 6 s | 600 tokens | 69% | 58 ms |
| Transformer-only | Email | Indonesian | 6 s | 600 tokens | 68% | 59 ms |
| Multimodal / VoTurn-80M | Phone number | Hindi | 5 s | 550 tokens + 8 s audio | 90% | 28 ms |
| Multimodal | Phone number | Indonesian | 5 s | 550 tokens + 8 s audio | 88% | 30 ms |
| Multimodal | Email | Hindi | 6 s | 600 tokens + 8 s audio | 88% | 32 ms |
| Multimodal | Email | Indonesian | 6 s | 600 tokens + 8 s audio | 87% | 33 ms |
| Multimodal | Code-switched reasoning | Hindi + English | 5.8 s | 550 tokens + 8 s audio | 92% | 30 ms |
| Multimodal | Short English query | English | 1.7 s | 300 tokens + 3 s audio | 97% | 22 ms |

At Rapida, we’re experimenting with a multimodal approach that blends audio, text, and contextual cues to handle real-world voice interactions better. While multimodal models like VoTurn-80M show clear gains, we’re also keeping the transformer-based path active, evolving a hybrid transformer + VAD pipeline that works consistently across languages, accents, and network conditions, all while running fully on-prem or within private clouds with no data egress for regulated deployments.
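
As a sketch of what that hybrid decision logic can look like, the function below fuses a fast VAD silence timer with the model’s EOS probability; the thresholds and the interpolation rule are placeholders, not the values we ship.

```python
# Illustrative fusion of VAD silence and model confidence in a hybrid pipeline.
def hybrid_should_end_turn(silence_ms, model_eos_prob,
                           min_wait_ms=200, max_wait_ms=800):
    # Confident model + a short silence: commit early for snappy turn-taking.
    if model_eos_prob >= 0.9 and silence_ms >= min_wait_ms:
        return True
    # Model unsure (mid-utterance number, email, code-switch): wait longer.
    if model_eos_prob <= 0.3:
        return silence_ms >= max_wait_ms
    # Otherwise scale the silence budget down as model confidence rises.
    budget = max_wait_ms - (max_wait_ms - min_wait_ms) * model_eos_prob
    return silence_ms >= budget
```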

This is a small but vital piece in making real-time voice AI infrastructure production-grade — fast, private, and truly multilingual.

If you’re building in this space, whether in CPaaS, contact center AI, or on-device assistants, we’d love to exchange notes or benchmark performance.