Comparing Approaches for End-of-Speech Detection in Voice AI
While building RapidaAI, one of the biggest challenges we are navigating is figuring out when a user has actually finished speaking — especially across different languages, accents, and natural speaking styles.
Get it wrong, and the conversation feels off: interrupt too soon, and the user has to repeat themselves; wait too long, and the chat feels slow and clunky. In a real-time voice system, even a 300 ms delay can make the interaction feel broken. For enterprise-grade Voice AI orchestration, this timing precision directly affects conversion rates, call durations, and overall experience.
For example:
- A user giving a phone number in Hindi or Indonesian: “Mera number hai 965-22-4660…”
- A user giving an email: “You can reach me at prash...ant[at]...rapidaai.. uhh.. rapida.ai”
- A user giving flight details in multiple parts: “I’m... flying from San Francisco… arriving in New York on Tuesday…”
At Rapida, we’re actively tackling these situations, and it becomes even trickier in multilingual or code-switched conversations. Getting End-of-Speech (EOS) detection right isn’t just a UX nicety; it’s foundational infrastructure for natural, low-latency conversations across phone, web, and embedded voice interfaces.
In this post, we’ll look at the main approaches we’ve tried, what we’ve learned, and how we’re improving RapidaAI’s multimodal EOS engine to make conversations smoother and more human across environments.
Voice Activity Detection (VAD)
Voice Activity Detection (VAD) is one of the oldest ways to detect when someone stops speaking. It listens for silence and decides if the turn has ended—usually after a fixed gap like 500–700 ms.
We tried both static and dynamic time frames, where the system waits longer if a user tends to pause more often. But in real phone-call scenarios, every millisecond matters. A short delay makes the system feel slow; reacting too early cuts off the user mid-sentence.
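To make the trade-off concrete, here is a minimal sketch of the static time-frame approach built on the open-source webrtcvad package; the frame size, aggressiveness, and 600 ms gap are illustrative values, not our production settings.

```python
# Silence-gap end-of-turn detection on top of webrtcvad (pip install webrtcvad).
import webrtcvad

SAMPLE_RATE = 16000      # Hz, 16-bit mono PCM; a 30 ms frame is 960 bytes
FRAME_MS = 30            # webrtcvad accepts 10, 20, or 30 ms frames
SILENCE_GAP_MS = 600     # static end-of-turn gap (the 500-700 ms range above)

class SilenceGapEOS:
    """Declares end of turn once trailing silence exceeds SILENCE_GAP_MS."""

    def __init__(self, aggressiveness: int = 2):
        self.vad = webrtcvad.Vad(aggressiveness)
        self.silence_ms = 0
        self.heard_speech = False

    def push_frame(self, frame: bytes) -> bool:
        if self.vad.is_speech(frame, SAMPLE_RATE):
            self.heard_speech = True
            self.silence_ms = 0
        else:
            self.silence_ms += FRAME_MS
        # A "dynamic" variant would grow SILENCE_GAP_MS for users who pause often.
        return self.heard_speech and self.silence_ms >= SILENCE_GAP_MS
```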
| Scenario | Language | Static / Dynamic VAD Result |
|---|---|---|
| Phone number (“Mera number hai 987-654…”) | Hindi | Cut off mid-speech |
| Email spelling (“a-l-i-c-e @ gmail…”) | Indonesian | Early trigger |
| Short query (“What’s the time?”) | English | Works fine |
| Background noise | Hindi | Occasional false triggers |
VAD works for clean, short utterances—but in live calls where timing precision drives user experience, it simply isn’t reliable enough.
Transformer-Based End-of-Speech Models
Text-only transformer EOS models, like LiveKit Turn Detector, work well for standard English sentences, but fail in real-world phone calls when users provide structured or mixed-language information, or when transcription quality drops.
For our benchmarks, we used Deepgram transcript confidence ≥0.5, meaning only tokens with sufficient STT confidence are considered. This reflects realistic production scenarios where low-confidence transcriptions are ignored or flagged.
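As a concrete illustration of that gate, the filter looks roughly like the sketch below; the payload shape follows a typical Deepgram streaming result with per-word confidence, and the exact key paths may vary by SDK or API version.

```python
# Keep only words the STT engine is reasonably sure about before the turn
# detector ever sees them. The payload shape is an assumption based on a
# typical Deepgram streaming message; adapt the key paths to your SDK/version.
MIN_CONFIDENCE = 0.5

def filter_transcript(stt_result: dict) -> str:
    words = stt_result["channel"]["alternatives"][0].get("words", [])
    kept = [w["word"] for w in words if w.get("confidence", 0.0) >= MIN_CONFIDENCE]
    return " ".join(kept)
```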
LiveKit combines VAD signals with a small language model (e.g., SmolLM2-135M for English or Qwen2.5-0.5B for multilingual).
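The sketch below shows the general idea rather than LiveKit’s fine-tuned detector: ask a small causal LM (SmolLM2-135M, the same base model cited above for English) how much probability mass it places on utterance-ending tokens after the current transcript, then use that score to gate the VAD signal. The token set and thresholds are assumptions for illustration only.

```python
# Text-only end-of-turn scoring with a small causal LM (illustrative sketch,
# not the LiveKit Turn Detector implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

# Tokens we treat as "the utterance could end here": EOS plus sentence-final punctuation.
END_IDS = set()
if tokenizer.eos_token_id is not None:
    END_IDS.add(tokenizer.eos_token_id)
for punct in [".", "?", "!"]:
    END_IDS.update(tokenizer.encode(punct, add_special_tokens=False))

@torch.no_grad()
def end_of_turn_probability(transcript: str) -> float:
    """Probability mass the LM assigns to an utterance-ending token next."""
    inputs = tokenizer(transcript, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    return float(probs[sorted(END_IDS)].sum())

def should_end_turn(transcript: str, trailing_silence_ms: int) -> bool:
    # Gate the cheap, fast VAD signal with the smarter text signal.
    return trailing_silence_ms > 200 and end_of_turn_probability(transcript) > 0.6
```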
Benchmark: Common Failure Scenarios
| Scenario | Language | Turn Length (sec) | Context Window (tokens) | Accuracy | Avg. Latency | Notes |
|---|---|---|---|---|---|---|
| Short English sentence | English | 2.8 | 300 | 98.7% | 18 ms | High STT confidence |
| Multi-clause reasoning | English | 4.3 | 500 | 95.8% | 33 ms | Minor hesitation handled |
| Phone number (“987-654-3210”) | Hindi-English | 5.0 | 550 | 72.4% | 47 ms | Misinterpreted digits, early cut-off |
| Email (“prashant@rapidaai.com”) | Indonesian-English | 6.2 | 600 | 69.1% | 58 ms | Symbols misrecognized, code-switching |
| Credit card / account numbers | Hindi-English | 7.0 | 600 | 65.3% | 60 ms | Numeric sequences split into multiple turns |
| Spelled words / alphanumeric codes | English + local | 6.5 | 550 | 68.7% | 55 ms | Letters interpreted as separate tokens |
| Long enumerations (“I have apples, bananas, mangoes…”) | English | 5.8 | 500 | 71.2% | 52 ms | Pauses between items misread as EOS |
| Background noise + mixed language | Hindi-English | 6.0 | 600 | 64.9% | 57 ms | STT errors cause early or late triggers |
Text-only EOS models such as the LiveKit Turn Detector perform reliably for English, but fall short on structured information and code-switched speech.
Multimodal Models (Audio + Text)
The recently released VoTurn‑80M combines the current audio window (~8 s) with past text context to detect end of speech. Audio and text embeddings are fed into a truncated small language model (SmolLM2-135M, first 12 layers, ~80M parameters) topped with a linear classification head; a rough sketch of this recipe follows the example below.
Example:
- Previous line: “What is your phone number”
- Current line (audio transcript, punctuation omitted): “987 654 3210”
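Here is a rough PyTorch sketch of that recipe. The dimensions, pooling strategy, and layer count are placeholders inferred only from the description above; VoTurn‑80M’s exact internals are not reproduced here.

```python
# Multimodal end-of-speech head: project recent audio features and past text
# embeddings into one sequence, run it through a truncated transformer stack
# (standing in for the first 12 SmolLM2 layers), and attach a linear EOS head.
import torch
import torch.nn as nn

class MultimodalEOSHead(nn.Module):
    def __init__(self, audio_dim=768, text_dim=576, hidden=576, layers=12):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)   # e.g. frames from a speech encoder
        self.text_proj = nn.Linear(text_dim, hidden)     # e.g. LM token embeddings
        block = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.classifier = nn.Linear(hidden, 1)           # p(end of speech)

    def forward(self, audio_feats, text_embeds):
        # audio_feats: [B, T_audio, audio_dim] from the last ~8 s of audio
        # text_embeds: [B, T_text, text_dim] from prior transcript context
        seq = torch.cat([self.text_proj(text_embeds),
                         self.audio_proj(audio_feats)], dim=1)
        pooled = self.backbone(seq).mean(dim=1)          # simple mean pooling
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)

# p_eos = MultimodalEOSHead()(audio_feats, text_embeds)  # one score per example
```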
Benchmark (phone-call scenarios, Deepgram confidence ≥0.5)
| Scenario | Language | Turn Length | Accuracy | Latency |
|---|---|---|---|---|
| Phone number | Hindi + English | 5 s | 90% | 28 ms |
| Email | Indonesian + English | 6 s | 88% | 32 ms |
| Code-switched reasoning | Hindi + English | 5.8 s | 92% | 30 ms |
| Non-English (pure Hindi/Indonesian) | Local | 5 s | 65% | 35 ms |
| Short English query | English | 1.7 s | 97% | 22 ms |
- Multimodal EOS fixes many of the English + structured-information failures (numbers, emails).
- Performance still drops sharply for pure local languages and low-resource code-switched turns.
- Latency is modestly higher than text-only on short queries, but comfortably within real-time budgets.
End-of-Speech Detection Benchmark
| Approach | Scenario | Language | Turn Length | Context (tokens / audio) | Accuracy | Avg. Latency |
|---|---|---|---|---|---|---|
| VAD (time-frame based) | Short English sentence | English | 2.8 s | None | 85% | 5 ms |
| VAD | Multi-clause reasoning | English | 4.3 s | None | 70% | 5 ms |
| VAD | Phone number | Hindi | 5 s | None | 50% | 5 ms |
| VAD | Phone number | Indonesian | 5 s | None | 52% | 5 ms |
| VAD | Email | Hindi | 6 s | None | 48% | 5 ms |
| VAD | Email | Indonesian | 6 s | None | 50% | 5 ms |
| Transformer-only / LiveKit | Short English sentence | English | 2.8 s | 300 tokens | 98.7% | 18 ms |
| Transformer-only | Multi-clause reasoning | English | 4.3 s | 500 tokens | 95.8% | 33 ms |
| Transformer-only | Phone number | Hindi | 5 s | 550 tokens | 72% | 47 ms |
| Transformer-only | Phone number | Indonesian | 5 s | 550 tokens | 70% | 48 ms |
| Transformer-only | Email | Hindi | 6 s | 600 tokens | 69% | 58 ms |
| Transformer-only | Email | Indonesian | 6 s | 600 tokens | 68% | 59 ms |
| Multimodal / VoTurn‑80M | Phone number | Hindi | 5 s | 550 tokens + 8 s audio | 90% | 28 ms |
| Multimodal | Phone number | Indonesian | 5 s | 550 tokens + 8 s audio | 88% | 30 ms |
| Multimodal | Email | Hindi | 6 s | 600 tokens + 8 s audio | 88% | 32 ms |
| Multimodal | Email | Indonesian | 6 s | 600 tokens + 8 s audio | 87% | 33 ms |
| Multimodal | Code-switched reasoning | Hindi + English | 5.8 s | 550 tokens + 8 s audio | 92% | 30 ms |
| Multimodal | Short English query | English | 1.7 s | 300 tokens + 3 s audio | 97% | 22 ms |
At Rapida, we’re experimenting with a multimodal approach that blends audio, text, and contextual cues to better handle real-world voice interactions. While multimodal models like VoTurn-80M show clear gains, we’re keeping the transformer-based path active and evolving a hybrid transformer + VAD pipeline that behaves consistently across languages, accents, and network conditions, all while running fully on-prem or within private clouds so no data leaves regulated deployments.
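As an illustration of how such a hybrid decision can be wired, here is a minimal sketch under assumed thresholds; these are placeholders, not our shipped defaults.

```python
# Hybrid end-of-turn decision: a cheap VAD silence gate first, then the
# text/multimodal EOS score, with the required silence stretched when the
# model is unsure. All thresholds below are illustrative placeholders.
def hybrid_end_of_turn(trailing_silence_ms: int, eos_score: float) -> bool:
    if trailing_silence_ms < 150:          # still too close to the last speech frame
        return False
    if eos_score >= 0.85:                  # model is confident the turn is complete
        return True
    if eos_score <= 0.30:                  # clearly mid-utterance (number, email, list...)
        return trailing_silence_ms > 1500  # only give up after a long hard silence
    # Uncertain zone: require more silence the less confident the model is.
    return trailing_silence_ms > 300 + 900 * (1.0 - eos_score)
```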
This is a small but vital piece in making real-time voice AI infrastructure production-grade — fast, private, and truly multilingual.
If you’re building in this space — from CPaaS, contact center AI, or on-device assistants — we’d love to exchange notes or benchmark performance.