Comparing Approaches for End-of-Speech Detection in Voice AI
While building RapidaAI, one of the biggest challenges we are navigating is figuring out when a user has actually finished speaking — especially across different languages, accents, and natural speaking styles.
Get it wrong, and the conversation feels off: interrupt too soon, and the user has to repeat themselves; wait too long, and the chat feels slow and clunky. In a real-time voice system, even a 300 ms delay can make the interaction feel broken. For enterprise-grade Voice AI orchestration, this timing precision directly affects conversion rates, call durations, and overall experience.
For example:
- A user giving a phone number in Hindi or Indonesian: “Mera number hai 965-22-4660…”
- A user giving an email: “You can reach me at prash...ant[at]...rapidaai.. uhh.. rapida.ai”
- A user giving flight details in multiple parts: “I’m... flying from San Francisco… arriving in New York on Tuesday…”
At Rapida, we’re actively tackling these situations, and it becomes even trickier in multilingual or code-switched conversations. Getting End-of-Speech (EOS) detection right isn’t just a UX nicety; it’s foundational infrastructure for natural, low-latency conversations across phone, web, and embedded voice interfaces.
In this post, we’ll look at the main approaches we’ve tried, what we’ve learned, and how we’re improving RapidaAI’s multimodal EOS engine to make conversations smoother and more human across environments.
Voice Activity Detection (VAD)
Voice Activity Detection (VAD) is one of the oldest ways to detect when someone stops speaking. It listens for silence and decides if the turn has ended—usually after a fixed gap like 500–700 ms.
We tried both static and dynamic time frames, where the system waits longer if a user tends to pause more often. But in real phone-call scenarios, every millisecond matters. A short delay makes the system feel slow; reacting too early cuts off the user mid-sentence.
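To make the trade-off concrete, here is a minimal sketch of the static time-frame approach built on the open-source webrtcvad package; the frame size, aggressiveness, and 600 ms gap are illustrative values, not our production settings.

```python
# Silence-gap end-of-turn detection on top of webrtcvad (pip install webrtcvad).
import webrtcvad

SAMPLE_RATE = 16000      # Hz, 16-bit mono PCM; a 30 ms frame is 960 bytes
FRAME_MS = 30            # webrtcvad accepts 10, 20, or 30 ms frames
SILENCE_GAP_MS = 600     # static end-of-turn gap (the 500-700 ms range above)

class SilenceGapEOS:
    """Declares end of turn once trailing silence exceeds SILENCE_GAP_MS."""

    def __init__(self, aggressiveness: int = 2):
        self.vad = webrtcvad.Vad(aggressiveness)
        self.silence_ms = 0
        self.heard_speech = False

    def push_frame(self, frame: bytes) -> bool:
        if self.vad.is_speech(frame, SAMPLE_RATE):
            self.heard_speech = True
            self.silence_ms = 0
        else:
            self.silence_ms += FRAME_MS
        # A "dynamic" variant would grow SILENCE_GAP_MS for users who pause often.
        return self.heard_speech and self.silence_ms >= SILENCE_GAP_MS
```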
| Scenario | Language | Static / Dynamic VAD Result |
|---|---|---|
| Phone number (“Mera number hai 987-654…”) | Hindi | Cut off mid-speech |
| Email spelling (“a-l-i-c-e @ gmail…”) | Indonesian | Early trigger |
| Short query (“What’s the time?”) | English | Works fine |
| Background noise | Hindi | Occasional false triggers |
VAD works for clean, short utterances—but in live calls where timing precision drives user experience, it simply isn’t reliable enough.
Transformer-Based End-of-Speech Models
Text-only transformer EOS models, like LiveKit Turn Detector, work well for standard English sentences, but fail in real-world phone calls when users provide structured or mixed-language information, or when transcription quality drops.
For our benchmarks, we used Deepgram transcript confidence ≥0.5, meaning only tokens with sufficient STT confidence are considered. This reflects realistic production scenarios where low-confidence transcriptions are ignored or flagged.
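As a concrete illustration of that gate, the filter looks roughly like the sketch below; the payload shape follows a typical Deepgram streaming result with per-word confidence, and the exact key paths may vary by SDK or API version.

```python
# Keep only words the STT engine is reasonably sure about before the turn
# detector ever sees them. The payload shape is an assumption based on a
# typical Deepgram streaming message; adapt the key paths to your SDK/version.
MIN_CONFIDENCE = 0.5

def filter_transcript(stt_result: dict) -> str:
    words = stt_result["channel"]["alternatives"][0].get("words", [])
    kept = [w["word"] for w in words if w.get("confidence", 0.0) >= MIN_CONFIDENCE]
    return " ".join(kept)
```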
LiveKit combines VAD signals with a small language model (e.g., SmolLM2-135M for English or Qwen2.5-0.5B for multilingual).
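The sketch below shows the general idea rather than LiveKit’s fine-tuned detector: ask a small causal LM (SmolLM2-135M, the same base model cited above for English) how much probability mass it places on utterance-ending tokens after the current transcript, then use that score to gate the VAD signal. The token set and thresholds are assumptions for illustration only.

```python
# Text-only end-of-turn scoring with a small causal LM (illustrative sketch,
# not the LiveKit Turn Detector implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

# Tokens we treat as "the utterance could end here": EOS plus sentence-final punctuation.
END_IDS = set()
if tokenizer.eos_token_id is not None:
    END_IDS.add(tokenizer.eos_token_id)
for punct in [".", "?", "!"]:
    END_IDS.update(tokenizer.encode(punct, add_special_tokens=False))

@torch.no_grad()
def end_of_turn_probability(transcript: str) -> float:
    """Probability mass the LM assigns to an utterance-ending token next."""
    inputs = tokenizer(transcript, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    return float(probs[sorted(END_IDS)].sum())

def should_end_turn(transcript: str, trailing_silence_ms: int) -> bool:
    # Gate the cheap, fast VAD signal with the smarter text signal.
    return trailing_silence_ms > 200 and end_of_turn_probability(transcript) > 0.6
```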
Benchmark: Common Failure Scenarios
| Scenario | Language | Turn Length (sec) | Context Window (tokens) | Accuracy | Avg. Latency | Notes |
|---|---|---|---|---|---|---|
| Short English sentence | English | 2.8 | 300 | 98.7% | 18 ms | High STT confidence |
| Multi-clause reasoning | English | 4.3 | 500 | 95.8% | 33 ms | Minor hesitation handled |
| Phone number (“987-654-3210”) | Hindi-English | 5.0 | 550 | 72.4% | 47 ms | Misinterpreted digits, early cut-off |
| Email (“prashant@rapidaai.com”) | Indonesian-English | 6.2 | 600 | 69.1% | 58 ms | Symbols misrecognized, code-switching |
| Credit card / account numbers | Hindi-English | 7.0 | 600 | 65.3% | 60 ms | Numeric sequences split into multiple turns |
| Spelled words / alphanumeric codes | English + local | 6.5 | 550 | 68.7% | 55 ms | Letters interpreted as separate tokens |
| Long enumerations (“I have apples, bananas, mangoes…”) | English | 5.8 | 500 | 71.2% | 52 ms | Pauses between items misread as EOS |
| Background noise + mixed language | Hindi-English | 6.0 | 600 | 64.9% | 57 ms | STT errors cause early or late triggers |
Text-only EOS models such as the LiveKit Turn Detector perform reliably for English, but fall short on structured information and code-switched speech.
Multimodal Models (Audio + Text)
The recently released VoTurn‑80M combines the current audio window (~8 s) with past text context to detect end of speech. Audio and text embeddings are fed into a truncated small language model (SmolLM2-135M, first 12 layers, ~80M parameters) topped with a linear classification head; a rough sketch of this recipe follows the example below.
Example:
- Previous line: “What is your phone number”
- Current line (audio transcript, punctuation omitted): “987 654 3210”
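Here is a rough PyTorch sketch of that recipe. The dimensions, pooling strategy, and layer count are placeholders inferred only from the description above; VoTurn‑80M’s exact internals are not reproduced here.

```python
# Multimodal end-of-speech head: project recent audio features and past text
# embeddings into one sequence, run it through a truncated transformer stack
# (standing in for the first 12 SmolLM2 layers), and attach a linear EOS head.
import torch
import torch.nn as nn

class MultimodalEOSHead(nn.Module):
    def __init__(self, audio_dim=768, text_dim=576, hidden=576, layers=12):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)   # e.g. frames from a speech encoder
        self.text_proj = nn.Linear(text_dim, hidden)     # e.g. LM token embeddings
        block = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.classifier = nn.Linear(hidden, 1)           # p(end of speech)

    def forward(self, audio_feats, text_embeds):
        # audio_feats: [B, T_audio, audio_dim] from the last ~8 s of audio
        # text_embeds: [B, T_text, text_dim] from prior transcript context
        seq = torch.cat([self.text_proj(text_embeds),
                         self.audio_proj(audio_feats)], dim=1)
        pooled = self.backbone(seq).mean(dim=1)          # simple mean pooling
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)

# p_eos = MultimodalEOSHead()(audio_feats, text_embeds)  # one score per example
```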
Benchmark (phone-call scenarios, Deepgram confidence ≥0.5)
| Scenario | Language | Turn Length | Accuracy | Latency |
|---|---|---|---|---|
| Phone number | Hindi + English | 5 s | 90% | 28 ms |
| Email | Indonesian + English | 6 s | 88% | 32 ms |
| Code-switched reasoning | Hindi + English | 5.8 s | 92% | 30 ms |
| Non-English (pure Hindi/Indonesian) | Local | 5 s | 65% | 35 ms |
| Short English query | English | 1.7 s | 97% | 22 ms |
- Multimodal EOS fixes many of the English + structured-information failures (numbers, emails).
- Performance still drops sharply for pure local languages and low-resource code-switched turns.
- Latency is modestly higher than text-only on short queries, but comfortably within real-time budgets.
End-of-Speech Detection Benchmark
| Approach | Scenario | Language | Turn Length | Context (tokens / audio) | Accuracy | Avg. Latency |
|---|---|---|---|---|---|---|
| VAD (time-frame based) | Short English sentence | English | 2.8 s | None | 85% | 5 ms |
| VAD | Multi-clause reasoning | English | 4.3 s | None | 70% | 5 ms |
| VAD | Phone number | Hindi | 5 s | None | 50% | 5 ms |
| VAD | Phone number | Indonesian | 5 s | None | 52% | 5 ms |
| VAD | Email | Hindi | 6 s | None | 48% | 5 ms |
| VAD | Email | Indonesian | 6 s | None | 50% | 5 ms |
| Transformer-only / LiveKit | Short English sentence | English | 2.8 s | 300 tokens | 98.7% | 18 ms |
| Transformer-only | Multi-clause reasoning | English | 4.3 s | 500 tokens | 95.8% | 33 ms |
| Transformer-only | Phone number | Hindi | 5 s | 550 tokens | 72% | 47 ms |
| Transformer-only | Phone number | Indonesian | 5 s | 550 tokens | 70% | 48 ms |
| Transformer-only | Email | Hindi | 6 s | 600 tokens | 69% | 58 ms |
| Transformer-only | Email | Indonesian | 6 s | 600 tokens | 68% | 59 ms |
| Multimodal / VoTurn‑80M | Phone number | Hindi | 5 s | 550 tokens + 8 s audio | 90% | 28 ms |
| Multimodal | Phone number | Indonesian | 5 s | 550 tokens + 8 s audio | 88% | 30 ms |
| Multimodal | Email | Hindi | 6 s | 600 tokens + 8 s audio | 88% | 32 ms |
| Multimodal | Email | Indonesian | 6 s | 600 tokens + 8 s audio | 87% | 33 ms |
| Multimodal | Code-switched reasoning | Hindi + English | 5.8 s | 550 tokens + 8 s audio | 92% | 30 ms |
| Multimodal | Short English query | English | 1.7 s | 300 tokens + 3 s audio | 97% | 22 ms |
At Rapida, we’re experimenting with a multimodal approach that blends audio, text, and contextual cues to better handle real-world voice interactions. While multimodal models like VoTurn-80M show clear gains, we’re keeping the transformer-based path active and evolving a hybrid transformer + VAD pipeline that behaves consistently across languages, accents, and network conditions, all while running fully on-prem or within private clouds so no data leaves regulated deployments.
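As an illustration of how such a hybrid decision can be wired, here is a minimal sketch under assumed thresholds; these are placeholders, not our shipped defaults.

```python
# Hybrid end-of-turn decision: a cheap VAD silence gate first, then the
# text/multimodal EOS score, with the required silence stretched when the
# model is unsure. All thresholds below are illustrative placeholders.
def hybrid_end_of_turn(trailing_silence_ms: int, eos_score: float) -> bool:
    if trailing_silence_ms < 150:          # still too close to the last speech frame
        return False
    if eos_score >= 0.85:                  # model is confident the turn is complete
        return True
    if eos_score <= 0.30:                  # clearly mid-utterance (number, email, list...)
        return trailing_silence_ms > 1500  # only give up after a long hard silence
    # Uncertain zone: require more silence the less confident the model is.
    return trailing_silence_ms > 300 + 900 * (1.0 - eos_score)
```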
This is a small but vital piece in making real-time voice AI infrastructure production-grade — fast, private, and truly multilingual.
If you’re building in this space — from CPaaS, contact center AI, or on-device assistants — we’d love to exchange notes or benchmark performance.