By Prashant Srivastav in Talk — 23 Dec 2025

Rapida Voice System Performance Benchmarks

This post documents latency and scaling behavior of Rapida under sustained concurrent voice traffic. The intent is to show how the system behaves under realistic production conditions rather than demo workloads.

Rapida is an open source voice orchestration system designed to manage real time audio streaming, speech recognition, language model execution, and speech synthesis with full visibility into each stage.

Measurement scope

Latency was measured at four points along the full request path observed by a caller:

Audio ingestion from telephony into the orchestrator
Time to first partial transcript from speech recognition
Time to first token from the language model
End to end response time including synthesized audio
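
One way to capture these four measurements is to stamp each caller turn at the stage boundaries and derive the intervals afterwards. The sketch below is only an illustration and not Rapida's actual instrumentation: the `CallTimeline` class, its field names, and the choice of start point for each interval are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CallTimeline:
    """Monotonic timestamps, in seconds, for one caller turn (hypothetical names)."""
    audio_sent: float         # telephony side emits the audio frame
    audio_received: float     # orchestrator receives that frame
    stt_first_partial: float  # first partial transcript arrives
    llm_request: float        # prompt is sent to the language model
    llm_first_token: float    # first token comes back
    tts_first_audio: float    # first synthesized audio goes to the caller

    def latencies_ms(self) -> Dict[str, float]:
        """Derive the four measured intervals in milliseconds."""
        def ms(start: float, end: float) -> float:
            return round((end - start) * 1000.0, 3)

        return {
            "audio_ingestion": ms(self.audio_sent, self.audio_received),
            # Assumption: STT latency is counted from audio arrival.
            "stt_partial": ms(self.audio_received, self.stt_first_partial),
            "llm_first_token": ms(self.llm_request, self.llm_first_token),
            # Assumption: end to end runs from first audio in to first audio out.
            "end_to_end": ms(self.audio_sent, self.tts_first_audio),
        }
```

In practice the timestamps would come from a monotonic clock such as `time.monotonic()` so that wall-clock adjustments do not distort the intervals.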

All values are reported as p50 and p95. Mean values were not used because they hide tail latency.
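
To see why a mean can hide tail latency, consider a small worked example. The sample values below are invented for illustration and are not benchmark data, and the nearest-rank `percentile` helper is just one common definition, not necessarily the one used for the table.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical end to end latencies in ms: 90 fast calls and 10 slow ones.
latencies_ms = [1000] * 90 + [3000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
```

Here the mean is 1200 ms, which suggests every call is fine, while p50 is 1000 ms and p95 is 3000 ms. One caller in ten waits three seconds, and only the percentile view shows it.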

Performance benchmarks

Latency versus concurrent calls

Concurrent calls	Audio ingestion p50 / p95	STT partial p50 / p95	LLM first token p50 / p95	End to end p50 / p95	System notes
10	< 20 ms / < 30 ms	120 ms / 300 ms	300 ms / 800 ms	< 1.2 s / < 1.6 s	Single orchestrator instance with no queueing
50	< 25 ms / < 40 ms	150 ms / 350 ms	400 ms / 900 ms	< 1.3 s / < 1.7 s	Speech workers become CPU bound
100	< 30 ms / < 60 ms	180 ms / 400 ms	500 ms / 1000 ms	< 1.4 s / < 1.8 s	Speech layer scales horizontally
250	< 40 ms / < 80 ms	220 ms / 450 ms	650 ms / 1200 ms	< 1.5 s / < 2.0 s	Queueing visible at p95
500	< 60 ms / < 120 ms	300 ms / 600 ms	900 ms / 1600 ms	< 1.8 s / < 2.4 s	Backpressure active
1000	< 100 ms / < 200 ms	450 ms / 900 ms	1400 ms / 2400 ms	< 2.5 s / < 3.5 s	System saturated without additional capacity

Benchmarks were executed on Amazon Web Services in the ap-south-1 region, on the following infrastructure:

Kubernetes EKS version 1.28
Orchestrator node type c6i.4xlarge
OpenSearch node type r6g.4xlarge
PostgreSQL high availability r6g.2xlarge
Redis r6g.large

Traffic was steady state with sustained concurrency rather than short bursts.
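
Sustained concurrency means a new call starts as soon as one ends, so the number of calls in flight stays at the target level. The following is a minimal sketch of that pattern using `asyncio`. It is not the load tool used for these benchmarks; `simulated_call` is a hypothetical stand-in that only sleeps, where a real test would stream audio through the telephony path.

```python
import asyncio


async def simulated_call(duration_s: float) -> None:
    """Hypothetical stand-in for one voice call."""
    await asyncio.sleep(duration_s)


async def hold_concurrency(target: int, total_calls: int, duration_s: float) -> int:
    """Run total_calls calls with at most `target` in flight at any time.

    Returns the peak number of simultaneous calls observed.
    """
    slots = asyncio.Semaphore(target)
    in_flight = 0
    peak = 0

    async def one_call() -> None:
        nonlocal in_flight, peak
        async with slots:  # wait for a free slot, then start immediately
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await simulated_call(duration_s)
            finally:
                in_flight -= 1

    await asyncio.gather(*(one_call() for _ in range(total_calls)))
    return peak
```

For example, `asyncio.run(hold_concurrency(target=10, total_calls=50, duration_s=0.01))` keeps ten simulated calls active until all fifty have completed.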

Observations

At low and moderate concurrency (10 to 250 calls), end to end p50 stays under 1.5 s and p95 under 2.0 s, which is within a conversational range.

Latency increases gradually as load grows. From 10 to 1000 concurrent calls, end to end p95 rises from under 1.6 s to under 3.5 s, with no sudden collapse point where the system becomes unstable.
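
The gradual increase can be checked directly against the table. The short sketch below computes the step-over-step growth of the end to end p95 figures. Because the table reports these values as upper bounds ("<"), the ratios are approximate; the `step_growth` helper is an illustrative name.

```python
# End to end p95 upper bounds in seconds, copied from the table above.
E2E_P95_S = {10: 1.6, 50: 1.7, 100: 1.8, 250: 2.0, 500: 2.4, 1000: 3.5}


def step_growth(series):
    """Return (from_calls, to_calls, ratio) for each adjacent pair of load levels."""
    levels = sorted(series)
    return [
        (lo, hi, round(series[hi] / series[lo], 2))
        for lo, hi in zip(levels, levels[1:])
    ]


for lo, hi, ratio in step_growth(E2E_P95_S):
    print(f"{lo:>4} -> {hi:>4} calls: p95 bound x{ratio}")
```

The largest step is the last one: doubling load from 500 to 1000 calls raises the p95 bound by a factor of about 1.46. Every earlier step is at or below 1.2.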

Speech recognition saturates before language model execution: speech workers become CPU bound at around 50 concurrent calls. This allows targeted scaling of the speech layer instead of scaling the entire stack.

Queueing and backpressure are visible in metrics before they impact most calls. This makes capacity planning predictable.
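
To make the idea concrete, here is a minimal sketch of admission control with a bounded queue that exposes its own depth as a metric. It does not describe Rapida's internal mechanism, which this post does not detail; the `AdmissionGate` class and its method and metric names are hypothetical.

```python
from collections import deque


class AdmissionGate:
    """Hypothetical admission control: fixed worker slots plus a bounded wait queue."""

    def __init__(self, max_active: int, max_queued: int) -> None:
        self.max_active = max_active
        self.max_queued = max_queued
        self.active = set()
        self.queue = deque()
        self.rejected = 0  # counter a metrics exporter could scrape

    def offer(self, call_id: str) -> str:
        """Admit, queue, or reject a new call and report which happened."""
        if len(self.active) < self.max_active:
            self.active.add(call_id)
            return "active"
        if len(self.queue) < self.max_queued:
            self.queue.append(call_id)
            return "queued"
        self.rejected += 1  # backpressure: shed load rather than queue without limit
        return "rejected"

    def finish(self, call_id: str) -> None:
        """Release a slot and promote the oldest queued call, if any."""
        self.active.discard(call_id)
        if self.queue and len(self.active) < self.max_active:
            self.active.add(self.queue.popleft())

    def metrics(self) -> dict:
        """Snapshot of the gauges that reveal queueing before calls are affected."""
        return {
            "active": len(self.active),
            "queue_depth": len(self.queue),
            "rejected": self.rejected,
        }
```

A rising `queue_depth` while `rejected` is still zero is the early signal: calls are waiting but none are being turned away yet, which leaves time to add capacity.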

Why this matters

In production voice systems, the key questions are not peak throughput or demo latency. They are:

Where is time spent?
When does queueing begin?
Which component saturates first?
How does p95 behave as concurrency grows?

Rapida exposes these details directly through metrics and logs. Nothing is hidden behind managed abstractions.

Voice systems fail more often due to orchestration, scheduling, and lack of visibility than due to model quality.

Rapida focuses on these layers.
You can run it yourself.
You can reproduce these benchmarks.
You can change the system when it does not fit your needs.

Source code and documentation are available at github.com/rapidaai/voice-ai
