Rapida Voice System Performance Benchmarks

This post documents latency and scaling behavior of Rapida under sustained concurrent voice traffic. The intent is to show how the system behaves under realistic production conditions rather than demo workloads.

Rapida is an open-source voice orchestration system designed to manage real-time audio streaming, speech recognition, language model execution, and speech synthesis with full visibility into each stage.

Measurement scope

Latency was measured across the full request path observed by a caller, broken into four stages (a minimal instrumentation sketch follows the list):

- Audio ingestion from telephony into the orchestrator
- Time to first partial transcript from speech recognition
- Time to first token from the language model
- End-to-end response time including synthesized audio
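
For concreteness, here is a minimal sketch of the kind of per-stage timing that produces these numbers. The `CallTimeline` class and the stage names are illustrative placeholders, not Rapida's actual instrumentation API:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CallTimeline:
    """Illustrative per-call timing record; stage names are hypothetical."""
    start: float = field(default_factory=time.monotonic)
    marks: dict[str, float] = field(default_factory=dict)

    def mark(self, stage: str) -> None:
        # Record elapsed time (ms) from call start to this stage boundary.
        self.marks[stage] = (time.monotonic() - self.start) * 1000.0

# Usage: mark each boundary as the pipeline reaches it.
timeline = CallTimeline()
# ... audio frames arrive from telephony ...
timeline.mark("audio_ingested")
# ... speech recognizer emits its first partial transcript ...
timeline.mark("stt_first_partial")
# ... language model streams its first token ...
timeline.mark("llm_first_token")
# ... synthesized audio starts playing back to the caller ...
timeline.mark("e2e_first_audio")
print(timeline.marks)
```

Each mark is a cumulative offset from call start, so per-stage durations fall out as differences between adjacent marks.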

All values are reported as p50 and p95. Mean values were not used because they hide tail latency.
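
A nearest-rank percentile over raw samples is enough to reproduce this style of reporting. The sample values below are made up to show how a single slow call moves the mean while leaving the p50 untouched:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: rank = ceil(p/100 * n), 1-indexed."""
    ranked = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[rank - 1]

latencies_ms = [118, 131, 120, 640, 125, 122, 119, 127, 480, 124]
print(f"p50={percentile(latencies_ms, 50):.0f} ms  "
      f"p95={percentile(latencies_ms, 95):.0f} ms")
# p50=124 ms, p95=640 ms; the mean sits near 210 ms and hides the tail.
```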

Performance benchmarks

Latency versus concurrent calls

| Concurrent calls | Audio ingestion p50 / p95 | STT first partial p50 / p95 | LLM first token p50 / p95 | End-to-end p50 / p95 | System notes |
|---|---|---|---|---|---|
| 10 | < 20 ms / < 30 ms | 120 ms / 300 ms | 300 ms / 800 ms | < 1.2 s / < 1.6 s | Single orchestrator instance with no queueing |
| 50 | < 25 ms / < 40 ms | 150 ms / 350 ms | 400 ms / 900 ms | < 1.3 s / < 1.7 s | Speech workers become CPU bound |
| 100 | < 30 ms / < 60 ms | 180 ms / 400 ms | 500 ms / 1000 ms | < 1.4 s / < 1.8 s | Speech layer scales horizontally |
| 250 | < 40 ms / < 80 ms | 220 ms / 450 ms | 650 ms / 1200 ms | < 1.5 s / < 2.0 s | Queueing visible at p95 |
| 500 | < 60 ms / < 120 ms | 300 ms / 600 ms | 900 ms / 1600 ms | < 1.8 s / < 2.4 s | Backpressure active |
| 1000 | < 100 ms / < 200 ms | 450 ms / 900 ms | 1400 ms / 2400 ms | < 2.5 s / < 3.5 s | System saturated without additional capacity |

Benchmarks were executed on Amazon Web Services in the ap-south-1 region.

- Kubernetes: EKS 1.28
- Orchestrator nodes: c6i.4xlarge
- OpenSearch nodes: r6g.4xlarge
- PostgreSQL (high availability): r6g.2xlarge
- Redis: r6g.large

Traffic was steady-state, with sustained concurrency rather than short bursts.
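
The load shape matters, so a harness that holds concurrency flat looks roughly like the sketch below. The call driver is a placeholder (`one_call` just occupies a slot for a call-length interval); a real run would place SIP or WebRTC calls against the deployment:

```python
import asyncio
import random

CONCURRENCY = 100          # matches one row of the table above
CALL_SECONDS = (20, 60)    # hypothetical call-duration range

async def one_call(call_id: int) -> None:
    # Placeholder for a real call; hold the slot for a call-length interval.
    await asyncio.sleep(random.uniform(*CALL_SECONDS))

async def sustain(duration_s: float) -> None:
    """Keep CONCURRENCY calls in flight for the whole run: whenever a call
    ends, start a replacement immediately (no bursts, no ramp gaps)."""
    deadline = asyncio.get_running_loop().time() + duration_s
    pending = {asyncio.create_task(one_call(i)) for i in range(CONCURRENCY)}
    next_id = CONCURRENCY
    while asyncio.get_running_loop().time() < deadline:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        for _ in done:                      # backfill finished calls
            pending.add(asyncio.create_task(one_call(next_id)))
            next_id += 1
    for task in pending:
        task.cancel()

asyncio.run(sustain(duration_s=600.0))
```

The key property is backfill on completion rather than a fixed arrival rate, which is what keeps concurrency flat for the whole measurement window.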

Observations

At low and moderate concurrency, response latency stays within a conversational range: end-to-end p50 remains under 1.4 s through 100 concurrent calls.

Latency increases gradually as load grows. There are no sudden collapse points where the system becomes unstable.

Speech recognition saturates before language model execution; in the table above, speech workers become CPU bound at around 50 concurrent calls. This allows targeted scaling of the speech layer instead of scaling the entire stack.
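
A targeted-scaling decision can key off per-pool utilization alone. A toy sketch, with component names and thresholds invented for illustration:

```python
# Hypothetical targeted-scaling rule: grow only the pool that is saturating.
# Pool names and thresholds are illustrative, not Rapida configuration.
POOLS = {"stt_workers": 0.92, "llm_gateway": 0.55, "tts_workers": 0.40}  # CPU util

def scale_decisions(utilization: dict[str, float],
                    high_water: float = 0.80) -> dict[str, int]:
    # One extra replica for any pool above the high-water mark; others untouched.
    return {pool: (1 if util > high_water else 0)
            for pool, util in utilization.items()}

print(scale_decisions(POOLS))
# {'stt_workers': 1, 'llm_gateway': 0, 'tts_workers': 0}
```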

Queueing and backpressure are visible in metrics before they affect most calls, which makes capacity planning predictable; a simple check along these lines is sketched below.
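
A minimal capacity-planning check built on that property might look like the following. The inputs and the 2 s budget are assumptions for illustration, not Rapida's metric schema:

```python
# Hypothetical check: alert on steady queue growth while the caller-visible
# p95 is still inside budget, so capacity is added before callers notice.
def queueing_warning(queue_depth_trend: list[int],
                     e2e_p95_ms: float,
                     p95_budget_ms: float = 2000.0) -> str | None:
    growing = all(b > a for a, b in
                  zip(queue_depth_trend, queue_depth_trend[1:]))
    if e2e_p95_ms >= p95_budget_ms:
        return "p95 budget already exceeded; scale out now"
    if growing:
        return "queue depth rising steadily; add capacity before p95 degrades"
    return None

print(queueing_warning([3, 5, 9, 14], e2e_p95_ms=1800.0))
```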

Why this matters

In production voice systems, the key questions are not about peak throughput or demo latency. They are the following (a sketch after the list shows how to read the answers out of the benchmark data):

  1. Where is time spent?
  2. When does queueing begin?
  3. Which component saturates first?
  4. How does p95 behave as concurrency grows?
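
Here is one way to read those answers from the table's own numbers, treating its end-to-end upper bounds as point values for illustration. The steps where p95 grows faster than p50 are where queueing begins:

```python
# Concurrency, end-to-end p50 (s), end-to-end p95 (s), from the table above.
ROWS = [
    (10, 1.2, 1.6), (50, 1.3, 1.7), (100, 1.4, 1.8),
    (250, 1.5, 2.0), (500, 1.8, 2.4), (1000, 2.5, 3.5),
]
for (c0, p50a, p95a), (c1, p50b, p95b) in zip(ROWS, ROWS[1:]):
    p50_growth, p95_growth = p50b / p50a, p95b / p95a
    flag = "  <- tail widening: queueing" if p95_growth > p50_growth else ""
    print(f"{c0:4d} -> {c1:4d} calls: "
          f"p50 x{p50_growth:.2f}, p95 x{p95_growth:.2f}{flag}")
```

Run against these numbers, the check flags the 100 to 250 step, matching the "Queueing visible at p95" note in the table, and the 500 to 1000 step, where the system saturates. The same comparison applied per stage identifies which component saturates first.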

Rapida exposes these details directly through metrics and logs. Nothing is hidden behind managed abstractions.

Voice systems fail more often due to orchestration, scheduling, and lack of visibility than due to model quality.

Rapida focuses on these layers.
You can run it yourself.
You can reproduce these benchmarks.
You can change the system when it does not fit your needs.

Source code and documentation are available at github.com/rapidaai/voice-ai.