Chat Completions vs Responses vs WebSocket: TTFT and TRT in Rapida

In voice AI, latency differences of 300–500 ms completely change conversation quality. Yet many teams still benchmark only tokens/sec instead of end-to-end conversational responsiveness.

LLMs are evolving fast, and so is OpenAI. Every few months, the model layer changes in a meaningful way: better reasoning, better tool calling, better streaming behavior, new APIs, and new
request patterns. For most software products, that usually means better answers. For voice AI, it means something more basic and more unforgiving: how quickly can the system start speaking
back?

Voice does not behave like chat. In a chat product, a few hundred milliseconds can disappear behind a typing indicator. In a live call, that same delay becomes silence. The caller finishes
speaking, waits for the assistant, and the system either feels responsive or it feels broken. That is why we keep optimizing the LLM path inside Rapida, not just by changing models or
tuning prompts, but by changing how we connect to the model.

A production voice pipeline is a chain: user audio comes in, STT turns it into text, the LLM generates the next response, TTS converts that response back into audio, and the audio is
streamed back to the caller. The LLM sits directly in the middle of that path. If it is slow to produce its first token, TTS cannot start early. If TTS cannot start early, audio cannot go
back to the caller. That is where the silence comes from.

This is why I care about two latency numbers more than generic response time: TTFT and TRT. TTFT, time to first token, tells us when the LLM starts producing usable output. TRT, total
response time, tells us when the full model response is complete. Both matter, but TTFT matters more for perceived latency in voice. The sooner the first token arrives, the sooner the rest
of the voice pipeline can move.
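
To make the two numbers concrete, here is a minimal sketch of how TTFT and TRT can be measured around any streaming response. It is not Rapida's instrumentation: measure_stream and fake_llm_stream are hypothetical names, and the fake stream only stands in for a real provider call.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_ms, trt_ms, full_text).

    The clock starts when this function is called, so call it right
    before the request is sent.
    """
    start = clock()
    ttft_ms = None
    parts = []
    for token in stream:
        if ttft_ms is None and token:
            # First non-empty chunk: the earliest point TTS could start.
            ttft_ms = (clock() - start) * 1000.0
        parts.append(token)
    trt_ms = (clock() - start) * 1000.0
    if ttft_ms is None:
        # Nothing usable was streamed; report TTFT as equal to TRT.
        ttft_ms = trt_ms
    return ttft_ms, trt_ms, "".join(parts)


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(0.05)  # simulated wait before the first token
    yield "Hello"
    time.sleep(0.02)  # simulated gap between tokens
    yield ", world"


if __name__ == "__main__":
    ttft, trt, text = measure_stream(fake_llm_stream())
    print(f"TTFT={ttft:.1f} ms TRT={trt:.1f} ms text={text!r}")
```

The clock is injectable so the same function can be checked with a deterministic fake clock instead of real sleeps.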

In Rapida, we added a way to configure the OpenAI transport path directly from the assistant model settings. The setting is connection.transport, and it currently supports chat_complete,
chat_response, and websocket. That means the same assistant can use the same OpenAI model, same prompt, same token budget, and same runtime configuration, while routing the request through
a different caller path.

That matters because it lets us measure the transport path without changing the rest of the system. At the backend level, chat_complete routes through openai/chat_complete, chat_response
routes through openai/chat_response, and websocket routes through openai/websocket_streamer. Same model. Same assistant. Different path into the model.
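
The routing described above can be sketched as a simple lookup. The transport values and backend paths are the ones named in this post; the AssistantModelConfig class and the resolve_caller function are hypothetical and only illustrate the idea, not Rapida's actual code.

```python
from dataclasses import dataclass

# Transport values and backend paths as named in this post.
TRANSPORT_ROUTES = {
    "chat_complete": "openai/chat_complete",
    "chat_response": "openai/chat_response",
    "websocket": "openai/websocket_streamer",
}


@dataclass(frozen=True)
class AssistantModelConfig:
    """Hypothetical settings shape; only connection.transport is from the post."""
    model: str
    transport: str  # value of connection.transport


def resolve_caller(config: AssistantModelConfig) -> str:
    """Return the backend caller path for the configured transport."""
    try:
        return TRANSPORT_ROUTES[config.transport]
    except KeyError:
        supported = ", ".join(sorted(TRANSPORT_ROUTES))
        raise ValueError(
            f"unsupported connection.transport {config.transport!r}; "
            f"expected one of: {supported}"
        ) from None


if __name__ == "__main__":
    cfg = AssistantModelConfig(model="gpt-4o", transport="websocket")
    print(resolve_caller(cfg))
```

Because only the transport field changes between runs, everything else in the assistant configuration stays identical, which is what makes the comparison fair.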

For the first benchmark, I used gpt-4o and measured response timing across the OpenAI transport options inside Rapida. The setup was intentionally simple: same assistant, same prompt, same
token budget, same OpenAI provider. The only thing changing was the transport path.

The result was clear:

Transport        TTFT p50     TTFT p95     TRT p50      TRT p95
Chat Complete    481.92 ms    516.38 ms    531.92 ms    566.38 ms
Chat Response    437.59 ms    469.18 ms    487.59 ms    519.18 ms
WebSocket        393.27 ms    421.98 ms    443.27 ms    471.98 ms

The important part is not just the table. It is what the table changes in the voice pipeline. A lower TTFT means Rapida can begin the downstream TTS path sooner. That means the caller
hears the assistant sooner. In text products, this kind of difference might look like a small optimization. In voice AI, it changes how natural the conversation feels.
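
To put a number on that difference, the TTFT values from the table can be compared directly. This is a small illustrative sketch; the ttft_saving helper is a hypothetical name, and the only data in it is copied from the table above.

```python
from typing import Tuple

# TTFT values in milliseconds, copied from the benchmark table.
TTFT_MS = {
    "Chat Complete": {"p50": 481.92, "p95": 516.38},
    "Chat Response": {"p50": 437.59, "p95": 469.18},
    "WebSocket": {"p50": 393.27, "p95": 421.98},
}


def ttft_saving(
    baseline: str, candidate: str, percentile: str = "p50"
) -> Tuple[float, float]:
    """Return (milliseconds saved, percent saved) relative to the baseline."""
    base = TTFT_MS[baseline][percentile]
    cand = TTFT_MS[candidate][percentile]
    saved_ms = base - cand
    return round(saved_ms, 2), round(100.0 * saved_ms / base, 1)


if __name__ == "__main__":
    for name in ("Chat Response", "WebSocket"):
        ms, pct = ttft_saving("Chat Complete", name)
        print(f"{name}: {ms} ms sooner at p50 ({pct}% lower TTFT)")
```

At p50, WebSocket delivers the first token 88.65 ms sooner than Chat Complete (about 18.4% lower), and Chat Response delivers it 44.33 ms sooner (about 9.2% lower).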

This is the broader engineering lesson: voice AI latency is not one number. It is a chain of small delays stacked together. STT latency decides when the transcript is ready. LLM TTFT
decides when the response can start. TTS time to first byte (TTFB) decides when audio can begin. Streaming decides whether the caller hears a smooth response or waits for a delayed block.
If any stage waits too long, the whole call feels slow.
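
One rough way to reason about the chain is to add the stage delays for a single turn. The sketch below assumes a simplified, strictly sequential model in which TTS starts on the first LLM token; real pipelines overlap stages and usually buffer a few tokens first. The TurnLatency class is hypothetical, and the STT and TTS values in the usage example are placeholders, not measurements. Only the LLM TTFT figure comes from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnLatency:
    """Per-turn stage delays in milliseconds (simplified sequential model)."""
    stt_final_ms: float  # end of caller speech -> transcript ready
    llm_ttft_ms: float   # LLM request sent -> first token
    tts_ttfb_ms: float   # first text sent to TTS -> first audio byte

    def time_to_first_audio_ms(self) -> float:
        """Approximate silence the caller hears before audio starts."""
        return self.stt_final_ms + self.llm_ttft_ms + self.tts_ttfb_ms

    def slowest_stage(self) -> str:
        """Name of the stage contributing the most delay."""
        stages = {
            "stt": self.stt_final_ms,
            "llm": self.llm_ttft_ms,
            "tts": self.tts_ttfb_ms,
        }
        return max(stages, key=stages.get)


if __name__ == "__main__":
    # 200 ms and 150 ms are placeholder values for illustration only.
    turn = TurnLatency(stt_final_ms=200.0, llm_ttft_ms=393.27, tts_ttfb_ms=150.0)
    print(f"{turn.time_to_first_audio_ms():.2f} ms, slowest: {turn.slowest_stage()}")
```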

That is why Rapida is built around streaming wherever possible. We stream audio in. We stream transcripts. We stream LLM deltas. We stream TTS audio back. The OpenAI transport work follows
the same principle. The LLM should not be treated as a black box with one fixed request path. Different APIs have different behavior, different event shapes, and different latency
characteristics. In a real-time voice system, those details matter.

The next step is to keep expanding the benchmark. I want larger runs across each transport, with the same model, same prompt, same system message, same token budget, same service tier, and
same prompt cache settings. I also want separate runs with tools disabled and with function calling enabled, because tool use changes the response shape and the streaming behavior.
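
A larger run needs two small pieces: a list of benchmark cells to cover and a consistent way to summarize the samples. The sketch below is one possible shape, not the harness used for the table above. The function names are hypothetical, and the nearest-rank percentile method is my assumption; other percentile definitions give slightly different values on small samples.

```python
import math
from itertools import product
from typing import Dict, List, Sequence

TRANSPORTS = ("chat_complete", "chat_response", "websocket")


def benchmark_cells() -> List[Dict[str, object]]:
    """Every transport, once with tools disabled and once with tools enabled."""
    return [
        {"transport": transport, "tools_enabled": tools}
        for transport, tools in product(TRANSPORTS, (False, True))
    ]


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Multiply before dividing to avoid floating-point drift in the rank.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summary in the same shape as the table: p50 and p95, plus sample count."""
    return {
        "n": len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
    }
```

Reporting the sample count next to p50 and p95 makes it easier to judge how much weight a given row deserves.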

The larger point is simple: voice AI latency is not only a model problem. It is an architecture problem. The model matters. The prompt matters. The provider matters. But the transport path
also matters, especially when every extra millisecond turns into silence on a live call.

That is why we are making this configurable in Rapida. If you are building production voice agents, you should measure this too.