What to expect
| Measurement | Typical value | What it includes |
|---|---|---|
| Inference TTFA | ~40–60 ms | Server-side only: model time from text chunk to first audio chunk. The floor for any deployment. |
| Warm end-to-end TTFA | ~100–150 ms | What a same-region client sees on a pre-connected socket with language set: network RTT + normalization + inference + first chunk delivery. Co-located (in-cluster) clients see as little as ~60 ms. |
| Cold first request | warm + ~100–500 ms | Adds the TCP + TLS + WebSocket handshake. The handshake costs several network round-trips, so it scales with your distance to the API: ~100–250 ms same-region, up to ~500 ms cross-continent. Pre-connect to take it off the hot path entirely. |
| Language auto-detection | +60–150 ms | Paid on every request that omits language while normalize is on. Set the language to skip it. |
| Word timestamps | +0 ms audio latency | Alignments arrive ~50–200 ms after each audio chunk; the audio itself is never delayed. See Word timestamps. |
These are indicative figures for kugel-3 on the production API, not a guarantee: region, network path, and load all move them. Before optimizing (or comparing vendors), measure your own deployment.
The three factors
End-to-end latency decomposes into three parts; each has different levers.
- Inference: the model itself. ~40–60 ms to first audio per text chunk. You don’t tune this directly; you avoid paying it more often than necessary (see chunking: every client-side flush forces a fresh model prefill).
- Processing: what happens to your text before inference. Language auto-detection (+60–150 ms when language is unset) is the big one; normalization itself is fast. Output resampling to non-native sample rates costs ~0.1 ms per chunk, which is negligible.
- Network: your RTT to the API, paid once per message exchange and several times during a connection handshake. Pick the closest region, and pre-connect so the handshake never lands in a user-visible request.
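
As a rough illustration of how these parts add up, the sketch below combines the indicative ranges from the table in "What to expect" into a best-case/worst-case budget. The `Range` type and `ttfa_budget` function are hypothetical helpers written for this page, not part of any SDK, and the numbers are the same indicative figures, not guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A latency range in milliseconds (best case, worst case)."""
    lo: float
    hi: float

    def __add__(self, other: "Range") -> "Range":
        return Range(self.lo + other.lo, self.hi + other.hi)


# Indicative figures copied from the table; measure your own deployment.
WARM_END_TO_END = Range(100, 150)    # same-region, pre-connected, language set
LANGUAGE_DETECTION = Range(60, 150)  # paid when language is omitted
COLD_HANDSHAKE = Range(100, 500)     # TCP + TLS + WebSocket; upper end is cross-continent


def ttfa_budget(language_set: bool, preconnected: bool) -> Range:
    """Add each avoidable penalty to the warm baseline."""
    total = WARM_END_TO_END
    if not language_set:
        total = total + LANGUAGE_DETECTION
    if not preconnected:
        total = total + COLD_HANDSHAKE
    return total


if __name__ == "__main__":
    for language_set in (True, False):
        for preconnected in (True, False):
            b = ttfa_budget(language_set, preconnected)
            print(f"language_set={language_set} preconnected={preconnected}: "
                  f"~{b.lo:.0f}-{b.hi:.0f} ms")
```

The worst case (no language, no pre-connect) is several times the warm baseline, which is why the first two levers below matter most.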
Levers
Pre-connect at startup
The single biggest fix. Without it, your first request pays the full WebSocket handshake; with it, the handshake happens at application startup, where nobody is waiting. Every subsequent stream / streamingSession call reuses the pooled connection. Connections are also reusable across turns; see Turn lifecycle.
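
The pre-connect pattern can be sketched independently of any SDK. The example below is a minimal illustration, assuming a hypothetical `open_connection` callable that performs the handshake and returns a connection object. `WarmConnection` is a placeholder name for this page, not the SDK's pooling API.

```python
import threading
from typing import Callable, Optional


class WarmConnection:
    """Opens one connection ahead of time and reuses it for every request."""

    def __init__(self, open_connection: Callable[[], object]):
        self._open = open_connection  # hypothetical: performs the full handshake
        self._conn: Optional[object] = None
        self._lock = threading.Lock()

    def preconnect(self) -> None:
        """Call once at application startup, before any user is waiting."""
        self.get()

    def get(self) -> object:
        """Return the pooled connection, opening it only if none exists yet."""
        with self._lock:
            if self._conn is None:
                self._conn = self._open()  # handshake cost is paid here
            return self._conn


if __name__ == "__main__":
    handshakes = []

    def fake_open() -> object:
        handshakes.append(1)  # stand-in for TCP + TLS + WebSocket setup
        return object()

    pool = WarmConnection(fake_open)
    pool.preconnect()    # startup: handshake happens here
    first = pool.get()   # request path: no new handshake
    second = pool.get()
    print(len(handshakes), first is second)
```

This sketch does not handle dropped connections; a real pool would also need to detect a closed socket and reconnect.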
Set the language explicitly
When language is unset and normalization is on, the server auto-detects the language on every request (+60–150 ms). If you know the language, set it explicitly on the request.
Let the server chunk; flush once per turn
Client-side per-sentence flushing forces a fresh model prefill per segment, which makes it the most common self-inflicted latency bug. Send tokens as they arrive, and flush exactly once at the end of the turn. Full guidance: Chunking & per-segment latency and Turn lifecycle.
Trade quality for per-segment speed
For real-time agents, optimize_streaming_latency (or an explicit num_diffusion_steps) cuts per-segment generation time with a modest quality trade-off. See Chunking & per-segment latency.
Pick the right region and sample rate
Use the region closest to your servers. Keep the native 24000 Hz sample rate when you can; lower rates work fine (resampling is ~0.1 ms per chunk) but never make anything faster.
Measuring TTFA correctly
Time-to-first-audio (TTFA) is the metric that matters for voice agents. Measure it correctly or you’ll chase the wrong bottleneck.
Pre-connect, then measure
Including the handshake in a TTFA measurement makes every other change look smaller than it is. Pre-connect first, then start the clock only after the connection is open.
What to measure
Always report p50 and p95 over at least 20 warm requests, not single-shot numbers. TTFA has a long tail; the median lies, the p95 doesn’t.

| Metric | What it tells you |
|---|---|
| Inference TTFA | Server-side only — useful for comparing model / voice / parameter changes against a fixed network. |
| End-to-end TTFA | What the user actually feels. Includes network RTT + connect cost (if not pre-warmed) + normalizer + first chunk. |
| p50 / p95 / p99 | Always over ≥ 20 warm requests; one-shot timings are meaningless. |
| Chunk-to-chunk gap | After the first chunk, how long between subsequent chunks. Spikes here mean the network or playback buffer can’t keep up, not the model. |
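
A minimal measurement sketch under these rules. It assumes a hypothetical `start_stream` callable that sends text on an already-open connection and returns an iterator of audio chunks; that callable stands in for whatever client you use and is not an SDK function. The percentile uses the nearest-rank method, and the `discard` warm-up count is an illustrative choice.

```python
import math
import time
from typing import Callable, Dict, Iterable, List, Tuple


def time_request(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, List[float]]:
    """Return (TTFA in ms, chunk-to-chunk gaps in ms) for one request.

    The clock starts immediately before the request is sent, so the
    connection must already be open for this to be a warm measurement.
    """
    start = time.perf_counter()
    # Record the arrival time of every chunk as it is received.
    arrivals = [time.perf_counter() for _ in start_stream()]
    if not arrivals:
        raise ValueError("stream produced no audio chunks")
    ttfa_ms = (arrivals[0] - start) * 1000.0
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(arrivals, arrivals[1:])]
    return ttfa_ms, gaps_ms


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure(start_stream: Callable[[], Iterable[bytes]],
            warm_requests: int = 20, discard: int = 1) -> Dict[str, float]:
    """Run discard + warm_requests requests; summarize TTFA over the warm ones."""
    runs = [time_request(start_stream)[0] for _ in range(discard + warm_requests)]
    ttfas = runs[discard:]
    return {"n": len(ttfas), "p50": percentile(ttfas, 50), "p95": percentile(ttfas, 95)}
```

When you report the resulting p50 and p95, state whether they are inference or end-to-end TTFA.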
Reference benchmark
The Java SDK ships a complete TTFA bench you can run against any endpoint (cloud or self-hosted).
TTFABench measures:
- Cold TTFA (first request, includes handshake) vs pooled TTFA (subsequent requests, connection reused) — quantifies what pre-connecting saves on your network.
- TTFA across chunking strategies (full-text, sentence, ≥20-char, clause, word) — the cost of small flushes.
- RTF on long-form text.
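
RTF (real-time factor) is conventionally the wall-clock synthesis time divided by the duration of the audio produced; values below 1 mean audio is generated faster than it plays. The sketch below shows the arithmetic, assuming 16-bit mono PCM at the native 24000 Hz rate. The encoding is an assumption made here; adjust `bytes_per_sample` and `channels` to your output format. These helpers are illustrative and are not taken from TTFABench.

```python
def audio_seconds(num_bytes: int, sample_rate: int = 24000,
                  bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio in seconds."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


def real_time_factor(synthesis_seconds: float, num_bytes: int, **pcm: int) -> float:
    """RTF = time spent generating / duration of audio generated."""
    duration = audio_seconds(num_bytes, **pcm)
    if duration <= 0:
        raise ValueError("no audio produced")
    return synthesis_seconds / duration


if __name__ == "__main__":
    # 480,000 bytes of 16-bit mono PCM at 24 kHz is 10 s of audio;
    # generating it in 2 s gives an RTF of 0.2.
    print(real_time_factor(2.0, 480_000))
```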
Common reporting mistakes
- Including the handshake in TTFA. Cold-start cost that has nothing to do with the model. Pre-connect first.
- Measuring against localhost. No realistic network RTT. Numbers will be 30–80 ms lower than production.
- Single-shot timings. Cold caches, GC pauses, JIT, scheduler jitter: report p95 over 20+ warm requests or it’s noise.
- Mixing inference TTFA and end-to-end TTFA. Decide which one you’re reporting and label it. Comparing one to the other across vendors is how people end up with wrong “we’re slower than X” conclusions.
Next steps
Chunking & per-segment latency
Why per-sentence flushing destroys TTFA, and the knobs that tune chunking
Turn lifecycle
How turns start and end, session reuse, and the idle auto-flush