tl;dr — pre-connect at startup, set language explicitly, end every turn with flush, and let the server chunk your text. Done right, a warm request from the same region delivers first audio in ~100–150 ms end-to-end.
This page is the single home for KugelAudio latency numbers. Other pages link here instead of quoting figures, so when numbers change they change in one place.

What to expect

| Measurement | Typical value | What it includes |
| --- | --- | --- |
| Inference TTFA | ~40–60 ms | Server-side only: model time from text chunk to first audio chunk. The floor for any deployment. |
| Warm end-to-end TTFA | ~100–150 ms | What a same-region client sees on a pre-connected socket with language set: network RTT + normalization + inference + first chunk delivery. Co-located (in-cluster) clients see as little as ~60 ms. |
| Cold first request | warm + ~100–500 ms | Adds the TCP + TLS + WebSocket handshake. The handshake costs several network round-trips, so it scales with your distance to the API: ~100–250 ms same-region, up to ~500 ms cross-continent. Pre-connect to take it off the hot path entirely. |
| Language auto-detection | +60–150 ms | Paid on every request that omits language while normalize is on. Set the language to skip it. |
| Word timestamps | +0 ms audio latency | Alignments arrive ~50–200 ms after each audio chunk; the audio itself is never delayed. See Word timestamps. |
These are indicative figures for kugel-3 on the production API, not a guarantee — region, network path, and load all move them. Before optimizing (or comparing vendors), measure your own deployment.

The three factors

End-to-end latency decomposes into three parts; each has different levers.
  1. Inference — the model itself. ~40–60 ms to first audio per text chunk. You don’t tune this directly; you avoid paying it more often than necessary (see chunking — every client-side flush forces a fresh model prefill).
  2. Processing — what happens to your text before inference. Language auto-detection (+60–150 ms when language is unset) is the big one; normalization itself is fast. Output resampling to non-native sample rates costs ~0.1 ms per chunk — negligible.
  3. Network — your RTT to the API, paid once per message exchange and several times during a connection handshake. Pick the closest region, and pre-connect so the handshake never lands in a user-visible request.
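
To see how these parts stack up for a given request, here is a minimal sketch that adds the indicative ranges from the table on this page. It assumes the costs are simply additive, which is a simplification, and the function and constant names are illustrative, not part of any SDK.

```python
from typing import Tuple

Range = Tuple[float, float]  # (low_ms, high_ms)

# Indicative figures copied from the table on this page (milliseconds).
WARM_END_TO_END: Range = (100, 150)      # same-region, pre-connected, language set
HANDSHAKE: Range = (100, 500)            # TCP + TLS + WebSocket, same-region to cross-continent
LANGUAGE_AUTODETECT: Range = (60, 150)   # only when language is unset and normalize is on


def add(a: Range, b: Range) -> Range:
    """Add two latency ranges bound by bound."""
    return (a[0] + b[0], a[1] + b[1])


def estimate_ttfa(pre_connected: bool, language_set: bool) -> Range:
    """Rough first-audio estimate; assumes the costs simply add up."""
    total = WARM_END_TO_END
    if not pre_connected:
        total = add(total, HANDSHAKE)
    if not language_set:
        total = add(total, LANGUAGE_AUTODETECT)
    return total


if __name__ == "__main__":
    for pre, lang in [(True, True), (True, False), (False, True), (False, False)]:
        low, high = estimate_ttfa(pre, lang)
        print(f"pre_connected={pre!s:5} language_set={lang!s:5} -> ~{low:.0f}-{high:.0f} ms")
```

Under this model the worst case (cold socket, no language) lands well above the warm figure, which is why pre-connecting and setting the language come first in the levers below.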

Levers

Pre-connect at startup

The single biggest fix. Without it, your first request pays the full WebSocket handshake; with it, the handshake happens at application startup where nobody is waiting.
Python:
from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()   # pay the handshake here, once

JavaScript:
const client = new KugelAudio({ apiKey: '...' });
await client.connect();   // pay the handshake here, once

Java:
KugelAudio client = KugelAudio.createConnected(
    KugelAudioOptions.builder("...").build()
);   // connects synchronously before returning
After pre-connect, every stream / streamingSession call reuses the pooled connection. Connections are also reusable across turns — see Turn lifecycle.
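
The handshake cost scales with distance because it is a fixed number of round-trips. The sketch below estimates it from RTT alone: one round-trip for TCP, one for a full TLS 1.3 handshake (two for TLS 1.2), and one for the HTTP Upgrade that opens the WebSocket. The RTT values are illustrative assumptions; DNS lookup, server processing time, and TLS session resumption are ignored.

```python
def handshake_ms(rtt_ms: float, tls_round_trips: int = 1) -> float:
    """Estimate connection setup time from round-trip time alone.

    Counts 1 RTT for the TCP handshake, `tls_round_trips` for TLS
    (1 for a full TLS 1.3 handshake, 2 for TLS 1.2), and 1 RTT for the
    HTTP Upgrade request that opens the WebSocket.
    """
    tcp_round_trips = 1
    websocket_upgrade_round_trips = 1
    return rtt_ms * (tcp_round_trips + tls_round_trips + websocket_upgrade_round_trips)


if __name__ == "__main__":
    # Hypothetical RTTs, chosen only to illustrate near vs. far clients.
    for label, rtt in [("same-region", 40.0), ("cross-continent", 150.0)]:
        print(f"{label}: RTT {rtt:.0f} ms -> handshake ~{handshake_ms(rtt):.0f} ms")
```

To apply it to your own deployment, measure the RTT from your servers to the API and substitute it for the hypothetical values.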

Set the language explicitly

When language is unset and normalization is on, the server auto-detects the language on every request (+60–150 ms). If you know the language, say so:
client.tts.stream(text="Guten Tag!", model_id="kugel-3", language="de")

Let the server chunk; flush once per turn

Client-side per-sentence flushing forces a fresh model prefill per segment — the most common self-inflicted latency bug. Send tokens as they arrive, flush exactly once at the end of the turn. Full guidance: Chunking & per-segment latency and Turn lifecycle.
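
A rough way to see the cost: count the flushes and multiply by the per-chunk inference figure quoted on this page. This sketch is a simplified model, not a measurement. It counts only the prefills forced by client-side flushes and does not model how the server chunks text internally; the helper names are illustrative.

```python
import re
from typing import List, Tuple

INFERENCE_TTFA_MS = (40, 60)  # per-chunk figure quoted on this page


def per_sentence_segments(text: str) -> List[str]:
    """Split after sentence-ending punctuation, as a per-sentence flusher would."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def prefill_overhead_ms(num_flushes: int) -> Tuple[int, int]:
    """Cumulative model start-up cost if every flush pays one prefill."""
    low, high = INFERENCE_TTFA_MS
    return (num_flushes * low, num_flushes * high)


if __name__ == "__main__":
    turn = "Sure. I can help with that. What is your order number?"
    segments = per_sentence_segments(turn)
    print("per-sentence flushes:", len(segments), prefill_overhead_ms(len(segments)))
    print("one flush per turn:  ", 1, prefill_overhead_ms(1))
```

The overhead grows linearly with the number of flushes, so short sentences are the worst case for a per-sentence flusher.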

Trade quality for per-segment speed

For real-time agents, optimize_streaming_latency (or an explicit num_diffusion_steps) cuts per-segment generation time with a modest quality trade-off. See Chunking & per-segment latency.

Pick the right region and sample rate

Use the region closest to your servers. Keep the native 24000 Hz sample rate when you can; lower rates work fine (resampling is ~0.1 ms per chunk) but never make anything faster.

Measuring TTFA correctly

Time-to-first-audio is the metric that matters for voice agents. Measure it correctly or you’ll chase the wrong bottleneck.

Pre-connect, then measure

Including the handshake in a TTFA measurement makes every other change look smaller than it is. Pre-connect first, start the clock after the connection is open:
Python:
import time
from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()                     # handshake paid here, not measured
assert client.is_connected()

start = time.perf_counter()
for chunk in client.tts.stream(
    text="Hello from KugelAudio.",
    model_id="kugel-3",
    language="en",
):
    if hasattr(chunk, "audio"):
        ttfa_ms = (time.perf_counter() - start) * 1000
        print(f"TTFA: {ttfa_ms:.1f} ms")
        break

JavaScript:
import { KugelAudio } from 'kugelaudio';

const client = new KugelAudio({ apiKey: '...' });
await client.connect();              // handshake paid here, not measured

const start = performance.now();
let first = true;
await client.tts.stream(
  { text: 'Hello from KugelAudio.', modelId: 'kugel-3', language: 'en' },
  {
    onChunk: () => {
      if (first) {
        console.log(`TTFA: ${(performance.now() - start).toFixed(1)} ms`);
        first = false;
      }
    },
  },
);

What to measure

Always report p50 and p95 over at least 20 warm requests, not single-shot numbers. TTFA has a long tail: the median alone hides it, and the p95 shows it.

| Metric | What it tells you |
| --- | --- |
| Inference TTFA | Server-side only: useful for comparing model / voice / parameter changes against a fixed network. |
| End-to-end TTFA | What the user actually feels. Includes network RTT + connect cost (if not pre-warmed) + normalizer + inference + first chunk. |
| p50 / p95 / p99 | Always over ≥ 20 warm requests; one-shot timings are meaningless. With fewer than about 100 samples, p99 is just the slowest request, so report it only on larger runs. |
| Chunk-to-chunk gap | After the first chunk, how long between subsequent chunks. Spikes here mean the network or playback buffer can’t keep up, not the model. |
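
The chunk-to-chunk gap is easy to derive once you record an arrival timestamp per chunk (for example with `time.perf_counter()` inside the chunk loop or callback). A minimal sketch, with illustrative helper names and a hypothetical spike threshold that you should tune to your chunk duration and playback buffer:

```python
from typing import List


def chunk_gaps_ms(arrival_times_s: List[float]) -> List[float]:
    """Return the gap in ms between each pair of consecutive chunk arrivals."""
    return [
        (later - earlier) * 1000
        for earlier, later in zip(arrival_times_s, arrival_times_s[1:])
    ]


def spikes(gaps_ms: List[float], threshold_ms: float) -> List[int]:
    """Indices of gaps that exceed the threshold."""
    return [i for i, gap in enumerate(gaps_ms) if gap > threshold_ms]


if __name__ == "__main__":
    # Hypothetical arrival times in seconds, with one stall before the fourth chunk.
    arrivals = [0.00, 0.05, 0.10, 0.40, 0.45]
    gaps = chunk_gaps_ms(arrivals)
    print("gaps (ms):", [round(g) for g in gaps])
    print("spike indices:", spikes(gaps, threshold_ms=150.0))
```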

Reference benchmark

The Java SDK ships a complete TTFA bench you can run against any endpoint (cloud or self-hosted):
cd sdks/java/benchmark
./gradlew run
TTFABench measures:
  • Cold TTFA (first request, includes handshake) vs pooled TTFA (subsequent requests, connection reused) — quantifies what pre-connecting saves on your network.
  • TTFA across chunking strategies (full-text, sentence, ≥20-char, clause, word) — the cost of small flushes.
  • RTF on long-form text.
Run it from inside your VPC or your customer’s region to get numbers that match what you’ll ship.

Common reporting mistakes

  • Including the handshake in TTFA. Cold-start cost that has nothing to do with the model. Pre-connect first.
  • Measuring against localhost. No realistic network RTT. Numbers will be 30–80 ms lower than production.
  • Single-shot timings. Cold caches, GC pauses, JIT, scheduler jitter — p95 over 20+ warm requests or it’s noise.
  • Mixing inference TTFA and end-to-end TTFA. Decide which one you’re reporting and label it. Comparing one to the other across vendors is how people end up with wrong “we’re slower than X” conclusions.
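
To avoid the single-shot trap, collect warm TTFA samples and summarize them. The sketch below uses the nearest-rank percentile definition and refuses to report on fewer than 20 samples, matching the guidance above. The function names are illustrative, and other percentile definitions (such as linear interpolation) give slightly different values.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_warm(ttfa_ms: List[float], min_samples: int = 20) -> Dict[str, float]:
    """Report p50 and p95 over warm requests; refuse to summarize too few."""
    if len(ttfa_ms) < min_samples:
        raise ValueError(f"need >= {min_samples} warm samples, got {len(ttfa_ms)}")
    return {"p50": percentile(ttfa_ms, 50), "p95": percentile(ttfa_ms, 95)}


if __name__ == "__main__":
    # Hypothetical warm TTFA samples in ms; replace with your own measurements.
    samples = [float(x) for x in range(101, 121)]
    print(summarize_warm(samples))
```

Feed it only timings taken after the connection is open, and label the result as inference or end-to-end TTFA.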

Next steps

  • Chunking & per-segment latency: why per-sentence flushing destroys TTFA, and the knobs that tune chunking.
  • Turn lifecycle: how turns start and end, session reuse, and the idle auto-flush.