Latency - KugelAudio

tl;dr — pre-connect at startup, set language explicitly, end every turn with flush, and let the server chunk your text. Done right, a warm request from the same region delivers first audio in ~100–150 ms end-to-end.

This page is the single home for KugelAudio latency numbers. Other pages link here instead of quoting figures, so when numbers change they change in one place.

What to expect

Measurement	Typical value	What it includes
Inference TTFA	~40–60 ms	Server-side only: model time from text chunk to first audio chunk. The floor for any deployment.
Warm end-to-end TTFA	~100–150 ms	What a same-region client sees on a pre-connected socket with `language` set: network RTT + normalization + inference + first chunk delivery. Co-located (in-cluster) clients see as little as ~60 ms.
Cold first request	warm + ~100–500 ms	Adds the TCP + TLS + WebSocket handshake. The handshake costs several network round-trips, so it scales with your distance to the API: ~100–250 ms same-region, up to ~500 ms cross-continent. Pre-connect to take it off the hot path entirely.
Language auto-detection	+60–150 ms	Paid on every request that omits `language` while `normalize` is on. Set the language to skip it.
Word timestamps	+0 ms audio latency	Alignments arrive ~50–200 ms after each audio chunk; the audio itself is never delayed. See Word timestamps.

These are indicative figures for kugel-3 on the production API, not a guarantee — region, network path, and load all move them. Before optimizing (or comparing vendors), measure your own deployment.
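
As a rough illustration of how the rows above combine, the sketch below adds the cold-connection and auto-detection penalties to the warm baseline. The function name `estimate_ttfa_ms` and its parameters are made up for this example and are not part of any SDK; the ranges are the indicative figures from the table, and summing the extremes gives a loose envelope, not a measurement.

```python
from typing import Tuple

# Indicative ranges in milliseconds, copied from the table above.
WARM_E2E_MS = (100, 150)        # same region, pre-connected, language set
HANDSHAKE_MS = (100, 500)       # TCP + TLS + WebSocket, grows with distance
LANGUAGE_DETECT_MS = (60, 150)  # only when language is unset and normalize is on


def estimate_ttfa_ms(
    preconnected: bool, language_set: bool, normalize: bool = True
) -> Tuple[int, int]:
    """Return a rough (low, high) end-to-end TTFA estimate in milliseconds."""
    low, high = WARM_E2E_MS
    if not preconnected:
        # The first request on a cold client pays the full handshake.
        low += HANDSHAKE_MS[0]
        high += HANDSHAKE_MS[1]
    if normalize and not language_set:
        # Auto-detection runs on every request that omits the language.
        low += LANGUAGE_DETECT_MS[0]
        high += LANGUAGE_DETECT_MS[1]
    return low, high


print(estimate_ttfa_ms(preconnected=True, language_set=True))    # (100, 150)
print(estimate_ttfa_ms(preconnected=False, language_set=False))  # (260, 800)
```

The second call shows why the two cheapest fixes come first: a cold request without a language can cost several times the warm baseline before the model has done anything different.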

The three factors

End-to-end latency decomposes into three parts; each has different levers.

Inference — the model itself. ~40–60 ms to first audio per text chunk. You don’t tune this directly; you avoid paying it more often than necessary (see chunking — every client-side flush forces a fresh model prefill).
Processing — what happens to your text before inference. Language auto-detection (+60–150 ms when language is unset) is the big one; normalization itself is fast. Output resampling to non-native sample rates costs ~0.1 ms per chunk — negligible.
Network — your RTT to the API, paid once per message exchange and several times during a connection handshake. Pick the closest region, and pre-connect so the handshake never lands in a user-visible request.
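
To see why the handshake scales with distance, it helps to count round-trips. The sketch below assumes one round-trip for the TCP handshake, one for TLS 1.3 (two for TLS 1.2), and one for the HTTP Upgrade exchange that opens the WebSocket. It ignores DNS lookups, server processing time, and session resumption, and the RTT values in the example are illustrative, not measured against the KugelAudio API.

```python
def handshake_ms(rtt_ms: float, tls_round_trips: int = 1) -> float:
    """Estimate connection setup time as a multiple of the network RTT.

    Counts 1 RTT for TCP, `tls_round_trips` for TLS (1 for TLS 1.3,
    2 for TLS 1.2), and 1 for the WebSocket HTTP Upgrade exchange.
    """
    tcp_round_trips = 1
    websocket_upgrade_round_trips = 1
    total = tcp_round_trips + tls_round_trips + websocket_upgrade_round_trips
    return rtt_ms * total


# Illustrative RTTs (assumptions): nearby region vs. another continent.
print(handshake_ms(40))    # 120
print(handshake_ms(150))   # 450
```

Under these assumptions a 40 ms RTT yields about 120 ms of setup and a 150 ms RTT about 450 ms, which is in line with the same-region and cross-continent ranges quoted above.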

Levers

Pre-connect at startup

The single biggest fix. Without it, your first request pays the full WebSocket handshake; with it, the handshake happens at application startup where nobody is waiting.

Python:

from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()   # pay the handshake here, once

JavaScript:

const client = new KugelAudio({ apiKey: '...' });
await client.connect();   // pay the handshake here, once

Java:

KugelAudio client = KugelAudio.createConnected(
    KugelAudioOptions.builder("...").build()
);   // connects synchronously before returning

After pre-connecting, every stream / streamingSession call reuses the pooled connection. Connections are also reusable across turns; see Turn lifecycle.

Set the language explicitly

When language is unset and normalization is on, the server auto-detects the language on every request (+60–150 ms). If you know the language, say so:

client.tts.stream(text="Guten Tag!", model_id="kugel-3", language="de")

Let the server chunk; flush once per turn

Client-side per-sentence flushing forces a fresh model prefill per segment — the most common self-inflicted latency bug. Send tokens as they arrive, flush exactly once at the end of the turn. Full guidance: Chunking & per-segment latency and Turn lifecycle.
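
The difference between the two patterns can be sketched with a stand-in session object. `FakeSession`, `send`, and `flush` below are hypothetical names used only to count calls; they are not the KugelAudio SDK API, whose real session methods are described in Turn lifecycle.

```python
from typing import Iterable, List


class FakeSession:
    """Stand-in for a streaming session; NOT the KugelAudio SDK API.

    It only records calls so the two patterns can be compared.
    """

    def __init__(self) -> None:
        self.sent: List[str] = []
        self.flushes = 0

    def send(self, text: str) -> None:
        self.sent.append(text)

    def flush(self) -> None:
        # On the real service, each client-side flush forces a fresh prefill.
        self.flushes += 1


def speak_turn(session: FakeSession, tokens: Iterable[str]) -> None:
    """Recommended: forward tokens as they arrive, flush once at turn end."""
    for token in tokens:
        session.send(token)
    session.flush()


def speak_turn_per_sentence(session: FakeSession, tokens: Iterable[str]) -> None:
    """Anti-pattern: flush after every sentence-ending token."""
    for token in tokens:
        session.send(token)
        if token.rstrip().endswith((".", "!", "?")):
            session.flush()


tokens = ["Hello", " there.", " How", " are", " you?", " Fine."]
good, bad = FakeSession(), FakeSession()
speak_turn(good, tokens)
speak_turn_per_sentence(bad, tokens)
print(good.flushes, bad.flushes)  # 1 3
```

Both sessions receive the same text; only the number of flushes differs, and with it the number of prefills the model would have to pay for.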

Trade quality for per-segment speed

For real-time agents, optimize_streaming_latency (or an explicit num_diffusion_steps) cuts per-segment generation time with a modest quality trade-off. See Chunking & per-segment latency.

Pick the right region and sample rate

Use the region closest to your servers. Keep the native 24000 Hz sample rate when you can; lower rates work fine (resampling is ~0.1 ms per chunk) but never make anything faster.

Measuring TTFA correctly

Time-to-first-audio is the metric that matters for voice agents. Measure it correctly or you’ll chase the wrong bottleneck.

Pre-connect, then measure

Including the handshake in a TTFA measurement makes every other change look smaller than it is. Pre-connect first, then start the clock once the connection is open:

Python:

import time
from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()                     # handshake paid here, not measured
assert client.is_connected()

start = time.perf_counter()
for chunk in client.tts.stream(
    text="Hello from KugelAudio.",
    model_id="kugel-3",
    language="en",
):
    if hasattr(chunk, "audio"):
        ttfa_ms = (time.perf_counter() - start) * 1000
        print(f"TTFA: {ttfa_ms:.1f} ms")
        break

JavaScript:

import { KugelAudio } from 'kugelaudio';

const client = new KugelAudio({ apiKey: '...' });
await client.connect();              // handshake paid here, not measured

const start = performance.now();
let first = true;
await client.tts.stream(
  { text: 'Hello from KugelAudio.', modelId: 'kugel-3', language: 'en' },
  {
    onChunk: () => {
      if (first) {
        console.log(`TTFA: ${(performance.now() - start).toFixed(1)} ms`);
        first = false;
      }
    },
  },
);

What to measure

Always report p50 and p95 over at least 20 warm requests, not single-shot numbers. TTFA has a long tail, so the median alone hides the slow requests users actually notice; the p95 exposes them.
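
One way to collect those numbers is a small harness that times the first item of a stream over repeated runs and then takes nearest-rank percentiles. This is a sketch, not part of the SDK: `measure_ttfa_ms`, `percentile`, and the `open_stream` callable are names invented here, and the harness assumes `open_stream()` returns an iterable that yields only audio chunks on an already-connected client.

```python
import math
import time
from typing import Callable, Iterable, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with >= pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_ttfa_ms(
    open_stream: Callable[[], Iterable[object]], runs: int = 20, warmup: int = 2
) -> List[float]:
    """Time the first item of each stream; warm-up runs are discarded."""
    timings: List[float] = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        for _chunk in open_stream():
            elapsed_ms = (time.perf_counter() - start) * 1000
            break  # only the first chunk matters for TTFA
        else:
            raise RuntimeError("stream produced no chunks")
        if i >= warmup:
            timings.append(elapsed_ms)
    return timings


# Demo with a fake one-chunk stream standing in for a real request.
timings = measure_ttfa_ms(lambda: iter([b"\x00\x00"]), runs=20)
print(f"p50={percentile(timings, 50):.3f} ms  p95={percentile(timings, 95):.3f} ms")
```

With the client from the example above, `open_stream` could be something like `lambda: (c for c in client.tts.stream(text=..., model_id="kugel-3", language="en") if hasattr(c, "audio"))`, so that non-audio messages do not stop the clock early.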

Metric	What it tells you
Inference TTFA	Server-side only — useful for comparing model / voice / parameter changes against a fixed network.
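End-to-end TTFA	What the user actually feels. Includes network RTT + connect cost (if not pre-warmed) + normalization + inference + first chunk delivery.
p50 / p95 / p99	The shape of the distribution, including its tail. Compute over ≥ 20 warm requests; one-shot timings are meaningless, and a stable p99 needs a few hundred samples.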
Chunk-to-chunk gap	After the first chunk, how long between subsequent chunks. Spikes here mean the network or playback buffer can’t keep up, not the model.
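
The chunk-to-chunk gap in the last row can be captured by timestamping each chunk as it arrives. The helper below is a minimal sketch; `chunk_gaps_ms` is a name invented for this example, and it works on any iterable of chunks rather than on a specific SDK type.

```python
import time
from typing import Iterable, List, Optional


def chunk_gaps_ms(chunks: Iterable[object]) -> List[float]:
    """Return the delay in milliseconds between consecutive chunk arrivals."""
    gaps: List[float] = []
    previous: Optional[float] = None
    for _chunk in chunks:
        now = time.perf_counter()
        if previous is not None:
            gaps.append((now - previous) * 1000)
        previous = now
    return gaps


# Demo with a fake three-chunk stream: three arrivals give two gaps.
gaps = chunk_gaps_ms(iter([b"a", b"b", b"c"]))
print(len(gaps))  # 2
```

A useful comparison is the largest gap against the playback duration of one chunk: if a gap exceeds the audio already buffered, the listener hears a stall.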

Reference benchmark

The Java SDK ships a complete TTFA bench you can run against any endpoint (cloud or self-hosted):

cd sdks/java/benchmark
./gradlew run

TTFABench measures:

Cold TTFA (first request, includes handshake) vs pooled TTFA (subsequent requests, connection reused) — quantifies what pre-connecting saves on your network.
TTFA across chunking strategies (full-text, sentence, ≥20-char, clause, word) — the cost of small flushes.
Real-time factor (RTF) on long-form text.

Run it from inside your VPC or your customer’s region to get numbers that match what you’ll ship.
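
For reference, real-time factor is commonly defined as generation time divided by the duration of the audio produced, so values below 1 mean audio is generated faster than it plays. How TTFABench computes it is not stated here; the sketch below uses that common definition and assumes 16-bit mono PCM at the native 24000 Hz sample rate. Both function names are made up for this example.

```python
def audio_seconds(
    pcm_bytes: int,
    sample_rate: int = 24000,   # native rate mentioned on this page
    bytes_per_sample: int = 2,  # assumption: 16-bit PCM
    channels: int = 1,          # assumption: mono
) -> float:
    """Duration of raw PCM audio in seconds."""
    return pcm_bytes / (sample_rate * bytes_per_sample * channels)


def real_time_factor(generation_seconds: float, audio_secs: float) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    if audio_secs <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_secs


duration = audio_seconds(480_000)        # 480000 bytes -> 10.0 s of audio
print(real_time_factor(2.0, duration))   # 0.2
```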

Common reporting mistakes

Including the handshake in TTFA. Cold-start cost that has nothing to do with the model. Pre-connect first.
Measuring against localhost. No realistic network RTT. Numbers will be 30–80 ms lower than production.
Single-shot timings. Cold caches, GC pauses, JIT, scheduler jitter — p95 over 20+ warm requests or it’s noise.
Mixing inference TTFA and end-to-end TTFA. Decide which one you’re reporting and label it. Comparing one to the other across vendors is how people end up with wrong “we’re slower than X” conclusions.

Next steps

Chunking & per-segment latency

Turn lifecycle
