Streaming best practices

This page is the canonical reference for using KugelAudio’s streaming TTS (/ws/tts/stream) with an LLM-driven voice agent. Read it before wiring up a streaming integration; the patterns below avoid the most common — and most expensive — mistakes we see in production. The shape of the rule is the same in every SDK:

One session per LLM turn. Keep the same StreamingSession open for the entire assistant turn. Don’t open a new session per sentence.
Send LLM tokens directly, without flush=true. The server already accumulates text and starts generating at natural sentence boundaries. Every client-side flush=true is a separate TTS request that pays the full model time-to-first-audio (TTFA) again and produces an audible gap.
Flush exactly once at the end of the turn. This emits any trailing text that hasn’t yet crossed a sentence boundary, then closes the session.

If you’re migrating from ElevenLabs: KugelAudio is more flush-sensitive than ElevenLabs because each explicit flush triggers a fresh model prefill. Code that flushes after every segment will work, but TTFA per segment will be dramatically worse than it needs to be.

Pre-warm the connection at startup

The first request after process start pays a one-time ~300–500 ms WebSocket handshake. If you let that land inside the first user turn, every TTFA number you report (and every user’s first impression) is inflated by that amount. Call client.connect() at startup so the handshake happens before any user is waiting:

Python

from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()   # pay the ~300–500 ms handshake here, once

JavaScript / TypeScript

import { KugelAudio } from 'kugelaudio';

const client = new KugelAudio({ apiKey: '...' });
await client.connect();   // pay the ~300–500 ms handshake here, once

Java

KugelAudio client = KugelAudio.createConnected(
    KugelAudioOptions.builder("...").build()
);   // connects synchronously before returning

After pre-connecting, every subsequent stream or streamingSession call reuses the pooled connection. See Measuring TTFA below for how to quantify what this saves you.

Why this matters

/ws/tts/stream is one logical TTS request per session, regardless of how many send calls you make. The server’s text buffer accumulates tokens and hands a complete chunk to the model the moment it sees a natural boundary (sentence punctuation, or the configured chunk_length_schedule threshold). Inside a single session, model state (KV cache, voice conditioning) is preserved across chunks so prosody stays natural. Calling flush=true mid-turn breaks that flow: the server treats the flush as a hard segment boundary, runs another full model prefill on whatever has been buffered, and only then emits audio. The cost of that prefill is the full model TTFA — the same cost you pay on the very first chunk of a session. Do it on every word and you pay model TTFA on every word.
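
The cost difference can be made concrete with a toy model of the buffering described above. This is a minimal sketch, not KugelAudio's server code: ToyTextBuffer, its sentence-boundary regex, and the rule for counting prefills are all assumptions chosen to mirror the description (the first chunk of a session pays a prefill, auto-emitted chunks after it reuse model state, and every explicit flush pays a prefill again).

```python
import re

# Assumption: a "natural boundary" is approximated as ., ! or ?
# (the real server's boundary rules are not specified on this page).
SENTENCE_END = re.compile(r"[.!?]")


class ToyTextBuffer:
    """Toy stand-in for the server-side text buffer. Hypothetical, for illustration only."""

    def __init__(self) -> None:
        self.buffer = ""
        self.chunks: list[str] = []   # text chunks handed to the model
        self.prefills = 0             # how many times full model TTFA is paid

    def _emit(self, text: str, forced: bool) -> None:
        if not text.strip():
            return                    # nothing speakable is buffered
        # Pay a prefill on the first chunk of the session, or on any
        # hard boundary created by an explicit flush.
        if forced or not self.chunks:
            self.prefills += 1
        self.chunks.append(text)

    def send(self, text: str, flush: bool = False) -> None:
        self.buffer += text
        if flush:
            self._emit(self.buffer, forced=True)
            self.buffer = ""
            return
        # Auto-chunking: hand over every complete sentence, keep the rest buffered.
        while (match := SENTENCE_END.search(self.buffer)) is not None:
            end = match.end()
            self._emit(self.buffer[:end], forced=False)
            self.buffer = self.buffer[end:]

    def flush(self) -> None:
        self.send("", flush=True)


TOKENS = ["Hello", " there", ".", " How", " are", " you", "?", " Bye"]

streamed = ToyTextBuffer()
for token in TOKENS:
    streamed.send(token)               # no per-token flush
streamed.flush()                       # single flush at the end of the turn

per_token = ToyTextBuffer()
for token in TOKENS:
    per_token.send(token, flush=True)  # anti-pattern: flush on every token

print("streamed prefills:", streamed.prefills)
print("per-token prefills:", per_token.prefills)
```

In this toy run the streamed session pays two prefills (the first chunk and the final flush), while per-token flushing pays eight, one per token.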

Pattern (correct)

Python

import asyncio
from openai import AsyncOpenAI
from kugelaudio import KugelAudio

openai = AsyncOpenAI()
kugel = KugelAudio(api_key="YOUR_API_KEY")

# play_audio() below stands in for your own playback code; define it before running.

async def speak_turn(user_message: str) -> None:
    llm = await openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )

    async with kugel.tts.streaming_session(
        voice_id=1071,
        model_id="kugel-2.5",
        language="en",
    ) as session:
        # Forward every LLM token directly. No flush=True per token,
        # no client-side sentence buffering — the server handles that.
        async for chunk in llm:
            token = chunk.choices[0].delta.content
            if not token:
                continue
            async for audio in session.send(token):
                play_audio(audio.audio)

        # Single flush at the end of the turn — emits any trailing
        # text that hasn't yet crossed a sentence boundary.
        async for audio in session.flush():
            play_audio(audio.audio)

asyncio.run(speak_turn("Tell me a short story."))

JavaScript / TypeScript

import { KugelAudio } from 'kugelaudio';
import OpenAI from 'openai';

const openai = new OpenAI();
const kugel = new KugelAudio({ apiKey: 'YOUR_API_KEY' });

// playAudio() below stands in for your own playback code; define it before running.

async function speakTurn(userMessage: string): Promise<void> {
  const session = kugel.tts.streamingSession(
    { voiceId: 1071, modelId: 'kugel-2.5', language: 'en' },
    { onChunk: (chunk) => playAudio(chunk.audio) },
  );
  await session.connect();

  const llm = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  // Forward every LLM token directly. No flush per token —
  // the server accumulates and chunks at sentence boundaries.
  for await (const chunk of llm) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) session.send(token);
  }

  // close() triggers the server-side final flush of trailing text,
  // streams the resulting audio through onChunk, then closes the WS.
  await session.close();
}

Java

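// Assumes `client` is an already-connected KugelAudio client, and that
// llmTokenStream (an Iterable<String>) and playAudio(...) come from your application.
StreamConfig config = StreamConfig.builder()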
    .voiceId(1071)
    .modelId("kugel-2.5")
    .language("en")
    .build();

try (StreamingSession session = client.streamingSession(config, new StreamCallbacks() {
    @Override
    public void onChunk(AudioChunk chunk) {
        playAudio(chunk.getAudio());
    }
})) {
    // Forward every LLM token directly — no client-side buffering.
    for (String token : llmTokenStream) {
        session.send(token);  // flush=false (default)
    }
    // Single flush at turn end — emits any trailing text.
    session.flush();
}

Chunk-size ordering — pick the largest you can

If you’re driving the session from a layer above raw LLM tokens (for example, a translation pipeline that emits clauses, or a router that batches output before sending), use the largest chunks you can. The ordering, from best to worst time-to-first-audio per emitted segment, is:

Chunk granularity	Verdict
Full turn in one `send`	Best possible. Use when the full text is available before TTS starts.
Sentence-level chunks	Recommended for streamed LLM output.
≥20-character chunks	Acceptable fallback when sentence boundaries aren’t yet available.
Clause-level (comma/semicolon)	Avoid. Each chunk pays model TTFA.
Word-level or sub-word	Don’t. Each chunk pays model TTFA — by far the most expensive shape.

Two important nuances:

Raw LLM tokens are fine as long as you send them without flush=true — the server’s text buffer reassembles them and only hands sentence-sized work to the model. The “word-level is bad” row above applies when you flush after each word, not when you send one word at a time without flushing.
We deliberately don’t publish exact ms figures here — they depend on region, voice, model, and GPU. The ordering is stable; the absolute numbers aren’t. If you want to reproduce the comparison for your own deployment, run the Java benchmark (TTFABench.chunkingStrategyBench) against your endpoint.
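
To see why the ordering holds, it can help to count how many flushed segments each granularity produces for the same text, since each flushed segment pays model TTFA once. The sketch below is illustrative only: split_segments, the regex-based splitting rules, and the sample text are assumptions, not part of any SDK.

```python
import re


def split_segments(text: str, granularity: str) -> list[str]:
    """Split text the way a client might before flushing each piece (hypothetical helper)."""
    if granularity == "full":
        parts = [text]
    elif granularity == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    elif granularity == "clause":
        parts = re.split(r"(?<=[.!?,;])\s+", text)
    elif granularity == "word":
        parts = text.split()
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return [part for part in parts if part]


TEXT = "Sure, I can help. First, open the app; then tap Settings. Done!"

# One flushed segment == one payment of model TTFA.
flush_counts = {
    name: len(split_segments(TEXT, name))
    for name in ("full", "sentence", "clause", "word")
}
print(flush_counts)
```

For this sample the counts are 1, 3, 6, and 12, which matches the table's ordering: the finer the flushed granularity, the more often model TTFA is paid for the same text.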

Tuning server-side auto-chunking

You rarely need this, but two StreamConfig parameters let you trade prosody context for lower first-chunk latency, without any client-side flushing:

Parameter	Type	Default	Effect
`chunk_length_schedule`	`list[int]`	`[5, 80, 150, 250]`	Minimum buffer size (chars) before each successive chunk auto-emits.
`auto_mode`	`bool`	`false`	Start at the first clean sentence boundary (equivalent to ElevenLabs `auto_mode=true`).

Use the defaults unless you’ve measured a problem.
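
One plausible reading of chunk_length_schedule can be sketched in a few lines. Everything here is an assumption made for illustration, not documented server behavior: that the last schedule value keeps applying once the list is exhausted, and that a chunk auto-emits at the first sentence end after the buffer reaches the current threshold. The helper names are hypothetical.

```python
DEFAULT_SCHEDULE = [5, 80, 150, 250]


def min_chars_for_chunk(chunk_index: int, schedule: list[int]) -> int:
    """Minimum buffered chars before chunk number chunk_index (0-based) may auto-emit.

    Assumption: once the schedule is exhausted, its last value keeps applying.
    """
    return schedule[min(chunk_index, len(schedule) - 1)]


def auto_chunks(text: str, schedule: list[int]) -> list[str]:
    """Toy auto-chunker: emit at the first sentence end once the threshold is met."""
    chunks: list[str] = []
    buffer = ""
    for char in text:
        buffer += char
        threshold = min_chars_for_chunk(len(chunks), schedule)
        if len(buffer) >= threshold and char in ".!?":
            chunks.append(buffer)
            buffer = ""
    if buffer.strip():
        chunks.append(buffer)  # end-of-turn flush of whatever is left
    return chunks


TEXT = "Okay. Let me check that for you right now. It looks like your order shipped yesterday."

eager = auto_chunks(TEXT, DEFAULT_SCHEDULE)        # small first threshold
patient = auto_chunks(TEXT, [50, 80, 150, 250])    # larger first threshold

print(repr(eager[0]))
print(repr(patient[0]))
```

Under these assumptions the default schedule lets the short opener go out on its own, which lowers first-chunk latency, while a larger first threshold holds text back so the model sees more context before it starts.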

Measuring TTFA (performance testing)

Time-to-first-audio (TTFA) is the metric that matters for voice agents. Measure it correctly or you’ll chase the wrong bottleneck.

Pre-connect, then measure

The first request after process start includes the WebSocket handshake (~300–500 ms). Including that in your TTFA measurement makes every other change look smaller than it is. Always pre-connect at startup and start the clock after the connection is open.

Python

import time
from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()                     # pay the ~300–500 ms handshake here, once
assert client.is_connected()

start = time.perf_counter()
for chunk in client.tts.stream(
    text="Hello from KugelAudio.",
    model_id="kugel-2.5",
):
    if hasattr(chunk, "audio"):
        ttfa_ms = (time.perf_counter() - start) * 1000
        print(f"TTFA: {ttfa_ms:.1f} ms")
        break

JavaScript / TypeScript

import { KugelAudio } from 'kugelaudio';

const client = new KugelAudio({ apiKey: '...' });
await client.connect();              // pay the ~300–500 ms handshake here, once

const start = performance.now();
let first = true;
await client.tts.stream(
  { text: 'Hello from KugelAudio.', modelId: 'kugel-2.5' },
  {
    onChunk: () => {
      if (first) {
        console.log(`TTFA: ${(performance.now() - start).toFixed(1)} ms`);
        first = false;
      }
    },
  },
);

What to measure

Always report p50 and p95 over at least 20 warm requests, not single-shot numbers. TTFA has a long tail, so the median on its own hides the slow requests that users notice; the p95 exposes them.

Metric	What it tells you
Inference TTFA	Server-side only — useful for comparing model / voice / parameter changes against a fixed network.
End-to-end TTFA	What the user actually feels. Includes network RTT + connect cost (if not pre-warmed) + normalizer + first chunk.
p50 / p95 / p99	Always over ≥ 20 warm requests; one-shot timings are meaningless.
Chunk-to-chunk gap	After the first chunk, how long between subsequent chunks. Spikes here mean the network or playback buffer can’t keep up, not the model.
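
The percentile and gap metrics in the table can be computed from raw timings with the standard library alone. This is a minimal sketch using the nearest-rank percentile definition; the function names and the sample numbers are made up for illustration and are not measurements of any endpoint.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_ttfa(samples_ms: list[float], min_samples: int = 20) -> dict[str, float]:
    """Return p50/p95/p99, refusing to summarize fewer than min_samples timings."""
    if len(samples_ms) < min_samples:
        raise ValueError(f"need at least {min_samples} warm samples, got {len(samples_ms)}")
    return {f"p{pct}": percentile(samples_ms, pct) for pct in (50, 95, 99)}


def chunk_gaps_ms(arrival_times_s: list[float]) -> list[float]:
    """Gaps between consecutive chunk arrival timestamps, in milliseconds."""
    return [
        (later - earlier) * 1000
        for earlier, later in zip(arrival_times_s, arrival_times_s[1:])
    ]


# Illustrative numbers only: 18 typical requests plus two slow ones.
warm_ttfa_ms = [200.0] * 18 + [650.0, 900.0]
summary = summarize_ttfa(warm_ttfa_ms)
print(summary)

# Timestamps as you would record them with time.perf_counter() in the chunk callback.
gaps = chunk_gaps_ms([0.0, 0.25, 0.5, 1.5])
print(gaps)
```

With these made-up samples the p50 is 200 ms while the p95 is 650 ms, which is the long-tail effect the table warns about. With only 20 samples the p99 is simply the slowest request, so collect more before reading much into it.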

Reference benchmark

The Java SDK ships a complete TTFA bench you can run against any endpoint (cloud or self-hosted):

cd sdks/java/benchmark
./gradlew run

TTFABench measures:

Cold TTFA (first request, includes handshake) vs pooled TTFA (subsequent requests, connection reused) — quantifies what pre-connecting saves you.
TTFA across chunking strategies (full-text, sentence, ≥20-char, clause, word) so you can see the cost of small flushes on your network.
RTF on long-form text.

Run it from inside your VPC or your customer’s region to get numbers that match what you’ll ship.
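
RTF (real-time factor) is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. A minimal sketch of that arithmetic follows. It assumes raw 16-bit mono PCM at 24 kHz, which is an assumption made for the example (check the format your session actually returns), and it is not necessarily how TTFABench computes the figure.

```python
def pcm_duration_s(
    num_bytes: int,
    sample_rate_hz: int,
    bytes_per_sample: int = 2,   # assumption: 16-bit samples
    channels: int = 1,           # assumption: mono
) -> float:
    """Duration of a raw PCM buffer in seconds."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


def real_time_factor(synthesis_s: float, audio_s: float) -> float:
    """Wall-clock synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_s / audio_s


# Illustrative numbers only: 480,000 bytes of 24 kHz 16-bit mono PCM, received over 2.5 s.
audio_s = pcm_duration_s(480_000, sample_rate_hz=24_000)
rtf = real_time_factor(synthesis_s=2.5, audio_s=audio_s)
print(f"audio: {audio_s:.1f} s, RTF: {rtf:.2f}")
```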

Common TTFA reporting mistakes

Including handshake in TTFA. ~300–500 ms of cold-start that has nothing to do with the model. Pre-connect first.
Measuring against localhost. No realistic network RTT. Numbers will be 30–80 ms lower than production.
Single-shot timings. Cold caches, GC pauses, JIT, scheduler jitter — p95 over 20+ warm requests or it’s noise.
Mixing inference TTFA and end-to-end TTFA. Decide which one you’re reporting and label it. Comparing one to the other across vendors is how people end up with wrong “we’re slower than X” conclusions.

Common mistakes

Per-segment flush=true. Every flush is a fresh TTS request that pays the full model TTFA. If you flush after every sentence, you pay it N times per turn instead of once.
One session per sentence. A new WebSocket handshake (~200–300 ms) plus a fresh model prefill, every sentence. Keep the same session open for the whole assistant turn; only close it when the turn ends.
Client-side sentence buffering before send. Unnecessary — the server already buffers tokens and chunks at sentence boundaries. Pre-buffering on the client just adds latency.
Calling send(text, flush=true) per word “for lower latency.” It has the opposite effect: each flush is a separate model call, so word-granular flushing produces the worst possible TTFA.

Migrating from ElevenLabs

ElevenLabs’ text_chunker flushes on every internal trigger, and its WebSocket protocol is more forgiving of mid-stream flushes because each flush is comparatively cheap. KugelAudio’s /ws/tts/stream is not: each flush triggers a fresh model prefill. The mechanical translation, “flush=True on KugelAudio == flush=true on ElevenLabs”, is the single most common source of bad TTFA when porting an existing ElevenLabs integration. The right translation:

ElevenLabs pattern	KugelAudio equivalent
`send(text, flush=True)` after every chunk	`send(text)` with no flush; let the server’s text buffer chunk.
`try_trigger_generation=True`	Default behavior. The server starts generation at sentence boundaries automatically.
`auto_mode=true`	Same name on KugelAudio (`StreamConfig.auto_mode`).
One context per turn	One `StreamingSession` per turn.

A dedicated ElevenLabs → KugelAudio migration guide is tracked separately; this section covers the streaming-protocol differences only.
