This page is the canonical reference for using KugelAudio’s streaming TTS (/ws/tts/stream) with an LLM-driven voice agent. Read it before wiring up a streaming integration; the patterns below avoid the most common — and most expensive — mistakes we see in production. The shape of the rule is the same in every SDK:
  1. One session per LLM turn. Keep the same StreamingSession open for the entire assistant turn. Don’t open a new session per sentence.
  2. Send LLM tokens directly, without flush=true. The server already accumulates text and starts generating at natural sentence boundaries. Every client-side flush=true is a separate TTS request that pays the full model time-to-first-audio (TTFA) again and produces an audible gap.
  3. Flush exactly once at the end of the turn. This emits any trailing text that hasn’t yet crossed a sentence boundary, then closes the session.
If you’re migrating from ElevenLabs: KugelAudio is more flush-sensitive than ElevenLabs because each explicit flush triggers a fresh model prefill. Code that flushes after every segment will work, but TTFA per segment will be dramatically worse than it needs to be.

Pre-warm the connection at startup

The first request after process start pays a one-time ~300–500 ms WebSocket handshake. If you let that land inside the first user turn, every TTFA number you report (and every user’s first impression) is inflated by that amount. Call client.connect() at startup so the handshake happens before any user is waiting:
Python

from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()   # pay the ~500 ms handshake here, once

JavaScript / TypeScript

import { KugelAudio } from 'kugelaudio';

const client = new KugelAudio({ apiKey: '...' });
await client.connect();   // pay the ~500 ms handshake here, once

Java

KugelAudio client = KugelAudio.createConnected(
    KugelAudioOptions.builder("...").build()
);   // connects synchronously before returning

After pre-connecting, every subsequent stream / streamingSession call reuses the pooled connection. See Measuring TTFA below for how to quantify what this saves you.

Why this matters

/ws/tts/stream is one logical TTS request per session, regardless of how many send calls you make. The server’s text buffer accumulates tokens and hands a complete chunk to the model the moment it sees a natural boundary (sentence punctuation, or the configured chunk_length_schedule threshold). Inside a single session, model state (KV cache, voice conditioning) is preserved across chunks so prosody stays natural. Calling flush=true mid-turn breaks that flow: the server treats the flush as a hard segment boundary, runs another full model prefill on whatever has been buffered, and only then emits audio. The cost of that prefill is the full model TTFA — the same cost you pay on the very first chunk of a session. Do it on every word and you pay model TTFA on every word.
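
To make the cost difference concrete, here is a minimal toy model of the buffering behavior described above. It is a sketch under stated assumptions, not the server's implementation: ToyTextBuffer, count_prefills, and the boundary regex are hypothetical names invented for illustration, and the model simply counts one prefill for the first chunk of a session plus one for every flushed chunk.

```python
import re

# Toy boundary rule (assumption): sentence punctuation followed by whitespace.
SENTENCE_END = re.compile(r"[.!?]\s")


class ToyTextBuffer:
    """Illustrative model of a server-side text buffer. Not the real server."""

    def __init__(self) -> None:
        self.pending = ""
        self.chunks: list[str] = []  # text handed to the imaginary model
        self.prefills = 0            # how many times full model TTFA is paid

    def _emit(self, text: str, forced: bool) -> None:
        # The first chunk of a session and every flushed chunk count as a
        # fresh prefill in this model; boundary chunks reuse session state.
        if forced or not self.chunks:
            self.prefills += 1
        self.chunks.append(text.strip())

    def send(self, text: str, flush: bool = False) -> None:
        self.pending += text
        if flush:
            if self.pending.strip():
                self._emit(self.pending, forced=True)
            self.pending = ""
            return
        # No flush: emit only complete sentences, keep the remainder buffered.
        while (match := SENTENCE_END.search(self.pending)) is not None:
            self._emit(self.pending[: match.end()], forced=False)
            self.pending = self.pending[match.end():]


def count_prefills(tokens: list[str], flush_each: bool) -> int:
    """Stream tokens through the toy buffer and count prefills for the turn."""
    buffer = ToyTextBuffer()
    for token in tokens:
        buffer.send(token, flush=flush_each)
    buffer.send("", flush=True)  # single end-of-turn flush
    return buffer.prefills
```

For the seven tokens of "Hello there. How are you today? Fine", this model counts two prefills when tokens are sent unflushed (the first sentence plus the end-of-turn flush) and seven when every token is flushed.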

Pattern (correct)

Python

import asyncio
from openai import AsyncOpenAI
from kugelaudio import KugelAudio

openai = AsyncOpenAI()
kugel = KugelAudio(api_key="YOUR_API_KEY")

async def speak_turn(user_message: str) -> None:
    llm = await openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )

    async with kugel.tts.streaming_session(
        voice_id=1071,
        model_id="kugel-2.5",
        language="en",
    ) as session:
        # Forward every LLM token directly. No flush=True per token,
        # no client-side sentence buffering; the server handles that.
        # play_audio is your application's own playback function.
        async for chunk in llm:
            # Some stream chunks carry no choices; skip them.
            if not chunk.choices:
                continue
            token = chunk.choices[0].delta.content
            if not token:
                continue
            async for audio in session.send(token):
                play_audio(audio.audio)

        # Single flush at the end of the turn — emits any trailing
        # text that hasn't yet crossed a sentence boundary.
        async for audio in session.flush():
            play_audio(audio.audio)

asyncio.run(speak_turn("Tell me a short story."))

JavaScript / TypeScript

import { KugelAudio } from 'kugelaudio';
import OpenAI from 'openai';

const openai = new OpenAI();
const kugel = new KugelAudio({ apiKey: 'YOUR_API_KEY' });

async function speakTurn(userMessage: string): Promise<void> {
  const session = kugel.tts.streamingSession(
    { voiceId: 1071, modelId: 'kugel-2.5', language: 'en' },
    { onChunk: (chunk) => playAudio(chunk.audio) },
  );
  await session.connect();

  const llm = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  // Forward every LLM token directly. No flush per token —
  // the server accumulates and chunks at sentence boundaries.
  for await (const chunk of llm) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) session.send(token);
  }

  // close() triggers the server-side final flush of trailing text,
  // streams the resulting audio through onChunk, then closes the WS.
  await session.close();
}

Java

StreamConfig config = StreamConfig.builder()
    .voiceId(1071)
    .modelId("kugel-2.5")
    .language("en")
    .build();

try (StreamingSession session = client.streamingSession(config, new StreamCallbacks() {
    @Override
    public void onChunk(AudioChunk chunk) {
        playAudio(chunk.getAudio());
    }
})) {
    // Forward every LLM token directly — no client-side buffering.
    for (String token : llmTokenStream) {
        session.send(token);  // flush=false (default)
    }
    // Single flush at turn end — emits any trailing text.
    session.flush();
}

Chunk-size ordering — pick the largest you can

If you’re driving the session from a layer above raw LLM tokens (for example, a translation pipeline that emits clauses, or a router that batches output before sending), use the largest chunks you can. The ordering, from best to worst time-to-first-audio per emitted segment, is:
| Chunk granularity | Verdict |
| --- | --- |
| Full turn in one send | Best possible. Use when the full text is available before TTS starts. |
| Sentence-level chunks | Recommended for streamed LLM output. |
| ≥20-character chunks | Acceptable fallback when sentence boundaries aren’t yet available. |
| Clause-level (comma/semicolon) | Avoid. Each chunk pays model TTFA. |
| Word-level or sub-word | Don’t. Each chunk pays model TTFA; by far the most expensive shape. |
Two important nuances:
  • Raw LLM tokens are fine as long as you send them without flush=true — the server’s text buffer reassembles them and only hands sentence-sized work to the model. The “word-level is bad” row above applies when you flush after each word, not when you send one word at a time without flushing.
  • We deliberately don’t publish exact ms figures here — they depend on region, voice, model, and GPU. The ordering is stable; the absolute numbers aren’t. If you want to reproduce the comparison for your own deployment, run the Java benchmark (TTFABench.chunkingStrategyBench) against your endpoint.
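
One way to see why the ordering is stable is to count how many separately flushed segments each granularity produces for the same text, since each flushed segment pays model TTFA once. The sketch below does only that counting; segment, flush_counts, and the splitting rules are hypothetical illustrations, and it says nothing about absolute milliseconds.

```python
import re


def segment(text: str, granularity: str) -> list[str]:
    """Split text the way a client flushing at this granularity would."""
    if granularity == "turn":
        parts = [text]
    elif granularity == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)
    elif granularity == "min20":
        # Greedily grow a segment word by word until it reaches 20 characters.
        parts, current = [], ""
        for word in text.split():
            current = f"{current} {word}".strip()
            if len(current) >= 20:
                parts.append(current)
                current = ""
        if current:
            parts.append(current)
    elif granularity == "clause":
        parts = re.split(r"(?<=[.!?,;])\s+", text)
    elif granularity == "word":
        parts = text.split()
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return [part.strip() for part in parts if part.strip()]


def flush_counts(text: str) -> dict[str, int]:
    """Number of flushed segments (model TTFA payments) per granularity."""
    order = ["turn", "sentence", "min20", "clause", "word"]
    return {name: len(segment(text, name)) for name in order}
```

For the sample text "Sure, I can help. First, open the app; then tap Settings." the counts are 1, 2, 3, 5, and 11 in table order, which is the same ordering as the verdicts above.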

Tuning server-side auto-chunking

You rarely need this, but two StreamConfig parameters let you trade prosody context for lower first-chunk latency, without any client-side flushing:
| Parameter | Type | Default | Effect |
| --- | --- | --- | --- |
| chunk_length_schedule | list[int] | [5, 80, 150, 250] | Minimum buffer size (chars) before each successive chunk auto-emits. |
| auto_mode | bool | false | Start at the first clean sentence boundary (equivalent to ElevenLabs auto_mode=true). |
Use the defaults unless you’ve measured a problem.
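
To build intuition for what a schedule like the default does, the following sketch simulates the "minimum buffer size before each successive chunk" rule in isolation. It is a simplified model, not the server's algorithm: it ignores sentence boundaries entirely, and reusing the last schedule entry for all later chunks is an assumption made here for illustration. schedule_chunks is a hypothetical helper name.

```python
from typing import Iterable, Sequence


def schedule_chunks(
    tokens: Iterable[str],
    schedule: Sequence[int] = (5, 80, 150, 250),
) -> list[str]:
    """Emit a chunk whenever the buffer reaches the current threshold."""
    chunks: list[str] = []
    buffer = ""
    for token in tokens:
        buffer += token
        # Chunk i uses schedule[i]; past the end, reuse the last entry
        # (assumption, not confirmed by the documentation).
        threshold = schedule[min(len(chunks), len(schedule) - 1)]
        if len(buffer) >= threshold:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        chunks.append(buffer)  # end-of-turn flush of the remainder
    return chunks
```

In this model a small first entry means the first chunk is handed off almost immediately with little context, while the growing later entries give subsequent chunks more text to work with. That is the latency versus prosody trade the parameter exposes.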

Measuring TTFA (performance testing)

Time-to-first-audio (TTFA) is the metric that matters for voice agents. Measure it correctly or you’ll chase the wrong bottleneck.

Pre-connect, then measure

The first request after process start includes the WebSocket handshake (~300–500 ms). Including that in your TTFA measurement makes every other change look smaller than it is. Always pre-connect at startup and start the clock after the connection is open.
Python

import time
from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()                     # pay the ~500 ms handshake here, once
assert client.is_connected()

start = time.perf_counter()
for chunk in client.tts.stream(
    text="Hello from KugelAudio.",
    model_id="kugel-2.5",
):
    if hasattr(chunk, "audio"):
        ttfa_ms = (time.perf_counter() - start) * 1000
        print(f"TTFA: {ttfa_ms:.1f} ms")
        break

JavaScript / TypeScript

import { KugelAudio } from 'kugelaudio';

const client = new KugelAudio({ apiKey: '...' });
await client.connect();              // pay the ~500 ms handshake here, once

const start = performance.now();
let first = true;
await client.tts.stream(
  { text: 'Hello from KugelAudio.', modelId: 'kugel-2.5' },
  {
    onChunk: () => {
      if (first) {
        console.log(`TTFA: ${(performance.now() - start).toFixed(1)} ms`);
        first = false;
      }
    },
  },
);

What to measure

Always report p50 and p95 over at least 20 warm requests, not single-shot numbers. TTFA has a long tail, so the median on its own hides the slow requests that the p95 exposes.
| Metric | What it tells you |
| --- | --- |
| Inference TTFA | Server-side only. Useful for comparing model / voice / parameter changes against a fixed network. |
| End-to-end TTFA | What the user actually feels. Includes network RTT + connect cost (if not pre-warmed) + normalizer + first chunk. |
| p50 / p95 / p99 | Always over ≥ 20 warm requests; one-shot timings are meaningless. |
| Chunk-to-chunk gap | After the first chunk, how long between subsequent chunks. Spikes here mean the network or playback buffer can’t keep up, not the model. |
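
The percentile and gap metrics in the table can be computed with the Python standard library alone. This is a minimal sketch: summarize_ttfa and chunk_gaps are hypothetical helper names, and the "inclusive" quantile method is one reasonable choice among several, so state whichever method you use when you report numbers.

```python
import statistics


def summarize_ttfa(samples_ms: list[float]) -> dict[str, float]:
    """Return p50 / p95 / p99 for a list of warm TTFA samples (ms)."""
    if len(samples_ms) < 20:
        raise ValueError("need at least 20 warm samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def chunk_gaps(arrival_ms: list[float]) -> list[float]:
    """Gaps between consecutive audio chunks, given arrival times in ms."""
    return [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]
```

Feed summarize_ttfa the per-request TTFA values and chunk_gaps the per-chunk arrival timestamps of a single request; a spike in the gap list points at the network or playback side rather than the first-chunk latency.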

Reference benchmark

The Java SDK ships a complete TTFA bench you can run against any endpoint (cloud or self-hosted):
cd sdks/java/benchmark
./gradlew run
TTFABench measures:
  • Cold TTFA (first request, includes handshake) vs pooled TTFA (subsequent requests, connection reused) — quantifies what pre-connecting saves you.
  • TTFA across chunking strategies (full-text, sentence, ≥20-char, clause, word) so you can see the cost of small flushes on your network.
  • RTF on long-form text.
Run it from inside your VPC or your customer’s region to get numbers that match what you’ll ship.

Common TTFA reporting mistakes

  • Including handshake in TTFA. ~300–500 ms of cold-start that has nothing to do with the model. Pre-connect first.
  • Measuring against localhost. No realistic network RTT. Numbers will be 30–80 ms lower than production.
  • Single-shot timings. Cold caches, GC pauses, JIT, scheduler jitter — p95 over 20+ warm requests or it’s noise.
  • Mixing inference TTFA and end-to-end TTFA. Decide which one you’re reporting and label it. Comparing one to the other across vendors is how people end up with wrong “we’re slower than X” conclusions.
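
A small harness can rule out two of these mistakes mechanically: it discards warm-up requests and forces every sample set to carry a label. This is a sketch, assuming you supply measure_once, a callable of your own that performs one request and returns its TTFA in milliseconds; TtfaSamples and collect_warm_samples are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Callable, List

VALID_LABELS = ("inference", "end-to-end")


@dataclass
class TtfaSamples:
    label: str               # which TTFA this is; never mix the two
    samples_ms: List[float]  # warm samples only


def collect_warm_samples(
    measure_once: Callable[[], float],
    label: str,
    warmup: int = 3,
    runs: int = 20,
) -> TtfaSamples:
    """Run warm-up requests, discard them, then record `runs` samples."""
    if label not in VALID_LABELS:
        raise ValueError(f"label must be one of {VALID_LABELS}")
    if runs < 20:
        raise ValueError("use at least 20 warm requests")
    for _ in range(warmup):
        measure_once()  # result intentionally discarded (cold caches, JIT)
    return TtfaSamples(label, [measure_once() for _ in range(runs)])
```

Keeping the label next to the samples makes it harder to compare an inference number against an end-to-end number by accident.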

Common mistakes

  • Per-segment flush=true. Every flush is a fresh TTS request that pays the full model TTFA. If you flush after every sentence, you pay it N times per turn instead of once.
  • One session per sentence. A new WebSocket handshake (~300–500 ms) plus a fresh model prefill, every sentence. Keep the same session open for the whole assistant turn; only close it when the turn ends.
  • Client-side sentence buffering before send. Unnecessary — the server already buffers tokens and chunks at sentence boundaries. Pre-buffering on the client just adds latency.
  • Calling send(text, flush=true) per word “for lower latency.” It is the opposite: each flush is a separate model call. Word-granular flushing produces the worst possible TTFA.

Migrating from ElevenLabs

ElevenLabs’ text_chunker flushes on every internal trigger, and the ElevenLabs WebSocket protocol is more forgiving of mid-stream flushes because each flush is comparatively cheap. KugelAudio’s /ws/tts/stream is not: each flush triggers a fresh model prefill. The mechanical translation, “flush=True on KugelAudio == flush=true on ElevenLabs”, is the single most common source of bad TTFA when porting an existing ElevenLabs integration. The right translation:
| ElevenLabs pattern | KugelAudio equivalent |
| --- | --- |
| send(text, flush=True) after every chunk | send(text) with no flush; let the server’s text buffer chunk. |
| try_trigger_generation=True | Default behavior. The server starts generation at sentence boundaries automatically. |
| auto_mode=true | Same name on KugelAudio (StreamConfig.auto_mode). |
| One context per turn | One StreamingSession per turn. |
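
If your existing call sites pass flush=True on every segment and you cannot change them all at once, one option is a thin wrapper that accepts the old call shape but defers the flush to the end of the turn. The sketch below is illustrative only: SessionLike is a hypothetical synchronous stand-in, not the KugelAudio SDK interface (in the Python example earlier on this page, session.send and session.flush are consumed with async for, so a real adapter would need to be adapted accordingly).

```python
from typing import Protocol


class SessionLike(Protocol):
    """Hypothetical minimal session interface used only for this sketch."""

    def send(self, text: str) -> None: ...
    def flush(self) -> None: ...


class TurnScopedSender:
    """Accepts send(text, flush=...) calls but flushes once per turn."""

    def __init__(self, session: SessionLike) -> None:
        self._session = session
        self.suppressed_flushes = 0  # mid-turn flush requests that were ignored

    def send(self, text: str, flush: bool = False) -> None:
        if flush:
            # Ignore the mid-turn flush request; the text is still forwarded.
            self.suppressed_flushes += 1
        if text:
            self._session.send(text)

    def end_turn(self) -> None:
        # The only flush of the turn: emits any trailing buffered text.
        self._session.flush()
```

The suppressed_flushes counter is there so you can log how many mid-turn flushes the ported code was still requesting, which helps find the call sites worth cleaning up.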
A dedicated ElevenLabs → KugelAudio migration guide is tracked separately; this section covers the streaming-protocol differences only.