This page is the canonical reference for using KugelAudio’s streaming TTS
(/ws/tts/stream) with an LLM-driven voice agent. Read it before wiring up a
streaming integration; the patterns below avoid the most common — and
most expensive — mistakes we see in production.
The shape of the rule is the same in every SDK:
- One session per LLM turn. Keep the same
StreamingSession open for
the entire assistant turn. Don’t open a new session per sentence.
- Send LLM tokens directly, without
flush=true. The server already
accumulates text and starts generating at natural sentence boundaries.
Every client-side flush=true is a separate TTS request that pays the
full model time-to-first-audio (TTFA) again and produces an audible gap.
- Flush exactly once at the end of the turn. This emits any trailing
text that hasn’t yet crossed a sentence boundary, then closes the
session.
If you’re migrating from ElevenLabs: KugelAudio is more flush-sensitive
than ElevenLabs because each explicit flush triggers a fresh model prefill.
Code that flushes after every segment will work, but TTFA per segment will
be dramatically worse than it needs to be.
Pre-warm the connection at startup
The first request after process start pays a one-time ~300–500 ms
WebSocket handshake. If you let that land inside the first user turn,
every TTFA number you report (and every user’s first impression) is
inflated by that amount.
Call client.connect() at startup so the handshake happens before any
user is waiting:
Python
from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()  # pay the ~500 ms handshake here, once
JavaScript / TypeScript
const client = new KugelAudio({ apiKey: '...' });
await client.connect(); // pay the ~500 ms handshake here, once
Java
KugelAudio client = KugelAudio.createConnected(
    KugelAudioOptions.builder("...").build()
); // connects synchronously before returning
After pre-connecting, every subsequent stream / streamingSession call
reuses the pooled connection. See Measuring TTFA below for how to
quantify what this saves you.
Why this matters
/ws/tts/stream is one logical TTS request per session, regardless of how
many send calls you make. The server’s text buffer accumulates tokens and
hands a complete chunk to the model the moment it sees a natural boundary
(sentence punctuation, or the configured chunk_length_schedule threshold).
Inside a single session, model state (KV cache, voice conditioning) is
preserved across chunks so prosody stays natural.
Calling flush=true mid-turn breaks that flow: the server treats the flush
as a hard segment boundary, runs another full model prefill on whatever has
been buffered, and only then emits audio. The cost of that prefill is the
full model TTFA — the same cost you pay on the very first chunk of a
session. Do it on every word and you pay model TTFA on every word.
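To make the cost model concrete, here is a toy simulation of the buffering behaviour described above. It is a sketch under stated assumptions, not KugelAudio's server code: `ToyTextBuffer`, its sentence regex, and `count_model_calls` are hypothetical names, and every emitted segment is assumed to stand for one model call.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace or end of buffer.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


class ToyTextBuffer:
    """Toy stand-in for a server-side text buffer (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._buf = ""

    def send(self, text: str, flush: bool = False) -> list[str]:
        """Append text and return the segments that would be handed to the model."""
        self._buf += text
        segments = []
        # Emit every complete sentence currently sitting in the buffer.
        while (match := SENTENCE_END.search(self._buf)) is not None:
            segments.append(self._buf[: match.end()].strip())
            self._buf = self._buf[match.end():]
        # An explicit flush forces out whatever is left, even a fragment.
        if flush and self._buf.strip():
            segments.append(self._buf.strip())
            self._buf = ""
        return segments


def count_model_calls(tokens: list[str], flush_every_token: bool) -> int:
    """Count emitted segments for one turn, including the final flush."""
    buffer = ToyTextBuffer()
    calls = 0
    for token in tokens:
        calls += len(buffer.send(token, flush=flush_every_token))
    calls += len(buffer.send("", flush=True))  # single end-of-turn flush
    return calls


tokens = ["Hel", "lo", " wor", "ld.", " How", " are", " you"]
print(count_model_calls(tokens, flush_every_token=False))  # 2
print(count_model_calls(tokens, flush_every_token=True))   # 7
```

With the same seven tokens, the no-flush path produces two sentence-sized segments, while flushing per token produces seven fragments, each of which would pay model TTFA under the behaviour described above.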
Pattern (correct)
Python
import asyncio

from openai import AsyncOpenAI
from kugelaudio import KugelAudio

openai = AsyncOpenAI()
kugel = KugelAudio(api_key="YOUR_API_KEY")

# play_audio() is your application's playback function, not part of the SDK.

async def speak_turn(user_message: str) -> None:
    llm = await openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )
    async with kugel.tts.streaming_session(
        voice_id=1071,
        model_id="kugel-2.5",
        language="en",
    ) as session:
        # Forward every LLM token directly. No flush=True per token,
        # no client-side sentence buffering; the server handles that.
        async for chunk in llm:
            token = chunk.choices[0].delta.content
            if not token:
                continue
            async for audio in session.send(token):
                play_audio(audio.audio)
        # Single flush at the end of the turn: emits any trailing
        # text that hasn't yet crossed a sentence boundary.
        async for audio in session.flush():
            play_audio(audio.audio)

asyncio.run(speak_turn("Tell me a short story."))
JavaScript / TypeScript
import { KugelAudio } from 'kugelaudio';
import OpenAI from 'openai';

const openai = new OpenAI();
const kugel = new KugelAudio({ apiKey: 'YOUR_API_KEY' });

async function speakTurn(userMessage: string): Promise<void> {
  const session = kugel.tts.streamingSession(
    { voiceId: 1071, modelId: 'kugel-2.5', language: 'en' },
    { onChunk: (chunk) => playAudio(chunk.audio) },
  );
  await session.connect();

  const llm = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  });

  // Forward every LLM token directly. No flush per token;
  // the server accumulates and chunks at sentence boundaries.
  for await (const chunk of llm) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) session.send(token);
  }

  // close() triggers the server-side final flush of trailing text,
  // streams the resulting audio through onChunk, then closes the WS.
  await session.close();
}
Java
StreamConfig config = StreamConfig.builder()
    .voiceId(1071)
    .modelId("kugel-2.5")
    .language("en")
    .build();

try (StreamingSession session = client.streamingSession(config, new StreamCallbacks() {
    @Override
    public void onChunk(AudioChunk chunk) {
        playAudio(chunk.getAudio());
    }
})) {
    // Forward every LLM token directly; no client-side buffering.
    for (String token : llmTokenStream) {
        session.send(token); // flush=false (default)
    }
    // Single flush at turn end: emits any trailing text.
    session.flush();
}
Chunk-size ordering — pick the largest you can
If you’re driving the session from a layer above raw LLM tokens (for
example, a translation pipeline that emits clauses, or a router that
batches output before sending), use the largest chunks you can. The
ordering, from best to worst time-to-first-audio per emitted segment, is:
| Chunk granularity | Verdict |
|---|---|
| Full turn in one send | Best possible. Use when the full text is available before TTS starts. |
| Sentence-level chunks | Recommended for streamed LLM output. |
| ≥20-character chunks | Acceptable fallback when sentence boundaries aren’t yet available. |
| Clause-level (comma/semicolon) | Avoid. Each chunk pays model TTFA. |
| Word-level or sub-word | Don’t. Each chunk pays model TTFA — by far the most expensive shape. |
Two important nuances:
- Raw LLM tokens are fine as long as you
send them without
flush=true — the server’s text buffer reassembles them and only
hands sentence-sized work to the model. The “word-level is bad” row
above applies when you flush after each word, not when you
send one word at a time without flushing.
- We deliberately don’t publish exact ms figures here — they depend on
region, voice, model, and GPU. The ordering is stable; the absolute
numbers aren’t. If you want to reproduce the comparison for your
own deployment, run the Java benchmark
(TTFABench.chunkingStrategyBench) against your endpoint.
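As a rough illustration of why the ordering holds, the sketch below counts how many flushed segments each granularity produces for the same text. It assumes, as stated above, that every flushed segment pays one model TTFA; the function name `count_flushed_segments` and the splitting rules are hypothetical and only approximate real sentence or clause detection.

```python
import re


def count_flushed_segments(text: str) -> dict[str, int]:
    """Count segments per granularity (each one assumed to cost one model TTFA)."""
    stripped = text.strip()
    sentences = re.split(r"(?<=[.!?])\s+", stripped)
    clauses = re.split(r"(?<=[.!?,;])\s+", stripped)
    words = stripped.split()

    # Greedy grouping of words until a group reaches at least 20 characters.
    min_chars, groups, current = 20, 0, ""
    for word in words:
        current = f"{current} {word}" if current else word
        if len(current) >= min_chars:
            groups += 1
            current = ""
    if current:
        groups += 1  # trailing remainder shorter than 20 characters

    return {
        "full_turn": 1,
        "sentence": len(sentences),
        "min_20_chars": groups,
        "clause": len(clauses),
        "word": len(words),
    }


print(count_flushed_segments("Hi there, friend. How are you today?"))
# {'full_turn': 1, 'sentence': 2, 'min_20_chars': 2, 'clause': 3, 'word': 7}
```

The counts grow as the granularity shrinks, which is the same ordering the table gives; absolute latency per segment still has to be measured on your own deployment.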
Tuning server-side auto-chunking
You rarely need this, but two StreamConfig parameters let you trade
prosody context for lower first-chunk latency, without any client-side
flushing:
| Parameter | Type | Default | Effect |
|---|---|---|---|
| chunk_length_schedule | list[int] | [5, 80, 150, 250] | Minimum buffer size (chars) before each successive chunk auto-emits. |
| auto_mode | bool | false | Start at the first clean sentence boundary (equivalent to ElevenLabs auto_mode=true). |
Use the defaults unless you’ve measured a problem.
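One plausible reading of chunk_length_schedule can be sketched as follows. This is a toy model, not the server's algorithm: `threshold_for` and `toy_auto_chunk` are hypothetical helpers, the rule that the last schedule entry applies to all later chunks is an assumption, and the sketch deliberately ignores sentence boundaries so that only the length thresholds are visible.

```python
DEFAULT_SCHEDULE = [5, 80, 150, 250]  # default value from the parameter table


def threshold_for(chunk_index: int, schedule: list[int] = DEFAULT_SCHEDULE) -> int:
    """Minimum buffered chars before chunk number `chunk_index` (0-based) may emit.

    Assumption: the last schedule entry applies to every later chunk.
    """
    return schedule[min(chunk_index, len(schedule) - 1)]


def toy_auto_chunk(text: str, schedule: list[int] = DEFAULT_SCHEDULE) -> list[str]:
    """Split text on word boundaries once the buffer reaches the current threshold."""
    chunks: list[str] = []
    buffered = ""
    for word in text.split():
        buffered = f"{buffered} {word}" if buffered else word
        if len(buffered) >= threshold_for(len(chunks), schedule):
            chunks.append(buffered)
            buffered = ""
    if buffered:
        chunks.append(buffered)  # end-of-turn flush of the remainder
    return chunks


print(toy_auto_chunk("Hello aaa bbb ccc ddd", schedule=[5, 10]))
# ['Hello', 'aaa bbb ccc', 'ddd']
```

Under this reading, a small first entry lets the first chunk start early, while the larger later entries give the model more context per chunk.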
Measuring TTFA
Time-to-first-audio (TTFA) is the metric that matters for voice agents. Measure it correctly or you’ll chase the wrong bottleneck.
Pre-connect, then measure
The first request after process start includes the WebSocket handshake (~300–500 ms). Including that in your TTFA measurement makes every other change look smaller than it is. Always pre-connect at startup and start the clock after the connection is open.
Python
import time

from kugelaudio import KugelAudio

client = KugelAudio(api_key="...")
client.connect()  # pay the ~500 ms handshake here, once
assert client.is_connected()

start = time.perf_counter()
for chunk in client.tts.stream(
    text="Hello from KugelAudio.",
    model_id="kugel-2.5",
):
    if hasattr(chunk, "audio"):
        ttfa_ms = (time.perf_counter() - start) * 1000
        print(f"TTFA: {ttfa_ms:.1f} ms")
        break
JavaScript / TypeScript
import { KugelAudio } from 'kugelaudio';

const client = new KugelAudio({ apiKey: '...' });
await client.connect(); // pay the ~500 ms handshake here, once

const start = performance.now();
let first = true;
await client.tts.stream(
  { text: 'Hello from KugelAudio.', modelId: 'kugel-2.5' },
  {
    onChunk: () => {
      if (first) {
        console.log(`TTFA: ${(performance.now() - start).toFixed(1)} ms`);
        first = false;
      }
    },
  },
);
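A single timing like the ones above is only one sample. A minimal harness for collecting many warm samples might look like the sketch below. `collect_warm_ttfa` is a hypothetical helper, and `wait_for_first_audio` stands for any callable of yours that issues one request on an already-open connection and returns as soon as the first audio chunk arrives; here it is replaced by a fake so the sketch runs without a network.

```python
import time
from typing import Callable


def collect_warm_ttfa(
    wait_for_first_audio: Callable[[], None],
    runs: int = 20,
    warmup: int = 2,
) -> list[float]:
    """Return TTFA samples in milliseconds, discarding the warm-up calls."""
    for _ in range(warmup):
        wait_for_first_audio()  # not timed: absorbs cold caches and JIT effects
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        wait_for_first_audio()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


# Stand-in for a real request; replace with a call into your TTS client.
call_log: list[int] = []


def fake_request() -> None:
    call_log.append(1)


samples = collect_warm_ttfa(fake_request)
print(len(samples), len(call_log))  # 20 22
```

Keeping the warm-up calls out of the sample list is what separates warm TTFA from cold-start cost.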
What to measure
Always report p50 and p95 over at least 20 warm requests, not single-shot numbers. TTFA has a long tail, so the median alone understates what users experience; the p95 shows it.
| Metric | What it tells you |
|---|---|
| Inference TTFA | Server-side only — useful for comparing model / voice / parameter changes against a fixed network. |
| End-to-end TTFA | What the user actually feels. Includes network RTT + connect cost (if not pre-warmed) + normalizer + first chunk. |
| p50 / p95 / p99 | Always over ≥ 20 warm requests; one-shot timings are meaningless. |
| Chunk-to-chunk gap | After the first chunk, how long between subsequent chunks. Spikes here mean the network or playback buffer can’t keep up, not the model. |
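The percentile and gap metrics in the table can be computed with a few lines of standard-library Python. This is a sketch, not part of any SDK: `percentile`, `summarize_ttfa`, and `chunk_gaps_ms` are hypothetical helpers, and the nearest-rank percentile definition is one common choice among several.

```python
import math


def percentile(samples_ms: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_ttfa(samples_ms: list[float], min_samples: int = 20) -> dict[str, float]:
    """Report p50/p95/p99, refusing to summarize too few warm samples."""
    if len(samples_ms) < min_samples:
        raise ValueError(f"need at least {min_samples} warm samples, got {len(samples_ms)}")
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


def chunk_gaps_ms(arrival_times_s: list[float]) -> list[float]:
    """Gaps between consecutive chunk arrival timestamps, in milliseconds."""
    return [(b - a) * 1000 for a, b in zip(arrival_times_s, arrival_times_s[1:])]


print(summarize_ttfa([float(i) for i in range(1, 21)]))
# {'p50': 10.0, 'p95': 19.0, 'p99': 20.0}
print(chunk_gaps_ms([0.0, 0.5, 1.0]))
# [500.0, 500.0]
```

Raising on fewer than 20 samples makes it harder to publish a one-shot number by accident.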
Reference benchmark
The Java SDK ships a complete TTFA bench you can run against any endpoint (cloud or self-hosted):
cd sdks/java/benchmark
./gradlew run
TTFABench measures:
- Cold TTFA (first request, includes handshake) vs pooled TTFA (subsequent requests, connection reused) — quantifies what pre-connecting saves you.
- TTFA across chunking strategies (full-text, sentence, ≥20-char, clause, word) so you can see the cost of small flushes on your network.
- RTF on long-form text.
Run it from inside your VPC or your customer’s region to get numbers that match what you’ll ship.
Common TTFA reporting mistakes
- Including handshake in TTFA. ~300–500 ms of cold-start that has nothing to do with the model. Pre-connect first.
- Measuring against
localhost. No realistic network RTT. Numbers will be 30–80 ms lower than production.
- Single-shot timings. Cold caches, GC pauses, JIT, scheduler jitter — p95 over 20+ warm requests or it’s noise.
- Mixing inference TTFA and end-to-end TTFA. Decide which one you’re reporting and label it. Comparing one to the other across vendors is how people end up with wrong “we’re slower than X” conclusions.
Common mistakes
- Per-segment
flush=true. Every flush is a fresh TTS request that
pays the full model TTFA. If you flush after every sentence, you pay
it N times per turn instead of once.
- One session per sentence. A new WebSocket handshake (~200–300 ms)
plus a fresh model prefill, every sentence. Keep the same session open
for the whole assistant turn; only close it when the turn ends.
- Client-side sentence buffering before
send. Unnecessary — the
server already buffers tokens and chunks at sentence boundaries.
Pre-buffering on the client just adds latency.
- Calling
send(text, flush=true) per word “for lower latency.” It
is the opposite: each flush is a separate model call. Word-granular
flushing produces the worst possible TTFA.
Migrating from ElevenLabs
ElevenLabs’ text_chunker flushes on every internal trigger, and its
WebSocket protocol is more forgiving of mid-stream flushes because each
flush is comparatively cheap. KugelAudio’s /ws/tts/stream is not:
each flush triggers a fresh model prefill. The mechanical translation,
“flush=True on KugelAudio == flush=true on ElevenLabs”, is the
single most common source of bad TTFA when porting an existing
ElevenLabs integration.
The right translation:
| ElevenLabs pattern | KugelAudio equivalent |
|---|---|
| send(text, flush=True) after every chunk | send(text) with no flush; let the server’s text buffer chunk. |
| try_trigger_generation=True | Default behavior. The server starts generation at sentence boundaries automatically. |
| auto_mode=true | Same name on KugelAudio (StreamConfig.auto_mode). |
| One context per turn | One StreamingSession per turn. |
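The first row of the table can be expressed as a small rewrite step in a porting shim. The sketch below is illustrative only: `single_flush_plan` is a hypothetical helper, and representing the end-of-turn flush as an empty-text entry with flush set is an assumption made for the example, not a statement about either vendor's API.

```python
from typing import Iterable, Iterator, Tuple

SendCall = Tuple[str, bool]  # (text, flush) as issued by the existing integration


def single_flush_plan(calls: Iterable[SendCall]) -> Iterator[SendCall]:
    """Drop per-chunk flush flags and append exactly one end-of-turn flush."""
    for text, _flush in calls:
        if text:
            yield (text, False)  # forward the text, never flush mid-turn
    yield ("", True)  # the single flush that closes the turn


legacy_calls = [("Hello, ", True), ("world.", True), ("", True)]
print(list(single_flush_plan(legacy_calls)))
# [('Hello, ', False), ('world.', False), ('', True)]
```

Applying the plan per assistant turn keeps the existing call sites intact while removing the mid-turn flushes that cost a fresh prefill.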
A dedicated ElevenLabs → KugelAudio migration guide is tracked
separately; this section covers the streaming-protocol differences only.