Streaming overview - KugelAudio

Traditional TTS generates the entire audio before returning it. Streaming returns audio chunks as they’re generated, providing:

Lower latency: First audio arrives in tens of milliseconds rather than after the full generation has finished; see Latency for what to expect
Better UX: Users hear audio immediately while more is being generated
LLM integration: Process text token-by-token as it arrives from language models

The four rules

Streaming integrations live or die by these. Each links to the page that explains it in depth:

One session per LLM turn. Keep the same streaming session open for the entire assistant turn — never one session per sentence. See Turn lifecycle.
Send LLM tokens directly, without flushing. The server accumulates text and starts generating at natural sentence boundaries. Every client-side flush forces a fresh model prefill. See Chunking & per-segment latency.
Flush exactly once, at the end of the turn. This emits any trailing text, then ends the turn. See Turn lifecycle.
Pre-connect at startup. Don’t pay the WebSocket handshake inside the first user interaction. See Latency.
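
To make rules 2 and 3 concrete, here is a toy model of sentence-boundary accumulation. It is not KugelAudio's server code: `SentenceAccumulator`, `segment_turn`, and the boundary regex are hypothetical, and the real server's logic is presumably more sophisticated. The sketch only illustrates why forwarding raw tokens is enough: segments fall out on their own once a sentence terminator followed by whitespace arrives, and one final flush emits the remainder.

```python
import re

# Hypothetical toy model, NOT the KugelAudio server implementation.
# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]\s+")


class SentenceAccumulator:
    def __init__(self) -> None:
        self._buffer = ""

    def send(self, token: str) -> list[str]:
        """Append a token and return any sentences that are now complete."""
        self._buffer += token
        segments = []
        while (match := _BOUNDARY.search(self._buffer)) is not None:
            segments.append(self._buffer[: match.end()].strip())
            self._buffer = self._buffer[match.end():]
        return segments

    def flush(self) -> list[str]:
        """Emit whatever is left; called once, at the end of the turn."""
        rest, self._buffer = self._buffer.strip(), ""
        return [rest] if rest else []


def segment_turn(tokens: list[str]) -> list[str]:
    """Feed tokens as they arrive, then flush exactly once."""
    acc = SentenceAccumulator()
    out: list[str] = []
    for token in tokens:
        out.extend(acc.send(token))
    out.extend(acc.flush())
    return out
```

Token boundaries do not need to line up with sentence boundaries in this model, which is the reason the client has no need to buffer sentences itself.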

Simple streaming

The simplest pattern — stream a complete text:

Python

for chunk in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-3",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)

JavaScript

await client.tts.stream(
  { text: 'Hello, this is streaming audio.', modelId: 'kugel-3' },
  {
    onChunk: (chunk) => playAudio(chunk.audio),
  }
);

Java

client.tts().stream(
    GenerateRequest.builder("Hello, this is streaming audio.")
        .modelId("kugel-3")
        .language("en")
        .build(),
    new StreamCallbacks() {
        @Override
        public void onChunk(AudioChunk chunk) {
            playAudio(chunk.getAudio());
        }
    }
);

cURL

# Stream audio and pipe to ffplay for real-time playback
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is streaming audio.",
    "model_id": "kugel-3"
  }' \
  --no-buffer | ffplay -f s16le -ar 24000 -ac 1 -nodisp -

# Or save to file
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is streaming audio.",
    "model_id": "kugel-3"
  }' \
  --output output.pcm
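
The saved `output.pcm` has no header, so most players will not open it directly. A minimal sketch for wrapping it in a WAV container with Python's standard `wave` module follows. It assumes the response is raw 16-bit little-endian mono PCM at 24 kHz, which is what the `ffplay` flags above imply; `pcm_to_wav` is a hypothetical helper name, and the defaults need adjusting if your output format differs.

```python
import wave


def pcm_to_wav(pcm_bytes: bytes, wav_path: str,
               sample_rate: int = 24000, channels: int = 1) -> float:
    """Write raw s16le PCM to a WAV file and return its duration in seconds."""
    sample_width = 2  # bytes per 16-bit sample (assumed format)
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    frames = len(pcm_bytes) // (sample_width * channels)
    return frames / sample_rate
```

To convert the file from the cURL example, read `output.pcm` in binary mode and pass its bytes to `pcm_to_wav` together with a target path such as `output.wav`.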

LLM token streaming

Stream text token-by-token as it arrives from an LLM. Let the server handle chunking at sentence boundaries — do not flush on every sentence from the client.

Python

async def stream_from_llm(llm_response):
    async with client.tts.streaming_session(
        voice_id=1071,
        model_id="kugel-3",
        auto_mode=True,                         # start at first sentence boundary
        chunk_length_schedule=[50, 100, 150, 250],  # low-latency schedule
    ) as session:
        async for token in llm_response:
            async for chunk in session.send(token):
                play_audio(chunk.audio)

        # Flush remaining text; this ends the turn
        async for chunk in session.flush():
            play_audio(chunk.audio)

JavaScript

const session = client.tts.streamingSession(
  {
    voiceId: 1071,
    modelId: 'kugel-3',
    autoMode: true,                         // start at first sentence boundary
    chunkLengthSchedule: [50, 100, 150, 250],  // low-latency schedule
  },
  {
    onChunk: (chunk) => playAudio(chunk.audio),
    onSessionClosed: (totalSecs) => console.log(`Done: ${totalSecs}s`),
  }
);

session.connect();

for await (const token of llmResponse) {
  session.send(token);
}

session.close();

Java

try (StreamingSession session = client.streamingSession(
        StreamConfig.builder()
            .voiceId(1071)
            .modelId("kugel-3")
            .autoMode(true)
            .chunkLengthSchedule(List.of(50, 100, 150, 250))
            .language("en")
            .build())) {

    // Stream tokens as they arrive from your LLM
    for (String token : llmResponse) {
        session.send(token, false);
    }
    // Close flushes the remaining buffer automatically
}

cURL

Token-by-token LLM streaming requires a persistent WebSocket connection, which is not supported by cURL. Use an SDK for this pattern, or connect to the raw WebSocket API with a WebSocket client like websocat.

Do not flush on every sentence from the client. Calling send(token, flush=True) per sentence bypasses the server’s semantic chunking, forces a cold model prefill on every segment, and makes latency worse, not better. Use autoMode / chunkLengthSchedule and let the server decide boundaries — see Chunking & per-segment latency.

Complete agent turn

The full shape of one assistant turn, LLM to audio:

import asyncio
from openai import AsyncOpenAI
from kugelaudio import KugelAudio

openai = AsyncOpenAI()
kugel = KugelAudio(api_key="YOUR_API_KEY")

async def speak_turn(user_message: str) -> None:
    llm = await openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )

    async with kugel.tts.streaming_session(
        voice_id=1071,
        model_id="kugel-3",
        language="en",
    ) as session:
        # Forward every LLM token directly. No flush=True per token,
        # no client-side sentence buffering — the server handles that.
        async for chunk in llm:
            # Some stream chunks carry no choices or an empty delta; skip them.
            if not chunk.choices:
                continue
            token = chunk.choices[0].delta.content
            if not token:
                continue
            async for audio in session.send(token):
                play_audio(audio.audio)

        # Single flush at the end of the turn — emits any trailing
        # text that hasn't yet crossed a sentence boundary.
        async for audio in session.flush():
            play_audio(audio.audio)

asyncio.run(speak_turn("Tell me a short story."))
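
Since first-audio latency is the main thing streaming buys you, it is worth measuring in your own pipeline. The sketch below is a generic wrapper and not part of the KugelAudio SDK; `FirstChunkTimer` and `watch` are hypothetical names. It passes items from any async iterator through unchanged and records how long after the start of the turn the first one arrived.

```python
import time
from typing import AsyncIterator, Optional, TypeVar

T = TypeVar("T")


class FirstChunkTimer:
    """Records the delay between creation (turn start) and the first chunk."""

    def __init__(self) -> None:
        self._start = time.perf_counter()
        self.first_chunk_s: Optional[float] = None
        self.chunks = 0

    async def watch(self, stream: AsyncIterator[T]) -> AsyncIterator[T]:
        """Yield items unchanged, noting when the first one arrives."""
        async for item in stream:
            if self.first_chunk_s is None:
                self.first_chunk_s = time.perf_counter() - self._start
            self.chunks += 1
            yield item
```

In the turn above, one way to use it would be to create the timer just before the LLM request and iterate over `timer.watch(session.send(token))` and `timer.watch(session.flush())` instead of the bare calls. One timer is shared across all calls, so `first_chunk_s` reflects the whole turn and includes LLM time to first token.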

Spelling out text mid-stream

Use <spell> tags to spell out text letter by letter (requires normalize: true and an explicit language):

text = "Contact us at <spell>hello@kugelaudio.com</spell> for help."

for chunk in client.tts.stream(
    text=text,
    model_id="kugel-3",
    normalize=True,
    language="en",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)

When streaming token-by-token, spell tags that span multiple chunks are handled automatically: the server buffers text until the closing </spell> tag arrives before generating audio, and auto-closes incomplete tags if the stream ends unexpectedly. See Text processing for the full spell-tag reference.
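
The buffering behaviour described above can be pictured with a small toy model. This is an illustration only, not the server's implementation: `SpellTagBuffer` is a hypothetical class that holds text back from an opening `<spell>` until the matching `</spell>` arrives, also holds a trailing fragment that might be the start of a split opening tag, and closes an unterminated tag when the stream ends.

```python
OPEN, CLOSE = "<spell>", "</spell>"


def _partial_open_suffix(text: str) -> int:
    """Length of a trailing fragment that could be the start of OPEN."""
    for k in range(len(OPEN) - 1, 0, -1):
        if text.endswith(OPEN[:k]):
            return k
    return 0


class SpellTagBuffer:
    """Hypothetical toy model of server-side spell-tag buffering."""

    def __init__(self) -> None:
        self._pending = ""

    def send(self, token: str) -> str:
        """Return the text that is safe to synthesize after this token."""
        self._pending += token
        ready = []
        while True:
            start = self._pending.find(OPEN)
            if start == -1:
                # No open tag: release all but a possibly split "<spell>".
                cut = len(self._pending) - _partial_open_suffix(self._pending)
                ready.append(self._pending[:cut])
                self._pending = self._pending[cut:]
                break
            end = self._pending.find(CLOSE, start)
            if end == -1:
                # Open tag without its close: release the text before it only.
                ready.append(self._pending[:start])
                self._pending = self._pending[start:]
                break
            end += len(CLOSE)
            ready.append(self._pending[:end])
            self._pending = self._pending[end:]
        return "".join(ready)

    def end(self) -> str:
        """Stream ended: auto-close an incomplete tag and release the rest."""
        rest, self._pending = self._pending, ""
        return rest + CLOSE if OPEN in rest else rest
```

The practical consequence is that audio for a spelled-out span cannot start until its closing tag has been sent, so long spans add latency in proportion to how slowly the tokens arrive.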

Audio playback

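In the browser, decode each PCM16 chunk and schedule it on a Web Audio AudioContext so that consecutive chunks play back to back:

import { decodePCM16 } from 'kugelaudio';

const audioContext = new AudioContext();
let scheduledTime = audioContext.currentTime;

function playChunk(chunk: AudioChunk) {
  const float32Data = decodePCM16(chunk.audio);

  const audioBuffer = audioContext.createBuffer(
    1, // mono
    float32Data.length,
    chunk.sampleRate
  );
  audioBuffer.copyToChannel(float32Data, 0);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);

  // Schedule playback. If chunks arrived late and the schedule fell behind
  // the audio clock, restart from the current time so buffers do not overlap.
  scheduledTime = Math.max(scheduledTime, audioContext.currentTime);
  source.start(scheduledTime);
  scheduledTime += audioBuffer.duration;
}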
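
For readers not working in the browser, the decoding step can be sketched in Python. This assumes `decodePCM16` performs the conventional conversion of signed 16-bit little-endian samples to floats in [-1.0, 1.0) by dividing by 32768; that is an assumption about the helper, not something this page states. `decode_pcm16` and `chunk_duration` are hypothetical names.

```python
import struct


def decode_pcm16(data: bytes) -> list[float]:
    """Convert s16le PCM bytes to floats in [-1.0, 1.0)."""
    count = len(data) // 2  # two bytes per sample; a trailing odd byte is ignored
    samples = struct.unpack(f"<{count}h", data[: count * 2])
    return [s / 32768.0 for s in samples]


def chunk_duration(data: bytes, sample_rate: int, channels: int = 1) -> float:
    """Duration in seconds of a PCM16 chunk."""
    return (len(data) // (2 * channels)) / sample_rate
```

The duration is the amount by which a gapless player advances its schedule after each chunk, which is the role `audioBuffer.duration` plays in the TypeScript example.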

Error handling

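Wrap the stream in a bounded retry loop so that a dropped connection is retried a few times before the error propagates:

import asyncio

import websockets

async def robust_streaming():
    max_retries = 3

    for attempt in range(max_retries):
        try:
            # Each attempt restarts the request from the beginning, so audio
            # that already played before a drop will be heard again.
            async for chunk in client.tts.stream_async(
                text="Hello!",
                model_id="kugel-3",
            ):
                if hasattr(chunk, 'audio'):
                    play_audio(chunk.audio)
            break  # Success

        except websockets.ConnectionClosed:
            if attempt < max_retries - 1:
                print(f"Connection closed, retrying... ({attempt + 1}/{max_retries})")
                await asyncio.sleep(1)
            else:
                raise

        except Exception as e:
            # Anything other than a dropped connection is not retried.
            print(f"Streaming error: {e}")
            raise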
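
The loop above waits a fixed second between attempts. A common alternative, shown here as a generic sketch and not as something the KugelAudio SDK provides, is exponential backoff with jitter, which spreads out reconnects when many clients drop at once. `backoff_delays` and `retry_with_backoff` are hypothetical helpers, and the default delays are arbitrary.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Upper bound for each retry delay: base * 2**n, capped at cap."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...],
    retries: int = 3,
    base: float = 0.5,
    cap: float = 8.0,
) -> T:
    """Run operation, retrying on the given exceptions with full jitter."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            # Full jitter: sleep a random time up to the current bound.
            await asyncio.sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

To use it with the example above, put one streaming attempt in a coroutine function and call `await retry_with_backoff(that_function, (websockets.ConnectionClosed,))`. As with the fixed-delay loop, each retry starts the request over.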

Going deeper

Turn lifecycle

How turns start and end — flush, idle auto-flush, session reuse, usage

Chunking & per-segment latency

Chunk-size ordering, tuning auto-chunking, backpressure

Barge-in

Cancel the current turn when the user interrupts

Multi-context streaming

Up to 20 independent audio streams over one connection

Word timestamps

Word-level time alignments alongside streaming audio

WebSocket API reference

The full wire format: every message type, field by field