Traditional TTS generates the entire audio before returning it. Streaming returns audio chunks as they’re generated, providing:
  • Lower latency: First audio arrives in tens of milliseconds instead of waiting for full generation — see Latency for what to expect
  • Better UX: Users hear audio immediately while more is being generated
  • LLM integration: Process text token-by-token as it arrives from language models

The four rules

Streaming integrations live or die by these. Each links to the page that explains it in depth:
  1. One session per LLM turn. Keep the same streaming session open for the entire assistant turn — never one session per sentence. See Turn lifecycle.
  2. Send LLM tokens directly, without flushing. The server accumulates text and starts generating at natural sentence boundaries. Every client-side flush is a fresh model prefill. See Chunking & per-segment latency.
  3. Flush exactly once, at the end of the turn. This emits any trailing text, then ends the turn. See Turn lifecycle.
  4. Pre-connect at startup. Don’t pay the WebSocket handshake inside the first user interaction. See Latency.
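
Rule 2 is easier to picture with a toy model of the accumulation step. The sketch below is not KugelAudio's server code: `SentenceAccumulator` and its boundary regex are hypothetical, and the real chunking logic is assumed to be more sophisticated. It only illustrates why a client can forward raw tokens and still get sentence-sized segments, and why one final flush is needed for trailing text.

```python
import re

# Hypothetical illustration only: a sentence ends at ., ! or ? followed by
# whitespace. The real server-side rules are not documented on this page.
_BOUNDARY = re.compile(r"[.!?]\s")


class SentenceAccumulator:
    """Buffers incoming tokens and releases text at sentence boundaries."""

    def __init__(self) -> None:
        self._buffer = ""

    def send(self, token: str) -> list[str]:
        """Add a token; return any sentences that are now complete."""
        self._buffer += token
        sentences = []
        while (match := _BOUNDARY.search(self._buffer)) is not None:
            end = match.end()
            sentences.append(self._buffer[:end].strip())
            self._buffer = self._buffer[end:]
        return sentences

    def flush(self) -> list[str]:
        """Release trailing text that never reached a boundary."""
        rest, self._buffer = self._buffer.strip(), ""
        return [rest] if rest else []


acc = SentenceAccumulator()
segments = []
for token in ["Hel", "lo the", "re. How", " are", " you? Bye"]:
    segments.extend(acc.send(token))
segments.extend(acc.flush())  # the single end-of-turn flush
print(segments)
```

In this model the tokens split words and sentences arbitrarily, yet the segments come out whole. The last fragment is only released by the final flush, which mirrors rule 3.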

Simple streaming

The simplest pattern — stream a complete text:
for chunk in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-3",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)

LLM token streaming

Stream text token-by-token as it arrives from an LLM. Let the server handle chunking at sentence boundaries — do not flush on every sentence from the client.
async def stream_from_llm(llm_response):
    async with client.tts.streaming_session(
        voice_id=1071,
        model_id="kugel-3",
        auto_mode=True,                         # start at first sentence boundary
        chunk_length_schedule=[50, 100, 150, 250],  # low-latency schedule
    ) as session:
        async for token in llm_response:
            async for chunk in session.send(token):
                play_audio(chunk.audio)

        # Flush remaining text — ends the turn
        async for chunk in session.flush():
            play_audio(chunk.audio)
Do not flush on every sentence from the client. Calling send(token, flush=True) per sentence bypasses the server’s semantic chunking and forces a cold model prefill on every segment, which makes latency worse, not better. Set auto_mode and chunk_length_schedule on the session (autoMode and chunkLengthSchedule where the API uses camelCase) and let the server decide boundaries. See Chunking & per-segment latency.

Complete agent turn

The full shape of one assistant turn, LLM to audio:
import asyncio
from openai import AsyncOpenAI
from kugelaudio import KugelAudio

openai = AsyncOpenAI()
kugel = KugelAudio(api_key="YOUR_API_KEY")

async def speak_turn(user_message: str) -> None:
    llm = await openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )

    async with kugel.tts.streaming_session(
        voice_id=1071,
        model_id="kugel-3",
        language="en",
    ) as session:
        # Forward every LLM token directly. No flush=True per token,
        # no client-side sentence buffering — the server handles that.
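        async for chunk in llm:
            # Defensive: some providers and stream options emit chunks
            # with an empty choices list or no text content.
            if not chunk.choices:
                continue
            token = chunk.choices[0].delta.content
            if not token:
                continue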
            async for audio in session.send(token):
                play_audio(audio.audio)

        # Single flush at the end of the turn — emits any trailing
        # text that hasn't yet crossed a sentence boundary.
        async for audio in session.flush():
            play_audio(audio.audio)

asyncio.run(speak_turn("Tell me a short story."))
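
The examples on this page call `play_audio` without defining it. If that function blocks while audio plays, the loops above stop reading tokens in the meantime. One way to avoid this is to hand chunks to a background task through a queue. The following is a minimal sketch under assumptions: `player`, the `sink` list, and the byte chunks are hypothetical stand-ins, not part of the SDK.

```python
import asyncio


async def player(queue: asyncio.Queue, sink: list) -> None:
    """Consume audio chunks in arrival order until a None sentinel arrives."""
    while (chunk := await queue.get()) is not None:
        # Placeholder for real output, e.g. writing to a sound device.
        sink.append(chunk)


async def main() -> list:
    played: list = []
    # Bounded queue: the producer waits if playback falls far behind.
    queue: asyncio.Queue = asyncio.Queue(maxsize=32)
    task = asyncio.create_task(player(queue, played))

    # Stand-in for `async for audio in session.send(token)`.
    for chunk in (b"\x00\x01", b"\x02\x03", b"\x04\x05"):
        await queue.put(chunk)  # returns quickly unless the queue is full

    await queue.put(None)  # end of turn: tell the player to stop
    await task
    return played


played = asyncio.run(main())
print(len(played))
```

A real device write that blocks would belong in `asyncio.to_thread` inside `player`, so the event loop stays free to forward tokens.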

Spelling out text mid-stream

Use <spell> tags to spell out text letter by letter (requires normalize: true and an explicit language):
text = "Contact us at <spell>hello@kugelaudio.com</spell> for help."

for chunk in client.tts.stream(
    text=text,
    model_id="kugel-3",
    normalize=True,
    language="en",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
When streaming token-by-token, spell tags that span multiple chunks are handled automatically: the server buffers text until the closing </spell> tag arrives before generating audio, and auto-closes incomplete tags if the stream ends unexpectedly. See Text processing for the full spell-tag reference.
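
To make the buffering behaviour concrete, here is a toy model of it. This is an illustration under assumptions, not the server's implementation: `SpellTagBuffer` is a hypothetical name, and the sketch ignores nesting and other edge cases. It holds text back while a `<spell>` tag is open (or while a tag is only partly received) and auto-closes an open tag when the stream ends.

```python
OPEN, CLOSE = "<spell>", "</spell>"


class SpellTagBuffer:
    """Holds back text while a <spell> tag is open across token boundaries."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, token: str) -> str:
        """Add a token; return the text that is safe to synthesize now."""
        self._pending += token
        if self._pending.count(OPEN) > self._pending.count(CLOSE):
            return ""  # inside an open tag: wait for the closing tag

        # Hold back a trailing fragment such as "<sp" that may become a tag.
        cut = self._pending.rfind("<")
        tail = self._pending[cut:] if cut != -1 else ""
        if tail and tail != CLOSE and (OPEN.startswith(tail) or CLOSE.startswith(tail)):
            ready, self._pending = self._pending[:cut], tail
        else:
            ready, self._pending = self._pending, ""
        return ready

    def close(self) -> str:
        """End of stream: auto-close an unterminated tag, release the rest."""
        if self._pending.count(OPEN) > self._pending.count(CLOSE):
            self._pending += CLOSE
        ready, self._pending = self._pending, ""
        return ready


# Tags split across four tokens still come out as one intact span.
buf = SpellTagBuffer()
tokens = ["Mail <sp", "ell>hi@", "x.io</spell", "> now"]
pieces = [p for t in tokens if (p := buf.feed(t))]
print(pieces)

# A stream that ends mid-tag gets the closing tag appended.
cut_off = SpellTagBuffer()
cut_off.feed("<spell>AB")
truncated = cut_off.close()
print(truncated)
```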

Audio playback

import { decodePCM16 } from 'kugelaudio';

const audioContext = new AudioContext();
let scheduledTime = audioContext.currentTime;

function playChunk(chunk: AudioChunk) {
  const float32Data = decodePCM16(chunk.audio);

  const audioBuffer = audioContext.createBuffer(
    1, // mono
    float32Data.length,
    chunk.sampleRate
  );
  audioBuffer.copyToChannel(float32Data, 0);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);

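  // Schedule chunks back-to-back. If playback has fallen behind
  // (e.g. after a network stall), restart from the current time
  // so later chunks do not overlap.
  scheduledTime = Math.max(scheduledTime, audioContext.currentTime);
  source.start(scheduledTime);
  scheduledTime += audioBuffer.duration;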
}
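
For readers working in Python, the same two steps (decode, then schedule back-to-back) can be sketched with the standard library. This assumes the chunk payload is raw little-endian signed 16-bit mono PCM, which the name decodePCM16 suggests but this page does not state. `decode_pcm16` and `schedule` are hypothetical helpers, not SDK functions.

```python
import struct


def decode_pcm16(data: bytes) -> list[float]:
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    count = len(data) // 2  # ignore a dangling odd byte, if any
    samples = struct.unpack(f"<{count}h", data[: count * 2])
    return [s / 32768.0 for s in samples]


def schedule(chunk_lengths: list[int], sample_rate: int, now: float = 0.0) -> list[float]:
    """Return gapless start times (seconds) for consecutive chunks."""
    starts = []
    t = now
    for n in chunk_lengths:
        starts.append(t)
        t += n / sample_rate  # the next chunk starts where this one ends
    return starts


decoded = decode_pcm16(struct.pack("<3h", 0, 16384, -32768))
starts = schedule([24000, 12000], sample_rate=24000)
print(decoded, starts)
```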

Error handling

import asyncio
import websockets

async def robust_streaming():
    max_retries = 3

    for attempt in range(max_retries):
        try:
            async for chunk in client.tts.stream_async(
                text="Hello!",
                model_id="kugel-3",
            ):
                if hasattr(chunk, 'audio'):
                    play_audio(chunk.audio)
            break  # Success

        except websockets.ConnectionClosed as e:
            if attempt < max_retries - 1:
                print(f"Connection closed, retrying... ({attempt + 1}/{max_retries})")
                await asyncio.sleep(1)
            else:
                raise

        except Exception as e:
            print(f"Streaming error: {e}")
            raise
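
The retry loop above waits a fixed second between attempts. A common refinement is exponential backoff with jitter, sketched here as a generic helper. `retry_async` and its parameters are hypothetical, and `ConnectionError` stands in for `websockets.ConnectionClosed` so the example runs without the SDK. Keep in mind that a retry re-sends the whole text, so audio that already played may be heard again.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    retry_on: tuple[type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Run `operation`, retrying with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Delay doubles each attempt; jitter spreads out reconnects.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable when max_attempts >= 1")


# Usage with a stand-in operation that fails twice, then succeeds.
calls = 0


async def flaky() -> str:
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError("connection closed")
    return "ok"


result = asyncio.run(retry_async(flaky, base_delay=0.01))
print(result, calls)
```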

Going deeper

  • Turn lifecycle: How turns start and end, covering flush, idle auto-flush, session reuse, and usage
  • Chunking & per-segment latency: Chunk-size ordering, tuning auto-chunking, backpressure
  • Barge-in: Cancel the current turn when the user interrupts
  • Multi-context streaming: Up to 20 independent audio streams over one connection
  • Word timestamps: Word-level time alignments alongside streaming audio
  • WebSocket API reference: The full wire format, with every message type, field by field