- Lower latency: First audio arrives in tens of milliseconds instead of waiting for full generation — see Latency for what to expect
- Better UX: Users hear audio immediately while more is being generated
- LLM integration: Process text token-by-token as it arrives from language models
The four rules
Streaming integrations live or die by these. Each links to the page that explains it in depth:
- One session per LLM turn. Keep the same streaming session open for the entire assistant turn, never one session per sentence. See Turn lifecycle.
- Send LLM tokens directly, without flushing. The server accumulates text and starts generating at natural sentence boundaries. Every client-side flush is a fresh model prefill. See Chunking & per-segment latency.
- Flush exactly once, at the end of the turn. This emits any trailing text, then ends the turn. See Turn lifecycle.
- Pre-connect at startup. Don’t pay the WebSocket handshake inside the first user interaction. See Latency.
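The four rules above can be sketched as a single turn handler. This is a minimal sketch, not the SDK's API: `RecordingSession`, `send_text`, and `flush` are hypothetical names standing in for whatever the real streaming session exposes, and the stub only records calls instead of opening a WebSocket.

```python
from typing import Iterable, List, Tuple


class RecordingSession:
    """Hypothetical stand-in for a streaming TTS session; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def send_text(self, text: str) -> None:
        # Assumed behavior: the server buffers text and chunks it itself.
        self.calls.append(("text", text))

    def flush(self) -> None:
        # Assumed behavior: emit any trailing text, then end the turn.
        self.calls.append(("flush", ""))


def speak_turn(session: RecordingSession, llm_tokens: Iterable[str]) -> None:
    """Stream one assistant turn over one session, flushing exactly once."""
    for token in llm_tokens:
        session.send_text(token)  # forward tokens as-is, no per-sentence flush
    session.flush()  # the single flush that closes the turn


# Rule 4: create (pre-connect) the session at startup, before the first turn.
session = RecordingSession()
speak_turn(session, ["Hello", " there.", " How", " can", " I", " help?"])
flush_count = sum(1 for kind, _ in session.calls if kind == "flush")
```

In a real integration the session object would come from the client library; the shape to keep is one `speak_turn` call per assistant turn, with the flush as its last step.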
Simple streaming
The simplest pattern: stream a complete text.
LLM token streaming
Stream text token-by-token as it arrives from an LLM. Let the server handle chunking at sentence boundaries; do not flush on every sentence from the client.
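To see why per-sentence client flushes are unnecessary, here is a rough sketch of sentence-boundary accumulation. It is an illustration only, not the server's actual algorithm: the `SentenceAccumulator` class and its punctuation rule are assumptions made for this example.

```python
import re
from typing import List

# Assumed boundary rule: '.', '!' or '?' followed by whitespace.
_BOUNDARY = re.compile(r"[.!?]\s")


class SentenceAccumulator:
    """Hypothetical model of server-side chunking at sentence boundaries."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, token: str) -> List[str]:
        """Add a token; return any sentences that are now complete."""
        self._buffer += token
        sentences: List[str] = []
        while True:
            match = _BOUNDARY.search(self._buffer)
            if match is None:
                break
            end = match.end()
            sentences.append(self._buffer[:end].strip())
            self._buffer = self._buffer[end:]
        return sentences

    def flush(self) -> List[str]:
        """End of turn: emit whatever trailing text is left."""
        rest = self._buffer.strip()
        self._buffer = ""
        return [rest] if rest else []


acc = SentenceAccumulator()
segments: List[str] = []
for token in ["Hi", " there.", " How", " are", " you?", " Fine"]:
    segments.extend(acc.feed(token))  # tokens go in unmodified
segments.extend(acc.flush())  # one flush picks up the trailing "Fine"
```

Under these assumptions, the token stream is grouped into whole sentences without the client ever deciding where a sentence ends, and the final flush is only needed for the trailing text.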
Complete agent turn
The full shape of one assistant turn, LLM to audio:
Spelling out text mid-stream
Use <spell> tags to spell out text letter by letter (requires normalize: true and an explicit language).
The server waits until the closing </spell> tag arrives before generating audio, and auto-closes incomplete tags if the stream ends unexpectedly. See Text processing for the full spell-tag reference.
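The buffering behavior described here can be illustrated with a small sketch. This is not the server's implementation: `SpellTagBuffer` is a hypothetical class that only shows the idea of holding text back while a `<spell>` tag is open and auto-closing it when the stream ends.

```python
from typing import List

OPEN, CLOSE = "<spell>", "</spell>"


def _partial_open_suffix(text: str) -> int:
    """Length of the longest suffix of text that is a proper prefix of OPEN."""
    for size in range(min(len(OPEN) - 1, len(text)), 0, -1):
        if OPEN.startswith(text[-size:]):
            return size
    return 0


class SpellTagBuffer:
    """Hypothetical buffer: releases text only when no spell tag is left open."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> List[str]:
        """Return the pieces of text that are safe to synthesize now."""
        self._pending += chunk
        ready: List[str] = []
        while self._pending:
            start = self._pending.find(OPEN)
            if start == -1:
                # No full opening tag; keep back a possibly split tag like "<sp".
                keep = _partial_open_suffix(self._pending)
                cut = len(self._pending) - keep
                if cut:
                    ready.append(self._pending[:cut])
                self._pending = self._pending[cut:]
                break
            if start > 0:
                ready.append(self._pending[:start])  # plain text before the tag
                self._pending = self._pending[start:]
            end = self._pending.find(CLOSE)
            if end == -1:
                break  # tag still open: wait for the closing tag
            end += len(CLOSE)
            ready.append(self._pending[:end])
            self._pending = self._pending[end:]
        return ready

    def finish(self) -> List[str]:
        """Stream ended: auto-close an open tag and release what is left.

        Simplification: a half-received closing tag is not repaired here.
        """
        rest, self._pending = self._pending, ""
        if rest.startswith(OPEN):
            rest += CLOSE
        return [rest] if rest else []


buf = SpellTagBuffer()
first = buf.feed("Your code is <sp")   # tag split across chunks
second = buf.feed("ell>AB")            # tag open, nothing released
third = buf.feed("12</spell> ok")      # tag closed, text released
```

The point of the sketch is the ordering: text inside an open tag is never released piecemeal, so a tag split across LLM tokens still reaches synthesis as one unit.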
Audio playback
Error handling
Going deeper
- Turn lifecycle: How turns start and end (flush, idle auto-flush, session reuse, usage)
- Chunking & per-segment latency: Chunk-size ordering, tuning auto-chunking, backpressure
- Barge-in: Cancel the current turn when the user interrupts
- Multi-context streaming: Up to 20 independent audio streams over one connection
- Word timestamps: Word-level time alignments alongside streaming audio
- WebSocket API reference: The full wire format, with every message type described field by field