Streaming Audio
Receive audio chunks as they are generated for lower latency:
# Synchronous streaming
for item in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-3",
):
    if hasattr(item, 'audio'):  # AudioChunk
        # Process audio chunk immediately
        print(f"Chunk {item.index}: {len(item.audio)} bytes, {item.samples} samples")
        # play_audio(item.audio)
    elif isinstance(item, dict) and item.get('final'):
        # Final stats
        print(f"Total duration: {item.get('dur_ms', 0):.0f}ms")
        print(f"Generation time: {item.get('gen_ms', 0):.0f}ms")
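If you want to keep the audio rather than play it, the chunks from the loop above can be appended to a WAV container as they arrive. The sketch below is a minimal illustration using only the standard library. It assumes the chunks are raw 16-bit mono PCM at 24000 Hz, which this guide does not confirm, and `write_wav` is a hypothetical helper, not part of the SDK.

```python
import io
import wave
from typing import Iterable


def write_wav(chunks: Iterable[bytes], target, sample_rate: int = 24000) -> int:
    """Write raw PCM chunks to a WAV container and return the frame count.

    Assumes 16-bit mono PCM; verify the format you actually receive.
    """
    frames = 0
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)            # mono (assumption)
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit (assumption)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)     # append each chunk as soon as it arrives
            frames += len(chunk) // 2
    return frames


# Stand-in data: two chunks of silence in place of real `item.audio` bytes.
buffer = io.BytesIO()
total_frames = write_wav([b"\x00\x00" * 240, b"\x00\x00" * 120], buffer)
print(total_frames)
```

In real use you would pass a generator that yields `item.audio` for every item that has an `audio` attribute, and a file path instead of the in-memory buffer.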
Async Streaming
For async applications:
import asyncio

async def generate_speech():
    async for item in client.tts.stream_async(
        text="Async streaming example.",
        model_id="kugel-3",
    ):
        if hasattr(item, 'audio'):
            # Process chunk
            pass

asyncio.run(generate_speech())
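Since the point of streaming is lower latency, it can help to measure how long the first item takes to arrive. The following sketch wraps any async iterator and reports the time to the first item and the total item count. `measure_first_item` and `fake_stream` are hypothetical names; `fake_stream` only stands in for `client.tts.stream_async(...)` so the example runs without network access.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def measure_first_item(stream: AsyncIterator) -> Tuple[Optional[float], int]:
    """Consume an async stream; return (seconds until first item, item count)."""
    start = time.perf_counter()
    first: Optional[float] = None
    count = 0
    async for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    return first, count


async def fake_stream():
    """Stand-in for a real audio stream: three items, 50 ms apart."""
    for i in range(3):
        await asyncio.sleep(0.05)
        yield i


first_s, n_items = asyncio.run(measure_first_item(fake_stream()))
print(f"first item after {first_s * 1000:.0f} ms, {n_items} items")
```

The helper counts every yielded item, so with a real stream the count may include non-audio items such as the final stats.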
LLM Integration: Streaming Sessions
Use a streaming session for real-time TTS when text arrives incrementally from an LLM (such as GPT-4 or Claude):
Async Streaming Session
import asyncio

async def stream_from_llm():
    # Simulate LLM token stream
    llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]

    async with client.tts.streaming_session(
        voice_id=1071,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of no input
    ) as session:
        # Send tokens as they arrive from LLM
        for token in llm_tokens:
            async for chunk in session.send(token):
                # Play audio chunk immediately
                play_audio(chunk.audio)

        # Flush any remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)

asyncio.run(stream_from_llm())
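The example above iterates over a plain list, whereas a real LLM client usually exposes tokens as an async iterator. The sketch below shows one way to forward such a stream, using only the `send()` and `flush()` calls shown in this guide. `FakeSession`, `FakeChunk`, `llm_tokens`, and `speak` are hypothetical stand-ins so the example runs offline; the fake's sentence-end behaviour is a simplification, not the server's actual logic.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, List


@dataclass
class FakeChunk:
    audio: bytes


class FakeSession:
    """Hypothetical stand-in that mimics only send() and flush()."""

    def __init__(self) -> None:
        self._buffer = ""

    async def send(self, text: str) -> AsyncIterator[FakeChunk]:
        self._buffer += text
        # Pretend synthesis starts once the buffered text ends a sentence.
        if self._buffer.rstrip().endswith((".", "!", "?")):
            async for chunk in self.flush():
                yield chunk

    async def flush(self) -> AsyncIterator[FakeChunk]:
        if self._buffer:
            text, self._buffer = self._buffer, ""
            yield FakeChunk(audio=text.encode("utf-8"))


async def llm_tokens() -> AsyncIterator[str]:
    """Stand-in for an LLM client's async token stream."""
    for token in ["Hello, ", "this ", "is ", "streamed. ", "Bye"]:
        await asyncio.sleep(0)
        yield token


async def speak(session, tokens: AsyncIterator[str], play: Callable[[bytes], None]) -> None:
    """Forward tokens to the session and play audio as soon as it is yielded."""
    async for token in tokens:
        async for chunk in session.send(token):
            play(chunk.audio)
    # The last token rarely ends on a boundary, so flush what is left.
    async for chunk in session.flush():
        play(chunk.audio)


played: List[bytes] = []
asyncio.run(speak(FakeSession(), llm_tokens(), played.append))
print(played)
```

With a real session, `play` would be your audio output function and `tokens` the stream returned by your LLM client.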
Synchronous Streaming Session
# llm_tokens is the same token list used in the async example
with client.tts.streaming_session_sync(voice_id=1071) as session:
    for token in llm_tokens:
        for chunk in session.send(token):
            play_audio(chunk.audio)

    for chunk in session.flush():
        play_audio(chunk.audio)
Session Reuse
End a session without closing the WebSocket to avoid reconnection overhead when starting a new session (see Turn lifecycle):
session = await client.tts.streaming_session(voice_id=1071)

# Session 1
async for chunk in session.send("Hello from voice one."):
    play_audio(chunk.audio)
await session.end_session()  # Keeps WebSocket open

# Session 2: no reconnection needed
session.update_config(voice_id=1072)
async for chunk in session.send("Hello from voice two."):
    play_audio(chunk.audio)
await session.close()  # Closes session + WebSocket
Barge-in (interrupt the current turn)
When the end user speaks over the agent, call cancel_current() to stop
generating the current turn immediately and drop any buffered/queued text —
without closing the WebSocket. Unlike end_session(), no remaining text
is flushed; the turn is abandoned. The socket stays open so the next
send() starts the next turn right away.
session = await client.tts.streaming_session(voice_id=1071)

async for chunk in session.send("This is a very long answer the user talks over"):
    play_audio(chunk.audio)

# VAD detected the user speaking, so barge in:
await session.cancel_current()

# Socket still open, so the next turn starts immediately:
async for chunk in session.send("Sure, what would you like instead?", flush=True):
    play_audio(chunk.audio)
cancel_current() returns once the server acknowledges, or after a short quiet
timeout if the server goes silent. Stop local playback as soon as you call it —
a few in-flight frames may arrive before the acknowledgement. See
Barge-in for the
full protocol. The synchronous wrapper exposes cancel_current() too.
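In practice the barge-in signal arrives while the playback loop is still running, so the loop itself needs a way out. One possible pattern is to check an `asyncio.Event` that your voice-activity detector sets, stop playing, and then call `cancel_current()`. The sketch below assumes that pattern; `FakeSession`, `speak_interruptible`, and the simulated VAD trigger are hypothetical and exist only to make the example runnable.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FakeChunk:
    audio: bytes


class FakeSession:
    """Hypothetical stand-in exposing send() and cancel_current() only."""

    def __init__(self) -> None:
        self.cancelled = False

    async def send(self, text: str, flush: bool = False):
        for word in text.split():
            await asyncio.sleep(0)  # hand control back to the event loop
            yield FakeChunk(audio=word.encode("utf-8"))

    async def cancel_current(self) -> None:
        self.cancelled = True


async def speak_interruptible(
    session, text: str, play: Callable[[bytes], None], user_speaking: asyncio.Event
) -> bool:
    """Play one turn; return True if a barge-in cut it short."""
    async for chunk in session.send(text):
        if user_speaking.is_set():
            # Stop playing first, then ask the server to abandon the turn.
            await session.cancel_current()
            return True
        play(chunk.audio)
    return False


async def demo():
    session = FakeSession()
    user_speaking = asyncio.Event()
    played: List[bytes] = []

    def play(audio: bytes) -> None:
        played.append(audio)
        if len(played) == 2:
            user_speaking.set()  # simulate the VAD firing after two chunks

    interrupted = await speak_interruptible(
        session, "one two three four", play, user_speaking
    )
    return interrupted, played, session.cancelled


interrupted, played, cancelled = asyncio.run(demo())
print(interrupted, played, cancelled)
```

This version only notices the event when the next chunk arrives. If chunks can be far apart, you may prefer to run playback in its own task and cancel it directly when the event fires.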
Streaming session reference
A session is created with streaming_session(...) (async) or
streaming_session_sync(...) (sync). Both accept the same configuration:
voice_id, model_id, cfg_scale, temperature, max_new_tokens,
sample_rate, flush_timeout_ms, normalize, language, word_timestamps,
speed, and an on_word_timestamps callback.
The async StreamingSession exposes:
| Method | Returns | Description |
|---|---|---|
| await session.connect() | None | Open and authenticate the WebSocket. Called automatically by the first send() and by async with. |
| session.send(text, flush=False) | AsyncIterator[AudioChunk] | Buffer text and yield audio as it is generated. flush=True forces synthesis of whatever is buffered. |
| session.flush() | AsyncIterator[AudioChunk] | Flush the buffer and yield remaining audio for the current turn. |
| session.drain() | AsyncIterator[AudioChunk] | Signal end-of-input and yield every remaining chunk until the server goes idle. |
| await session.end_session() | dict | End the current turn (flushing remaining text) but keep the WebSocket open for reuse. |
| await session.cancel_current() | None | Barge-in: abandon the current turn and drop buffered/queued text, keeping the socket open. |
| session.update_config(config=None, **kwargs) | None | Update configuration (e.g. voice_id) for the next session after end_session(). |
| await session.close() | dict | Close the session and the WebSocket. |
| session.last_word_timestamps | list[WordTimestamp] | The most recently received word timestamps. |
| session.last_final | dict \| None | End-of-audio stats from the most recently completed turn: the server’s {"final": true, ...} frame (ElevenLabs isFinal equivalent), sent after the turn’s last audio frame. None before the first turn completes; not updated on a barge-in cancel. |
| session.last_usage | SessionUsage \| None | Per-session usage (audio time + amount charged) from the most recently closed session, for billing your own customers per conversation. None before the first session closes. See SessionUsage. |
StreamingSessionSync mirrors the async API without await/async for:
send(), flush(), and drain() return list[AudioChunk]; cancel_current(),
close(), and the last_word_timestamps / last_final / last_usage
properties behave the same.
Tuning streaming latency
By default the server accumulates LLM tokens and only begins generating at
natural sentence boundaries. Tune how eagerly it starts with these
session parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| flush_timeout_ms | int | 500 | Server-side auto-flush timeout: emit buffered text after this many milliseconds of no new input. |
| chunk_length_schedule | list[int] \| None | server default [5, 80, 150, 250] | Minimum buffer size (characters) before each successive auto-chunk is emitted. Entry i applies to chunk i; the last value repeats. Smaller values lower time-to-first-audio; larger values improve prosody. |
| auto_mode | bool \| None | None | Start generating at the very first clean sentence boundary (equivalent to ElevenLabs’ auto_mode). Lowest TTFA, slightly less prosody context. |
| max_buffer_length | int | 1000 | Maximum characters buffered before a forced flush. |
| dictionary_ids | list[int] \| None | None | Per-session dictionary selection, applied to every turn. None = all active project dictionaries (language-filtered); [] = none; a list = exactly those (including inactive ones), bypassing the language filter. |
chunk_length_schedule, auto_mode, and max_buffer_length are set by
constructing a StreamConfig and passing it where a config is
accepted, or via session.update_config(...):
from kugelaudio.models import StreamConfig

session = await client.tts.streaming_session(voice_id=1071)
session.update_config(StreamConfig(
    voice_id=1071,
    auto_mode=True,
    chunk_length_schedule=[50, 100, 150, 250],  # low-latency schedule
))
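To build intuition for chunk_length_schedule, the rule "entry i applies to chunk i; the last value repeats" can be modelled in a few lines. The sketch below is a client-side illustration of that rule only. It is not the server's algorithm: it ignores sentence boundaries, flush_timeout_ms, and max_buffer_length, and both function names are hypothetical.

```python
from typing import List, Sequence


def min_chars_for_chunk(schedule: Sequence[int], chunk_index: int) -> int:
    """Entry i applies to chunk i; past the end, the last value repeats."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    return schedule[min(chunk_index, len(schedule) - 1)]


def simulate_chunks(text: str, schedule: Sequence[int]) -> List[str]:
    """Split text wherever the running buffer reaches the scheduled minimum.

    Illustration only: real chunking also respects sentence boundaries.
    """
    chunks: List[str] = []
    buffer = ""
    for char in text:
        buffer += char
        if len(buffer) >= min_chars_for_chunk(schedule, len(chunks)):
            chunks.append(buffer)
            buffer = ""
    if buffer:
        chunks.append(buffer)  # what a final flush would emit
    return chunks


default_schedule = [5, 80, 150, 250]
thresholds = [min_chars_for_chunk(default_schedule, i) for i in range(6)]
sizes = [len(c) for c in simulate_chunks("x" * 300, default_schedule)]
print(thresholds)  # [5, 80, 150, 250, 250, 250]
print(sizes)       # [5, 80, 150, 65]
```

The first chunk is emitted after only 5 characters, which is what keeps time-to-first-audio low, while later chunks wait for more context.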
Multi-Context Sessions
A multi-context session manages up to 20 independent audio-generation
contexts over a single WebSocket (see
limits). Each context has its own text buffer,
voice settings, and generation queue — useful for multi-speaker
conversations, pre-buffering one stream while another plays, or interleaving
audio for dynamic dialogue.
async with client.tts.multi_context_session(language="en") as session:
    # Create contexts, optionally with different voices
    await session.create_context("narrator", voice_id=1071)
    await session.create_context("character", voice_id=1072)

    # Send text to a specific context
    async for chunk in session.send("narrator", "The story begins."):
        play_audio(chunk.audio)

    async for chunk in session.send("character", "Hello there!", flush=True):
        play_audio(chunk.audio)

    # Drain remaining audio and close one context
    async for chunk in session.close_context("narrator"):
        play_audio(chunk.audio)
Create the session with multi_context_session(...):
| Parameter | Type | Default | Description |
|---|---|---|---|
| default_voice_id | int \| None | None | Default voice for contexts that don’t override it. |
| model_id | str \| None | None | Model to use. |
| sample_rate | int | 24000 | Output sample rate. |
| output_format | str \| None | None | Combined codec + rate token (pcm_8000, pcm_16000, pcm_22050, pcm_24000, ulaw_8000, alaw_8000). |
| cfg_scale | float | 2.0 | Guidance scale. |
| temperature | float \| None | None | Sampling variance. |
| max_new_tokens | int | 2048 | Maximum tokens per generation. |
| normalize | bool | True | Enable text normalization. |
| language | str \| None | None | Normalization language. |
| inactivity_timeout | float | 20.0 | Seconds before an idle context auto-closes. |
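The output_format token combines a codec and a sample rate, so it can be split on the last underscore when you need to configure a player or estimate bandwidth. The sketch below does that for the tokens listed in the table. The helper names are hypothetical, and the byte widths are assumptions: 16-bit samples for pcm (not confirmed by this guide) and 8-bit samples for ulaw and alaw, as in G.711.

```python
from typing import Tuple

# Tokens copied from the output_format row of the table above.
KNOWN_FORMATS = {
    "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000", "ulaw_8000", "alaw_8000",
}


def parse_output_format(token: str) -> Tuple[str, int]:
    """Split a token such as 'pcm_24000' into ('pcm', 24000)."""
    if token not in KNOWN_FORMATS:
        raise ValueError(f"unsupported output_format: {token!r}")
    codec, rate = token.rsplit("_", 1)
    return codec, int(rate)


def bytes_per_second(token: str) -> int:
    """Estimate the mono byte rate for a token."""
    codec, rate = parse_output_format(token)
    sample_width = 2 if codec == "pcm" else 1  # assumption: 16-bit PCM, 8-bit G.711
    return rate * sample_width


print(parse_output_format("ulaw_8000"))
print(bytes_per_second("pcm_24000"))
```

For telephony integrations, the 8000 Hz ulaw and alaw tokens are the ones that typically match what a phone carrier expects.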
MultiContextSession methods:
| Method | Returns | Description |
|---|---|---|
| await session.connect() | None | Open the WebSocket. Called automatically by async with. |
| await session.create_context(context_id, voice_id=None) | None | Create a context with an optional voice override. |
| session.send(context_id, text, flush=False, chunk_complete_idle_timeout=None) | AsyncIterator[AudioChunk] | Send text to a context and yield its audio. |
| session.flush(context_id) | AsyncIterator[AudioChunk] | Flush a context’s buffer. |
| session.close_context(context_id, immediate=False) | AsyncIterator[AudioChunk] | Close a context and drain its audio. immediate=True barges in, discarding buffered/queued text. |
| await session.keep_alive(context_id) | None | Reset a context’s inactivity timeout. |
| await session.close() | dict | Close the session and return stats. |
| session.get_word_timestamps(context_id) | list[WordTimestamp] | Latest word timestamps for a context. |
| session.usage_for(context_id) | SessionUsage \| None | Per-context usage (audio time + amount charged) for a closed context; each context is its own conversation. None until that context closes. See SessionUsage. |
| session.context_usage | dict[str, SessionUsage] | Map of context_id → usage for every context closed so far. |
| session.active_contexts | set[str] | The set of currently active context IDs. |
| session.session_id | str \| None | Server-assigned session ID. |
| session.is_alive | bool | Whether the underlying WebSocket is still usable for send(). |
Pass on_word_timestamps=callback to multi_context_session(...) to receive
(context_id, list[WordTimestamp]) as timestamps arrive.
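A callback for a multi-context session usually needs to keep timestamps separated by context. The sketch below shows one way to do that with a small callable object. It assumes the callback is invoked with two positional arguments, `context_id` and the timestamp list, which is one reading of the description above. `Word` is a hypothetical stand-in for WordTimestamp carrying only the fields used in this guide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass
class Word:
    """Stand-in for WordTimestamp with only word, start_ms, and end_ms."""
    word: str
    start_ms: int
    end_ms: int


class TimestampCollector:
    """Callable that groups incoming timestamps by context ID."""

    def __init__(self) -> None:
        self.by_context: DefaultDict[str, List[Word]] = defaultdict(list)

    def __call__(self, context_id: str, timestamps: List[Word]) -> None:
        self.by_context[context_id].extend(timestamps)

    def transcript(self, context_id: str) -> str:
        return " ".join(w.word for w in self.by_context[context_id])


collector = TimestampCollector()
# Real use (assumed): client.tts.multi_context_session(on_word_timestamps=collector)
collector("narrator", [Word("The", 0, 120), Word("story", 120, 480)])
collector("character", [Word("Hello", 0, 300)])
collector("narrator", [Word("begins.", 480, 900)])
print(collector.transcript("narrator"))
```

Keeping the state in an object rather than a global makes it easy to create one collector per session.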
Word Timestamps in Streaming
Word timestamps work with all streaming methods. During streaming, they are yielded as list[WordTimestamp] objects between audio chunks:
from kugelaudio.models import WordTimestamp

for item in client.tts.stream(
    text="Hello, how are you today?",
    model_id="kugel-3",
    word_timestamps=True,
):
    if hasattr(item, 'audio'):  # AudioChunk
        play_audio(item.audio)
    elif isinstance(item, list) and item and isinstance(item[0], WordTimestamp):
        for ts in item:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")
Word Timestamps in Streaming Sessions
Request word-level time alignments alongside audio. Timestamps are delivered per chunk after the corresponding audio data:
async with client.tts.streaming_session(
    voice_id=1071,
    word_timestamps=True,
) as session:
    async for chunk in session.send("Hello, how are you today?"):
        play_audio(chunk.audio)

    async for chunk in session.flush():
        play_audio(chunk.audio)

    # Access the latest word timestamps
    timestamps = session.last_word_timestamps
    for ts in timestamps:
        print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")
You can also register a callback to process timestamps as they arrive:
def on_timestamps(timestamps):
    for ts in timestamps:
        print(f"  {ts.word} [{ts.start_ms}-{ts.end_ms}ms]")

async with client.tts.streaming_session(
    voice_id=1071,
    on_word_timestamps=on_timestamps,
) as session:
    async for chunk in session.send("Hello world!"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
Word timestamps add no extra audio latency. They arrive shortly after the corresponding audio chunk (see Latency) and are useful for barge-in handling, subtitle synchronization, and lip-sync.
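As an example of the subtitle use case, word timestamps can be grouped into SubRip (SRT) cues. The sketch below is a minimal illustration that relies only on the word, start_ms, and end_ms fields shown in this guide. `Word` is a hypothetical stand-in for WordTimestamp, and the fixed four-words-per-cue grouping is an arbitrary choice made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Word:
    """Stand-in for WordTimestamp with only the fields used here."""
    word: str
    start_ms: int
    end_ms: int


def srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timecode, HH:MM:SS,mmm."""
    hours, rest = divmod(int(ms), 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(words: Sequence[Word], words_per_cue: int = 4) -> str:
    """Group words into numbered cues spanning first start to last end."""
    cues: List[str] = []
    for n, i in enumerate(range(0, len(words), words_per_cue), start=1):
        group = words[i:i + words_per_cue]
        text = " ".join(w.word for w in group)
        start, end = srt_time(group[0].start_ms), srt_time(group[-1].end_ms)
        cues.append(f"{n}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


words = [
    Word("Hello,", 0, 300), Word("how", 300, 500), Word("are", 500, 700),
    Word("you", 700, 1200), Word("today?", 1200, 1600),
]
srt = to_srt(words)
print(srt)
```

In a streaming session you could call `to_srt` on the accumulated timestamps once the turn completes, or emit cues incrementally from the callback.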
Next steps