
Streaming Audio

Receive audio chunks as they are generated for lower latency:
# Synchronous streaming
for item in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-3",
):
    if hasattr(item, 'audio'):  # AudioChunk
        # Process audio chunk immediately
        print(f"Chunk {item.index}: {len(item.audio)} bytes, {item.samples} samples")
        # play_audio(item.audio)
    elif isinstance(item, dict) and item.get('final'):
        # Final stats
        print(f"Total duration: {item.get('dur_ms', 0):.0f}ms")
        print(f"Generation time: {item.get('gen_ms', 0):.0f}ms")
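
The loop above can be wrapped in a small helper that accumulates the audio and keeps the final stats. The following is a minimal sketch, not part of the SDK: `FakeChunk` and `fake_stream` are stand-ins invented here so the example runs without a client, and the WAV parameters (16-bit mono PCM at 24 kHz) are assumptions that should be checked against the output format actually requested.

```python
import io
import wave
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Hypothetical stand-in for AudioChunk (only the fields used above)."""
    index: int
    audio: bytes
    samples: int


def fake_stream():
    """Hypothetical replacement for client.tts.stream(...)."""
    yield FakeChunk(index=0, audio=b"\x00\x00" * 240, samples=240)
    yield FakeChunk(index=1, audio=b"\x00\x00" * 120, samples=120)
    yield {"final": True, "dur_ms": 15.0, "gen_ms": 4.0}


def collect_stream(items):
    """Return (concatenated audio bytes, final stats dict or None)."""
    audio = bytearray()
    stats = None
    for item in items:
        if hasattr(item, "audio"):  # audio chunk
            audio.extend(item.audio)
        elif isinstance(item, dict) and item.get("final"):  # final stats
            stats = item
    return bytes(audio), stats


def pcm_to_wav(pcm, sample_rate=24000):
    """Wrap raw PCM in a WAV container. Assumes 16-bit mono samples."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


pcm, stats = collect_stream(fake_stream())
wav_bytes = pcm_to_wav(pcm)
print(f"{len(pcm)} PCM bytes, {stats['dur_ms']:.0f} ms reported, {len(wav_bytes)} WAV bytes")
```

Collecting everything gives up the latency benefit of streaming, so a helper like this is mainly useful for saving or inspecting output; for live playback, hand each chunk to the player as it arrives.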

Async Streaming

For async applications:
import asyncio

async def generate_speech():
    async for item in client.tts.stream_async(
        text="Async streaming example.",
        model_id="kugel-3",
    ):
        if hasattr(item, 'audio'):
            # Process chunk
            pass

asyncio.run(generate_speech())
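
Because the point of streaming is lower latency, it can help to measure how long the first chunk takes to arrive. The sketch below times an async stream generically; `FakeChunk` and `fake_stream_async` are hypothetical stand-ins for the SDK objects, and the assumption that the async stream yields the same item shapes as the synchronous one is not confirmed by the text above.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Hypothetical stand-in for AudioChunk."""
    audio: bytes


async def fake_stream_async():
    """Hypothetical replacement for client.tts.stream_async(...)."""
    for _ in range(3):
        await asyncio.sleep(0.01)  # pretend the server is generating
        yield FakeChunk(audio=b"\x00\x00" * 100)
    yield {"final": True}


async def measure_stream(items):
    """Return (seconds until first audio chunk, chunk count, total audio bytes)."""
    started = time.perf_counter()
    first_audio_at = None
    chunk_count = 0
    total_bytes = 0
    async for item in items:
        if hasattr(item, "audio"):
            if first_audio_at is None:
                first_audio_at = time.perf_counter() - started
            chunk_count += 1
            total_bytes += len(item.audio)
    return first_audio_at, chunk_count, total_bytes


ttfa, chunks, size = asyncio.run(measure_stream(fake_stream_async()))
print(f"first audio after {ttfa * 1000:.1f} ms, {chunks} chunks, {size} bytes")
```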

LLM Integration: Streaming Sessions

Use a streaming session for real-time TTS when the text itself arrives incrementally from an LLM (such as GPT-4 or Claude).

Async Streaming Session

import asyncio

async def stream_from_llm():
    # Simulate LLM token stream
    llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]
    
    async with client.tts.streaming_session(
        voice_id=1071,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of no input
    ) as session:
        # Send tokens as they arrive from LLM
        for token in llm_tokens:
            async for chunk in session.send(token):
                # Play audio chunk immediately
                play_audio(chunk.audio)
        
        # Flush any remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)

asyncio.run(stream_from_llm())
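
In the example above, audio is played inside the same loop that forwards tokens, so slow playback delays the next token. If playback is itself asynchronous, one option is to hand chunks to a separate player task through a queue. This is a sketch of that pattern only: `FakeSession`, `llm_tokens`, and `record` are hypothetical stand-ins, and the fake session's sentence-boundary behaviour is a simplification rather than a description of the server.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Hypothetical stand-in for AudioChunk."""
    audio: bytes


class FakeSession:
    """Hypothetical stand-in exposing only send() and flush()."""

    def __init__(self):
        self._buffer = ""

    async def send(self, text):
        self._buffer += text
        # Pretend audio is produced once the buffered text ends a sentence.
        if self._buffer.rstrip().endswith((".", "!", "?")):
            async for chunk in self.flush():
                yield chunk

    async def flush(self):
        if self._buffer:
            text, self._buffer = self._buffer, ""
            await asyncio.sleep(0)
            yield FakeChunk(audio=text.encode())


async def llm_tokens():
    """Hypothetical LLM token stream."""
    for token in ["Hello, ", "this ", "is ", "streamed. ", "Bye"]:
        await asyncio.sleep(0)
        yield token


async def speak(session, tokens, play):
    """Forward tokens to the session while a separate task plays the audio."""
    queue = asyncio.Queue()

    async def player():
        while True:
            chunk = await queue.get()
            if chunk is None:  # sentinel: no more audio
                return
            await play(chunk.audio)

    player_task = asyncio.create_task(player())
    try:
        async for token in tokens:
            async for chunk in session.send(token):
                queue.put_nowait(chunk)
        async for chunk in session.flush():
            queue.put_nowait(chunk)
    finally:
        queue.put_nowait(None)
        await player_task


played = []


async def record(audio):
    """Stand-in for an async audio player; it just records what it was given."""
    played.append(audio)


asyncio.run(speak(FakeSession(), llm_tokens(), record))
print(played)
```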

Synchronous Streaming Session

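# Same simulated token stream as in the async example
llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]

with client.tts.streaming_session_sync(voice_id=1071) as session:
    # Send tokens as they arrive and play audio as soon as it is returned
    for token in llm_tokens:
        for chunk in session.send(token):
            play_audio(chunk.audio)

    # Flush any remaining text
    for chunk in session.flush():
        play_audio(chunk.audio)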

Session Reuse

End a session without closing the WebSocket to avoid reconnection overhead when starting a new session (see Turn lifecycle):
session = await client.tts.streaming_session(voice_id=1071)

# Session 1
async for chunk in session.send("Hello from voice one."):
    play_audio(chunk.audio)
await session.end_session()  # Keeps WebSocket open

# Session 2 — no reconnection needed
session.update_config(voice_id=1072)
async for chunk in session.send("Hello from voice two."):
    play_audio(chunk.audio)

await session.close()  # Closes session + WebSocket

Barge-in (interrupt the current turn)

When the end user speaks over the agent, call cancel_current() to stop generating the current turn immediately and drop any buffered/queued text — without closing the WebSocket. Unlike end_session(), no remaining text is flushed; the turn is abandoned. The socket stays open so the next send() starts the next turn right away.
session = await client.tts.streaming_session(voice_id=1071)

async for chunk in session.send("This is a very long answer the user talks over"):
    play_audio(chunk.audio)

# VAD detected the user speaking — barge in:
await session.cancel_current()

# Socket still open — next turn starts immediately:
async for chunk in session.send("Sure, what would you like instead?", flush=True):
    play_audio(chunk.audio)

cancel_current() returns once the server acknowledges, or after a short quiet timeout if the server goes silent. Stop local playback as soon as you call it, because a few in-flight frames may arrive before the acknowledgement. See Barge-in for the full protocol. The synchronous wrapper exposes cancel_current() too.
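
In a real agent, the interruption arrives while the turn is still playing. One way to wire that up, sketched here under assumptions, is to check an `asyncio.Event` set by the voice-activity detector before playing each chunk. `FakeSession` is a hypothetical stand-in, and the VAD is simulated by setting the event from the playback callback after two chunks.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Hypothetical stand-in for AudioChunk."""
    audio: bytes


class FakeSession:
    """Hypothetical stand-in exposing send() and cancel_current()."""

    def __init__(self):
        self.cancelled = False
        self.cancel_calls = 0

    async def send(self, text, flush=False):
        self.cancelled = False  # a new turn starts
        for word in text.split():
            await asyncio.sleep(0)
            if self.cancelled:
                return
            yield FakeChunk(audio=word.encode())

    async def cancel_current(self):
        self.cancelled = True
        self.cancel_calls += 1


async def speak_until_interrupted(session, text, user_speaking, play):
    """Play one turn; return True if it finished, False if the user barged in."""
    interrupted = False
    async for chunk in session.send(text, flush=True):
        if user_speaking.is_set():
            interrupted = True
            break  # stop local playback first
        play(chunk.audio)
    if interrupted:
        await session.cancel_current()  # then abandon the turn on the server
    return not interrupted


async def main():
    session = FakeSession()
    user_speaking = asyncio.Event()
    played = []

    def play(audio):
        played.append(audio)
        if len(played) == 2:
            user_speaking.set()  # simulate the VAD firing mid-turn

    finished = await speak_until_interrupted(
        session, "one two three four", user_speaking, play
    )
    user_speaking.clear()
    await speak_until_interrupted(session, "ok", user_speaking, play)
    return finished, played, session.cancel_calls


finished, played, cancel_calls = asyncio.run(main())
print(finished, played, cancel_calls)
```

Chunks that arrive after the event is set are simply never played, which matches the advice to stop local playback without waiting for the acknowledgement.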

Streaming session reference

A session is created with streaming_session(...) (async) or streaming_session_sync(...) (sync). Both accept the same configuration: voice_id, model_id, cfg_scale, temperature, max_new_tokens, sample_rate, flush_timeout_ms, normalize, language, word_timestamps, speed, and an on_word_timestamps callback. The async StreamingSession exposes:

| Method | Returns | Description |
| --- | --- | --- |
| `await session.connect()` | `None` | Open and authenticate the WebSocket. Called automatically by the first send() and by async with. |
| `session.send(text, flush=False)` | `AsyncIterator[AudioChunk]` | Buffer text and yield audio as it is generated. flush=True forces synthesis of whatever is buffered. |
| `session.flush()` | `AsyncIterator[AudioChunk]` | Flush the buffer and yield remaining audio for the current turn. |
| `session.drain()` | `AsyncIterator[AudioChunk]` | Signal end-of-input and yield every remaining chunk until the server goes idle. |
| `await session.end_session()` | `dict` | End the current turn (flushing remaining text) but keep the WebSocket open for reuse. |
| `await session.cancel_current()` | `None` | Barge-in: abandon the current turn and drop buffered/queued text, keeping the socket open. |
| `session.update_config(config=None, **kwargs)` | `None` | Update configuration (e.g. voice_id) for the next session after end_session(). |
| `await session.close()` | `dict` | Close the session and the WebSocket. |
| `session.last_word_timestamps` | `list[WordTimestamp]` | The most recently received word timestamps. |
| `session.last_final` | `dict \| None` | End-of-audio stats from the most recently completed turn: the server’s {"final": true, ...} frame (ElevenLabs isFinal equivalent), sent after the turn’s last audio frame. None before the first turn completes; not updated on a barge-in cancel. |
| `session.last_usage` | `SessionUsage \| None` | Per-session usage (audio time + amount charged) from the most recently closed session, for billing your own customers per conversation. None before the first session closes. See SessionUsage. |

StreamingSessionSync mirrors the async API without await/async for: send(), flush(), and drain() return list[AudioChunk]; cancel_current(), close(), and the last_word_timestamps / last_final / last_usage properties behave the same.

Tuning streaming latency

By default the server accumulates LLM tokens and only begins generating at natural sentence boundaries. Tune how eagerly it starts with these session parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `flush_timeout_ms` | `int` | `500` | Server-side auto-flush timeout: emit buffered text after this many milliseconds of no new input. |
| `chunk_length_schedule` | `list[int] \| None` | server default `[5, 80, 150, 250]` | Minimum buffer size (characters) before each successive auto-chunk is emitted. Entry i applies to chunk i; the last value repeats. Smaller values lower time-to-first-audio; larger values improve prosody. |
| `auto_mode` | `bool \| None` | `None` | Start generating at the very first clean sentence boundary (equivalent to ElevenLabs’ auto_mode). Lowest TTFA, slightly less prosody context. |
| `max_buffer_length` | `int` | `1000` | Maximum characters buffered before a forced flush. |
| `dictionary_ids` | `list[int] \| None` | `None` | Per-session dictionary selection, applied to every turn. None = all active project dictionaries (language-filtered); [] = none; a list = exactly those (including inactive ones), bypassing the language filter. |

chunk_length_schedule, auto_mode, and max_buffer_length are set by constructing a StreamConfig and passing it where a config is accepted, or via session.update_config(...):
from kugelaudio.models import StreamConfig

session = await client.tts.streaming_session(voice_id=1071)
session.update_config(StreamConfig(
    voice_id=1071,
    auto_mode=True,
    chunk_length_schedule=[50, 100, 150, 250],  # low-latency schedule
))
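
The rule for chunk_length_schedule ("entry i applies to chunk i; the last value repeats") can be written out as a small lookup. This sketch only illustrates that rule; `min_buffer_for_chunk` is a hypothetical helper, not an SDK function, and it assumes chunks are counted from zero.

```python
def min_buffer_for_chunk(schedule, chunk_index):
    """Minimum buffered characters before auto-chunk `chunk_index` (0-based)."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    # Past the end of the schedule, the last entry keeps applying.
    return schedule[min(chunk_index, len(schedule) - 1)]


schedule = [5, 80, 150, 250]
print([min_buffer_for_chunk(schedule, i) for i in range(6)])
# → [5, 80, 150, 250, 250, 250]
```

Read this way, only the first entry affects time-to-first-audio; the later entries govern how much context each subsequent chunk gets.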

Multi-Context Sessions

A multi-context session manages up to 20 independent audio-generation contexts over a single WebSocket (see limits). Each context has its own text buffer, voice settings, and generation queue — useful for multi-speaker conversations, pre-buffering one stream while another plays, or interleaving audio for dynamic dialogue.
async with client.tts.multi_context_session(language="en") as session:
    # Create contexts, optionally with different voices
    await session.create_context("narrator", voice_id=1071)
    await session.create_context("character", voice_id=1072)

    # Send text to a specific context
    async for chunk in session.send("narrator", "The story begins."):
        play_audio(chunk.audio)

    async for chunk in session.send("character", "Hello there!", flush=True):
        play_audio(chunk.audio)

    # Drain remaining audio and close one context
    async for chunk in session.close_context("narrator"):
        play_audio(chunk.audio)

Create the session with multi_context_session(...):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `default_voice_id` | `int \| None` | `None` | Default voice for contexts that don’t override it. |
| `model_id` | `str \| None` | `None` | Model to use. |
| `sample_rate` | `int` | `24000` | Output sample rate. |
| `output_format` | `str \| None` | `None` | Combined codec + rate token (pcm_8000, pcm_16000, pcm_22050, pcm_24000, ulaw_8000, alaw_8000). |
| `cfg_scale` | `float` | `2.0` | Guidance scale. |
| `temperature` | `float \| None` | `None` | Sampling variance. |
| `max_new_tokens` | `int` | `2048` | Maximum tokens per generation. |
| `normalize` | `bool` | `True` | Enable text normalization. |
| `language` | `str \| None` | `None` | Normalization language. |
| `inactivity_timeout` | `float` | `20.0` | Seconds before an idle context auto-closes. |

MultiContextSession methods:

| Method | Returns | Description |
| --- | --- | --- |
| `await session.connect()` | `None` | Open the WebSocket. Called automatically by async with. |
| `await session.create_context(context_id, voice_id=None)` | `None` | Create a context with an optional voice override. |
| `session.send(context_id, text, flush=False, chunk_complete_idle_timeout=None)` | `AsyncIterator[AudioChunk]` | Send text to a context and yield its audio. |
| `session.flush(context_id)` | `AsyncIterator[AudioChunk]` | Flush a context’s buffer. |
| `session.close_context(context_id, immediate=False)` | `AsyncIterator[AudioChunk]` | Close a context and drain its audio. immediate=True barges in, discarding buffered/queued text. |
| `await session.keep_alive(context_id)` | `None` | Reset a context’s inactivity timeout. |
| `await session.close()` | `dict` | Close the session and return stats. |
| `session.get_word_timestamps(context_id)` | `list[WordTimestamp]` | Latest word timestamps for a context. |
| `session.usage_for(context_id)` | `SessionUsage \| None` | Per-context usage (audio time + amount charged) for a closed context (each context is its own conversation). None until that context closes. See SessionUsage. |
| `session.context_usage` | `dict[str, SessionUsage]` | Map of context_id → usage for every context closed so far. |
| `session.active_contexts` | `set[str]` | The set of currently active context IDs. |
| `session.session_id` | `str \| None` | Server-assigned session ID. |
| `session.is_alive` | `bool` | Whether the underlying WebSocket is still usable for send(). |

Pass on_word_timestamps=callback to multi_context_session(...) to receive (context_id, list[WordTimestamp]) as timestamps arrive.
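
Since an idle context auto-closes after inactivity_timeout seconds, a context that is waiting on slow upstream work may need periodic keep_alive() calls. The following is one possible pattern, sketched with a hypothetical `FakeMultiContextSession` stand-in; `kept_alive` is a helper invented for this example, not part of the SDK.

```python
import asyncio
import contextlib


class FakeMultiContextSession:
    """Hypothetical stand-in exposing only keep_alive()."""

    def __init__(self):
        self.keep_alive_calls = []

    async def keep_alive(self, context_id):
        self.keep_alive_calls.append(context_id)


@contextlib.asynccontextmanager
async def kept_alive(session, context_id, interval):
    """Call keep_alive(context_id) every `interval` seconds while the block runs."""

    async def pinger():
        while True:
            await asyncio.sleep(interval)
            await session.keep_alive(context_id)

    task = asyncio.create_task(pinger())
    try:
        yield
    finally:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task


async def main():
    session = FakeMultiContextSession()
    # With the default 20 s timeout, an interval of roughly 10 s would leave margin;
    # a tiny interval is used here so the demo finishes quickly.
    async with kept_alive(session, "narrator", interval=0.01):
        await asyncio.sleep(0.1)  # stand-in for waiting on slow upstream work
    return session.keep_alive_calls


calls = asyncio.run(main())
print(f"{len(calls)} keep-alive pings sent")
```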

Word Timestamps in Streaming

Word timestamps work with all streaming methods. During streaming, they are yielded as list[WordTimestamp] objects between audio chunks:
from kugelaudio.models import WordTimestamp

for item in client.tts.stream(
    text="Hello, how are you today?",
    model_id="kugel-3",
    word_timestamps=True,
):
    if hasattr(item, 'audio'):  # AudioChunk
        play_audio(item.audio)
    elif isinstance(item, list) and item and isinstance(item[0], WordTimestamp):
        for ts in item:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

Word Timestamps in Streaming Sessions

Pass word_timestamps=True to request word-level time alignments alongside the audio. Timestamps are delivered per chunk, after the corresponding audio data:
async with client.tts.streaming_session(
    voice_id=1071,
    word_timestamps=True,
) as session:
    async for chunk in session.send("Hello, how are you today?"):
        play_audio(chunk.audio)
    
    async for chunk in session.flush():
        play_audio(chunk.audio)
    
    # Access the latest word timestamps
    timestamps = session.last_word_timestamps
    for ts in timestamps:
        print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

You can also register a callback to process timestamps as they arrive:
def on_timestamps(timestamps):
    for ts in timestamps:
        print(f"  {ts.word} [{ts.start_ms}-{ts.end_ms}ms]")

async with client.tts.streaming_session(
    voice_id=1071,
    on_word_timestamps=on_timestamps,
) as session:
    async for chunk in session.send("Hello world!"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

Word timestamps add no extra audio latency. They arrive shortly after the corresponding audio chunk (see Latency) and are useful for barge-in handling, subtitle synchronization, and lip-sync.
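
As an illustration of the subtitle use case, word timestamps can be grouped into short cues. This is a minimal sketch under assumptions: `Word` is a hypothetical stand-in for WordTimestamp using only the fields shown in the examples above, `build_cues` is not an SDK function, and start_ms/end_ms are assumed to be relative to the start of the audio stream.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for WordTimestamp."""
    word: str
    start_ms: int
    end_ms: int


def _cue(words):
    """Build (start_ms, end_ms, text) from a non-empty list of words."""
    return (words[0].start_ms, words[-1].end_ms, " ".join(w.word for w in words))


def build_cues(words, max_chars=20):
    """Group consecutive words into subtitle cues of at most max_chars characters."""
    cues = []
    current = []
    for w in words:
        candidate = " ".join(x.word for x in current + [w])
        if current and len(candidate) > max_chars:
            cues.append(_cue(current))
            current = []
        current.append(w)
    if current:
        cues.append(_cue(current))
    return cues


words = [
    Word("Hello,", 0, 320),
    Word("how", 400, 520),
    Word("are", 540, 640),
    Word("you", 660, 800),
    Word("today?", 820, 1200),
]
for start, end, text in build_cues(words, max_chars=14):
    print(f"{start:>5}-{end:<5} {text}")
```

A single word longer than max_chars still gets its own cue rather than being split.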

Next steps