Streaming - KugelAudio

Streaming Audio

Receive audio chunks as they are generated for lower latency:

# Synchronous streaming
for item in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-3",
):
    if hasattr(item, 'audio'):  # AudioChunk
        # Process audio chunk immediately
        print(f"Chunk {item.index}: {len(item.audio)} bytes, {item.samples} samples")
        # play_audio(item.audio)
    elif isinstance(item, dict) and item.get('final'):
        # Final stats
        print(f"Total duration: {item.get('dur_ms', 0):.0f}ms")
        print(f"Generation time: {item.get('gen_ms', 0):.0f}ms")
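
To keep the streamed audio instead of playing it, the chunks can be appended to a WAV file as they arrive. The following is a minimal sketch, assuming each chunk's audio is raw 16-bit mono PCM at 24 kHz (an assumption; check the format your request actually produces). FakeChunk and write_wav are hypothetical names: the first stands in for the SDK's AudioChunk, the second is an illustrative sink.

```python
import wave
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeChunk:
    """Hypothetical stand-in for the SDK's AudioChunk (only .audio is used)."""
    audio: bytes


def write_wav(items: Iterable[object], path: str, sample_rate: int = 24000) -> int:
    """Write every item that carries .audio to a WAV file; return bytes written.

    Assumes raw 16-bit mono PCM, which is an assumption, not a documented fact.
    """
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)             # mono (assumed)
        wav.setsampwidth(2)             # 16-bit samples (assumed)
        wav.setframerate(sample_rate)
        for item in items:
            if hasattr(item, "audio"):  # skips the final stats dict
                wav.writeframes(item.audio)
                total += len(item.audio)
    return total
```

With a real client, the first argument would be the iterator returned by client.tts.stream(...); the final stats dict is skipped because it has no audio attribute.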

Async Streaming

For async applications:

import asyncio

async def generate_speech():
    async for item in client.tts.stream_async(
        text="Async streaming example.",
        model_id="kugel-3",
    ):
        if hasattr(item, 'audio'):
            # Process chunk
            pass

asyncio.run(generate_speech())
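
If playback blocks, calling it directly inside the async for loop delays reading the next chunk. One common pattern is to decouple receiving from playback with an asyncio.Queue and to run the blocking sink in a worker thread. The sketch below uses a hypothetical fake_stream() generator in place of client.tts.stream_async(...) and a list append in place of a real audio sink; pump is an illustrative name, not an SDK function.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def fake_stream() -> AsyncIterator[bytes]:
    """Hypothetical stand-in for client.tts.stream_async(...): yields audio bytes."""
    for i in range(3):
        await asyncio.sleep(0)          # pretend to wait on the network
        yield bytes([i]) * 4


async def pump(stream: AsyncIterator[bytes], play: Callable[[bytes], None]) -> int:
    """Receive chunks while a consumer task plays them; return the number played."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=32)   # bounded to cap memory
    played = 0

    async def consumer() -> None:
        nonlocal played
        while True:
            audio = await queue.get()
            if audio is None:           # sentinel: the producer has finished
                return
            # Run the (possibly blocking) sink in a thread so receiving continues.
            await asyncio.to_thread(play, audio)
            played += 1

    task = asyncio.create_task(consumer())
    async for audio in stream:
        await queue.put(audio)
    await queue.put(None)
    await task
    return played


received: List[bytes] = []
count = asyncio.run(pump(fake_stream(), received.append))
```

The bounded queue applies back-pressure: if playback falls more than 32 chunks behind, the receiving loop waits instead of buffering without limit.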

LLM Integration: Streaming Sessions

For real-time TTS while text is still streaming in from an LLM (such as GPT-4 or Claude):

Async Streaming Session

import asyncio

async def stream_from_llm():
    # Simulate LLM token stream
    llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]
    
    async with client.tts.streaming_session(
        voice_id=1071,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of no input
    ) as session:
        # Send tokens as they arrive from LLM
        for token in llm_tokens:
            async for chunk in session.send(token):
                # Play audio chunk immediately
                play_audio(chunk.audio)
        
        # Flush any remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)

asyncio.run(stream_from_llm())
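
In a real application the tokens come from an async iterator rather than a list. The sketch below shows one way to wire such an iterator into a session; it relies only on the send() and flush() calls used above. FakeSession, Chunk, fake_llm, and speak are hypothetical stand-ins so the example runs without a network connection, and the fake "synthesizes" at sentence ends purely for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, List


@dataclass
class Chunk:
    """Hypothetical stand-in for AudioChunk (only .audio is used)."""
    audio: bytes


class FakeSession:
    """Hypothetical stand-in exposing the send()/flush() shape used above."""

    def __init__(self) -> None:
        self._buffer = ""

    async def send(self, text: str) -> AsyncIterator[Chunk]:
        self._buffer += text
        # Illustration only: emit "audio" once a sentence end is buffered.
        if self._buffer.rstrip().endswith((".", "!", "?")):
            async for chunk in self.flush():
                yield chunk

    async def flush(self) -> AsyncIterator[Chunk]:
        if self._buffer:
            text, self._buffer = self._buffer, ""
            yield Chunk(text.encode("utf-8"))   # pretend the bytes are audio


async def fake_llm() -> AsyncIterator[str]:
    """Hypothetical token stream; a real one would come from an LLM client."""
    for token in ["Hello, ", "world. ", "Still ", "typing"]:
        await asyncio.sleep(0)
        yield token


async def speak(tokens: AsyncIterator[str], session, play: Callable[[bytes], None]) -> None:
    """Forward each token as it arrives, then flush whatever is left."""
    async for token in tokens:
        async for chunk in session.send(token):
            play(chunk.audio)
    async for chunk in session.flush():
        play(chunk.audio)


played: List[bytes] = []
asyncio.run(speak(fake_llm(), FakeSession(), played.append))
```

The final flush() matters: the last tokens here do not end a sentence, so without it they would stay in the buffer.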

Synchronous Streaming Session

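# Same simulated token stream as in the async example
llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]

with client.tts.streaming_session_sync(voice_id=1071) as session: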
    for token in llm_tokens:
        for chunk in session.send(token):
            play_audio(chunk.audio)
    
    for chunk in session.flush():
        play_audio(chunk.audio)

Session Reuse

End a session without closing the WebSocket to avoid reconnection overhead when starting a new session (see Turn lifecycle):

session = await client.tts.streaming_session(voice_id=1071)

# Session 1
async for chunk in session.send("Hello from voice one."):
    play_audio(chunk.audio)
await session.end_session()  # Keeps WebSocket open

# Session 2 — no reconnection needed
session.update_config(voice_id=1072)
async for chunk in session.send("Hello from voice two."):
    play_audio(chunk.audio)

await session.close()  # Closes session + WebSocket

Barge-in (interrupt the current turn)

When the end user speaks over the agent, call cancel_current() to stop generating the current turn immediately and drop any buffered/queued text — without closing the WebSocket. Unlike end_session(), no remaining text is flushed; the turn is abandoned. The socket stays open so the next send() starts the next turn right away.

session = await client.tts.streaming_session(voice_id=1071)

async for chunk in session.send("This is a very long answer the user talks over"):
    play_audio(chunk.audio)

# VAD detected the user speaking — barge in:
await session.cancel_current()

# Socket still open — next turn starts immediately:
async for chunk in session.send("Sure, what would you like instead?", flush=True):
    play_audio(chunk.audio)

cancel_current() returns once the server acknowledges, or after a short quiet timeout if the server goes silent. Stop local playback as soon as you call it — a few in-flight frames may arrive before the acknowledgement. See Barge-in for the full protocol. The synchronous wrapper exposes cancel_current() too.
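
Because a few frames can still arrive after cancel_current() is called, the playback side needs its own switch. One way to sketch this is a small gate that drops chunks while a barge-in is pending. PlaybackGate and its method names are hypothetical and not part of the SDK.

```python
from typing import Callable, List


class PlaybackGate:
    """Hypothetical helper: forwards audio to a sink unless muted by a barge-in."""

    def __init__(self, sink: Callable[[bytes], None]) -> None:
        self._sink = sink
        self._muted = False
        self.dropped = 0

    def interrupt(self) -> None:
        """Call right before session.cancel_current(): stop playing at once."""
        self._muted = True

    def resume(self) -> None:
        """Call before sending the next turn, after the cancel has returned."""
        self._muted = False

    def play(self, audio: bytes) -> None:
        if self._muted:
            self.dropped += 1       # late frame from the abandoned turn
            return
        self._sink(audio)


out: List[bytes] = []
gate = PlaybackGate(out.append)
gate.play(b"turn-1")
gate.interrupt()
gate.play(b"late-frame")            # arrives before the server acknowledges
gate.resume()
gate.play(b"turn-2")
```

A real player would also need to clear any audio it has already queued for output; the gate only stops new frames from reaching it.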

Streaming session reference

A session is created with streaming_session(...) (async) or streaming_session_sync(...) (sync). Both accept the same configuration: voice_id, model_id, cfg_scale, temperature, max_new_tokens, sample_rate, flush_timeout_ms, normalize, language, word_timestamps, speed, and an on_word_timestamps callback. The async StreamingSession exposes:

Method	Returns	Description
`await session.connect()`	`None`	Open and authenticate the WebSocket. Called automatically by the first `send()` and by `async with`.
`session.send(text, flush=False)`	`AsyncIterator[AudioChunk]`	Buffer `text` and yield audio as it is generated. `flush=True` forces synthesis of whatever is buffered.
`session.flush()`	`AsyncIterator[AudioChunk]`	Flush the buffer and yield remaining audio for the current turn.
`session.drain()`	`AsyncIterator[AudioChunk]`	Signal end-of-input and yield every remaining chunk until the server goes idle.
`await session.end_session()`	`dict`	End the current turn (flushing remaining text) but keep the WebSocket open for reuse.
`await session.cancel_current()`	`None`	Barge-in: abandon the current turn and drop buffered/queued text, keeping the socket open.
`session.update_config(config=None, **kwargs)`	`None`	Update configuration (e.g. `voice_id`) for the next session after `end_session()`.
`await session.close()`	`dict`	Close the session and the WebSocket.
`session.last_word_timestamps`	`list[WordTimestamp]`	The most recently received word timestamps.
`session.last_final`	`dict | None`	End-of-audio stats from the most recently completed turn: the server’s `{"final": true, ...}` frame (ElevenLabs `isFinal` equivalent), sent after the turn’s last audio frame. `None` before the first turn completes; not updated on a barge-in cancel.
`session.last_usage`	`SessionUsage | None`	Per-session usage (audio time + amount charged) from the most recently closed session, for billing your own customers per conversation. `None` before the first session closes. See SessionUsage.

StreamingSessionSync mirrors the async API without await/async for: send(), flush(), and drain() return list[AudioChunk]; cancel_current(), close(), and the last_word_timestamps / last_final / last_usage properties behave the same.

Tuning streaming latency

By default the server accumulates LLM tokens and only begins generating at natural sentence boundaries. Tune how eagerly it starts with these session parameters:

Parameter	Type	Default	Description
`flush_timeout_ms`	`int`	`500`	Server-side auto-flush timeout — emit buffered text after this many milliseconds of no new input.
`chunk_length_schedule`	`list[int] | None`	server default `[5, 80, 150, 250]`	Minimum buffer size (characters) before each successive auto-chunk is emitted. Entry `i` applies to chunk `i`; the last value repeats. Smaller values lower time-to-first-audio; larger values improve prosody.
`auto_mode`	`bool | None`	`None`	Start generating at the very first clean sentence boundary (equivalent to ElevenLabs’ `auto_mode`). Lowest TTFA, slightly less prosody context.
`max_buffer_length`	`int`	`1000`	Maximum characters buffered before a forced flush.
`dictionary_ids`	`list[int] | None`	`None`	Per-session dictionary selection, applied to every turn. `None` = all active project dictionaries (language-filtered); `[]` = none; a list = exactly those (including inactive ones), bypassing the language filter.

chunk_length_schedule, auto_mode, and max_buffer_length are set by constructing a StreamConfig and passing it where a config is accepted, or via session.update_config(...):

from kugelaudio.models import StreamConfig

session = await client.tts.streaming_session(voice_id=1071)
session.update_config(StreamConfig(
    voice_id=1071,
    auto_mode=True,
    chunk_length_schedule=[50, 100, 150, 250],  # min characters before chunks 1, 2, 3, and 4+
))
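
The chunk_length_schedule rule (entry i applies to chunk i, and the last value repeats) can be illustrated with a small client-side simulation. This is only a sketch of the documented rule: the helper names are hypothetical, and it ignores sentence boundaries, flush_timeout_ms, and max_buffer_length, all of which also influence when the real server emits a chunk.

```python
from typing import List


def min_chars_for_chunk(schedule: List[int], index: int) -> int:
    """Threshold for chunk `index` (0-based); the last entry repeats."""
    return schedule[min(index, len(schedule) - 1)]


def simulate_chunks(tokens: List[str], schedule: List[int]) -> List[str]:
    """Group tokens into chunks once the buffer reaches each threshold."""
    chunks: List[str] = []
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= min_chars_for_chunk(schedule, len(chunks)):
            chunks.append(buffer)
            buffer = ""
    if buffer:                      # explicit flush at end of input
        chunks.append(buffer)
    return chunks


tokens = ["word "] * 12             # 60 characters in 5-character tokens
eager = simulate_chunks(tokens, [5, 20])
patient = simulate_chunks(tokens, [30, 30])
```

In this simulation the eager schedule emits its first chunk after 5 characters, while the patient one waits for 30, which is the latency versus context trade-off the table describes.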

Multi-Context Sessions

A multi-context session manages up to 20 independent audio-generation contexts over a single WebSocket (see limits). Each context has its own text buffer, voice settings, and generation queue — useful for multi-speaker conversations, pre-buffering one stream while another plays, or interleaving audio for dynamic dialogue.

async with client.tts.multi_context_session(language="en") as session:
    # Create contexts, optionally with different voices
    await session.create_context("narrator", voice_id=1071)
    await session.create_context("character", voice_id=1072)

    # Send text to a specific context
    async for chunk in session.send("narrator", "The story begins."):
        play_audio(chunk.audio)

    async for chunk in session.send("character", "Hello there!", flush=True):
        play_audio(chunk.audio)

    # Drain remaining audio and close one context
    async for chunk in session.close_context("narrator"):
        play_audio(chunk.audio)

Create the session with multi_context_session(...):

Parameter	Type	Default	Description
`default_voice_id`	`int | None`	`None`	Default voice for contexts that don’t override it.
`model_id`	`str | None`	`None`	Model to use.
`sample_rate`	`int`	`24000`	Output sample rate.
`output_format`	`str | None`	`None`	Combined codec + rate token (`pcm_8000`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `ulaw_8000`, `alaw_8000`).
`cfg_scale`	`float`	`2.0`	Guidance scale.
`temperature`	`float | None`	`None`	Sampling variance.
`max_new_tokens`	`int`	`2048`	Maximum tokens per generation.
`normalize`	`bool`	`True`	Enable text normalization.
`language`	`str | None`	`None`	Normalization language.
`inactivity_timeout`	`float`	`20.0`	Seconds before an idle context auto-closes.
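
The output_format tokens listed above combine a codec and a sample rate. When decoding the audio on the client, it can help to split a token back into its parts. parse_output_format is a hypothetical helper, and it assumes every token follows the codec_rate pattern of the six values in the table.

```python
from typing import Tuple

# The tokens listed in the parameter table above.
KNOWN_FORMATS = {
    "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000",
    "ulaw_8000", "alaw_8000",
}


def parse_output_format(token: str) -> Tuple[str, int]:
    """Split a token such as 'pcm_24000' into ('pcm', 24000)."""
    if token not in KNOWN_FORMATS:
        raise ValueError(f"unknown output_format: {token!r}")
    codec, rate = token.rsplit("_", 1)
    return codec, int(rate)


codec, rate = parse_output_format("ulaw_8000")
```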

MultiContextSession methods:

Method	Returns	Description
`await session.connect()`	`None`	Open the WebSocket. Called automatically by `async with`.
`await session.create_context(context_id, voice_id=None)`	`None`	Create a context with an optional voice override.
`session.send(context_id, text, flush=False, chunk_complete_idle_timeout=None)`	`AsyncIterator[AudioChunk]`	Send text to a context and yield its audio.
`session.flush(context_id)`	`AsyncIterator[AudioChunk]`	Flush a context’s buffer.
`session.close_context(context_id, immediate=False)`	`AsyncIterator[AudioChunk]`	Close a context and drain its audio. `immediate=True` barges in, discarding buffered/queued text.
`await session.keep_alive(context_id)`	`None`	Reset a context’s inactivity timeout.
`await session.close()`	`dict`	Close the session and return stats.
`session.get_word_timestamps(context_id)`	`list[WordTimestamp]`	Latest word timestamps for a context.
`session.usage_for(context_id)`	`SessionUsage | None`	Per-context usage (audio time + amount charged) for a closed context; each context is its own conversation. `None` until that context closes. See SessionUsage.
`session.context_usage`	`dict[str, SessionUsage]`	Map of `context_id` → usage for every context closed so far.
`session.active_contexts`	`set[str]`	The set of currently active context IDs.
`session.session_id`	`str | None`	Server-assigned session ID.
`session.is_alive`	`bool`	Whether the underlying WebSocket is still usable for `send()`.

Pass on_word_timestamps=callback to multi_context_session(...) to receive (context_id, list[WordTimestamp]) as timestamps arrive.
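
Since the callback receives the context ID together with the timestamps, a single callable can keep per-context transcripts apart. The sketch below assumes the callback is invoked as callback(context_id, timestamps), as described above. WordStamp is a hypothetical stand-in for WordTimestamp carrying only the fields used in this page's examples, and TimestampCollector is an illustrative name.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass
class WordStamp:
    """Hypothetical stand-in for WordTimestamp (word, start_ms, end_ms)."""
    word: str
    start_ms: int
    end_ms: int


class TimestampCollector:
    """Callable that groups incoming word timestamps by context ID."""

    def __init__(self) -> None:
        self.by_context: DefaultDict[str, List[WordStamp]] = defaultdict(list)

    def __call__(self, context_id: str, timestamps: List[WordStamp]) -> None:
        self.by_context[context_id].extend(timestamps)

    def transcript(self, context_id: str) -> str:
        return " ".join(ts.word for ts in self.by_context[context_id])


collector = TimestampCollector()
# Simulated callback invocations; with the SDK, the session would make these calls.
collector("narrator", [WordStamp("The", 0, 120), WordStamp("story", 120, 480)])
collector("character", [WordStamp("Hello", 0, 300)])
collector("narrator", [WordStamp("begins.", 480, 900)])
```

With a real session, the collector instance would be passed as on_word_timestamps=collector.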

Word Timestamps in Streaming

Word timestamps work with all streaming methods. During streaming, they are yielded as list[WordTimestamp] objects between audio chunks:

from kugelaudio.models import WordTimestamp

for item in client.tts.stream(
    text="Hello, how are you today?",
    model_id="kugel-3",
    word_timestamps=True,
):
    if hasattr(item, 'audio'):  # AudioChunk
        play_audio(item.audio)
    elif isinstance(item, list) and item and isinstance(item[0], WordTimestamp):
        for ts in item:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

Word Timestamps in Streaming Sessions

Request word-level time alignments alongside audio. Timestamps are delivered per chunk after the corresponding audio data:

async with client.tts.streaming_session(
    voice_id=1071,
    word_timestamps=True,
) as session:
    async for chunk in session.send("Hello, how are you today?"):
        play_audio(chunk.audio)
    
    async for chunk in session.flush():
        play_audio(chunk.audio)
    
    # Access the latest word timestamps
    timestamps = session.last_word_timestamps
    for ts in timestamps:
        print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

You can also register a callback to process timestamps as they arrive:

def on_timestamps(timestamps):
    for ts in timestamps:
        print(f"  {ts.word} [{ts.start_ms}-{ts.end_ms}ms]")

async with client.tts.streaming_session(
    voice_id=1071,
    on_word_timestamps=on_timestamps,
) as session:
    async for chunk in session.send("Hello world!"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

Word timestamps add no extra audio latency. They arrive shortly after the corresponding audio chunk (see Latency) and are useful for barge-in handling, subtitle synchronization, and lip-sync.
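
For subtitle synchronization, word timestamps can be grouped into cues. The sketch below formats cues in the SubRip (SRT) layout, assuming start_ms and end_ms are millisecond offsets from the start of the audio (the examples above print them with an ms suffix). Word is a hypothetical stand-in for WordTimestamp, and the fixed words-per-cue grouping is an arbitrary choice made for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for WordTimestamp (word, start_ms, end_ms)."""
    word: str
    start_ms: int
    end_ms: int


def srt_time(ms: int) -> str:
    """Format milliseconds as HH:MM:SS,mmm."""
    hours, rest = divmod(int(ms), 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(words: List[Word], words_per_cue: int = 3) -> str:
    """Group consecutive words into numbered SRT cues."""
    cues = []
    for n, i in enumerate(range(0, len(words), words_per_cue), start=1):
        group = words[i:i + words_per_cue]
        text = " ".join(w.word for w in group)
        span = f"{srt_time(group[0].start_ms)} --> {srt_time(group[-1].end_ms)}"
        cues.append(f"{n}\n{span}\n{text}\n")
    return "\n".join(cues)


words = [
    Word("Hello,", 0, 400), Word("how", 450, 600), Word("are", 620, 750),
    Word("you", 760, 900), Word("today?", 920, 1500),
]
print(to_srt(words))
```

A production subtitle writer would more likely break cues at punctuation or after a maximum duration instead of a fixed word count.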

Next steps

Types & Errors — AudioChunk, StreamConfig, SessionUsage, WordTimestamp
Text Normalization — languages and spell tags in streaming

