/ws/tts/stream is one logical TTS request per turn, regardless of how many send calls you make. The server’s text buffer accumulates tokens and hands a complete chunk to the model the moment it sees a natural boundary (sentence punctuation, or the configured chunk_length_schedule threshold). Inside a single turn, model state (KV cache, voice conditioning) is preserved across chunks, so prosody stays natural.
Sending flush=true mid-turn breaks that flow: the server treats the flush as a hard segment boundary, runs another full model prefill on whatever has been buffered, and only then emits audio. That prefill costs the full model time-to-first-audio (TTFA; see Latency), the same cost you pay on the very first chunk of a turn. Flush on every word and you pay model TTFA on every word.
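To make that cost concrete, here is a minimal sketch that counts how many times one turn would pay model TTFA. It is a toy model, not the server's implementation: `count_ttfa_payments` and the `(text, flush)` tuple format are hypothetical names introduced for illustration, and the sketch assumes that every flush with buffered text forms one hard segment and that trailing unflushed text is synthesized as one final segment.

```python
from typing import Iterable, Tuple


def count_ttfa_payments(sends: Iterable[Tuple[str, bool]]) -> int:
    """Count hard segments in one turn; each is assumed to cost one model TTFA."""
    payments = 0
    buffered = False  # True while un-flushed text is waiting in the buffer
    for text, flush in sends:
        if text.strip():
            buffered = True
        if flush and buffered:
            payments += 1  # hard boundary: full prefill on the buffered text
            buffered = False
    if buffered:
        payments += 1  # trailing text is synthesized as one final segment
    return payments


words = "Hello there. How can I help you today?".split()

# Anti-pattern: flush after every word.
per_word_flush = [(w + " ", True) for w in words]

# Recommended: send tokens as they arrive, flush once at the end of the turn.
streamed = [(w + " ", False) for w in words[:-1]] + [(words[-1], True)]

print(count_ttfa_payments(per_word_flush))
print(count_ttfa_payments(streamed))
```

Under these assumptions the per-word pattern pays once per word, while the streamed pattern pays once for the whole turn.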

Chunk-size ordering — pick the largest you can

If you’re driving the session from a layer above raw LLM tokens (for example, a translation pipeline that emits clauses, or a router that batches output before sending), use the largest chunks you can. The ordering, from best to worst time-to-first-audio per emitted segment, is:
Chunk granularity | Verdict
--- | ---
Full turn in one send | Best possible. Use when the full text is available before TTS starts.
Sentence-level chunks | Recommended for streamed LLM output.
≥20-character chunks | Acceptable fallback when sentence boundaries aren’t yet available.
Clause-level (comma/semicolon) | Avoid. Each chunk pays model TTFA.
Word-level or sub-word | Don’t. Each chunk pays model TTFA; by far the most expensive shape.

Two important nuances:
  • Raw LLM tokens are fine as long as you send them without flush=true — the server’s text buffer reassembles them and only hands sentence-sized work to the model. The “word-level is bad” row above applies when you flush after each word, not when you send one word at a time without flushing.
  • We deliberately don’t publish exact ms figures here — they depend on region, voice, and deployment. The ordering is stable; the absolute numbers aren’t. To reproduce the comparison for your own deployment, run TTFABench.chunkingStrategyBench against your endpoint — see Measuring TTFA correctly.
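For a quick sanity check before running the full benchmark, TTFA can be approximated by timing the gap between starting a stream and receiving its first audio chunk. The sketch below is generic and hypothetical: `measure_ttfa` and `fake_stream` are illustration names, not part of the SDK or of TTFABench, and the fake stream stands in for whatever async iterator of audio chunks your client returns.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def measure_ttfa(stream: AsyncIterator[bytes]) -> Tuple[Optional[float], int]:
    """Drain the stream; return (seconds until first chunk, number of chunks)."""
    start = time.perf_counter()
    ttfa = None
    chunks = 0
    async for _chunk in stream:
        if ttfa is None:
            ttfa = time.perf_counter() - start  # first audio has arrived
        chunks += 1
    return ttfa, chunks


async def fake_stream(first_delay: float, n_chunks: int) -> AsyncIterator[bytes]:
    """Stand-in for a TTS audio stream: one slow first chunk, then fast ones."""
    await asyncio.sleep(first_delay)
    for _ in range(n_chunks):
        yield b"\x00" * 480
        await asyncio.sleep(0)


ttfa, chunks = asyncio.run(measure_ttfa(fake_stream(0.05, 3)))
print(f"TTFA: {ttfa * 1000:.0f} ms over {chunks} chunks")
```

To compare chunking strategies, you would wrap each strategy's real audio stream in the same helper and compare the first value of the returned tuple.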

Tuning auto-chunking

You rarely need this, but two config parameters let you trade prosody context for lower first-chunk latency, without any client-side flushing:
Parameter | Type | Default | Effect
--- | --- | --- | ---
chunk_length_schedule / chunkLengthSchedule | list[int] | [5, 80, 150, 250] | Minimum chars buffered before each successive chunk is emitted. Entry i applies to chunk i; the last value repeats. Smaller = faster TTFA; larger = better prosody.
auto_mode / autoMode | bool | false | Start at the very first clean sentence boundary (equivalent to ElevenLabs auto_mode=true). Lowest TTFA.

Use the defaults unless you’ve measured a problem.
# Low-latency preset (voice assistants, chatbots)
async with client.tts.streaming_session(
    voice_id=1071,
    auto_mode=True,
    chunk_length_schedule=[50, 100, 150, 250],
) as session:
    async for token in llm_stream:
        async for chunk in session.send(token):
            play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

# High-quality preset (narration, long-form)
async with client.tts.streaming_session(
    voice_id=1071,
    chunk_length_schedule=[120, 200, 300],
) as session:
    ...
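
The schedule rule described above (entry i applies to chunk i, and the last value repeats) can be pictured with a small simulation. This is a toy model under stated assumptions, not the server's algorithm: `threshold_for` and `simulate_chunks` are hypothetical names, and the sketch assumes a chunk is emitted at the first sentence-ending punctuation once the buffer holds at least the scheduled number of characters.

```python
from typing import Iterable, List

SENTENCE_END = (".", "!", "?")


def threshold_for(schedule: List[int], chunk_index: int) -> int:
    """Entry i applies to chunk i; past the end, the last value repeats."""
    return schedule[min(chunk_index, len(schedule) - 1)]


def simulate_chunks(tokens: Iterable[str], schedule: List[int]) -> List[str]:
    """Toy buffer: emit at a sentence end once enough chars are buffered."""
    chunks: List[str] = []
    buffer = ""
    for token in tokens:
        buffer += token
        long_enough = len(buffer) >= threshold_for(schedule, len(chunks))
        if long_enough and buffer.rstrip().endswith(SENTENCE_END):
            chunks.append(buffer.strip())
            buffer = ""
    if buffer.strip():
        chunks.append(buffer.strip())  # final flush at the end of the turn
    return chunks


tokens = ["Hi. ", "I can ", "help with ", "that. ", "What do ", "you need?"]

# Small first entry: the first short sentence is emitted on its own.
print(simulate_chunks(tokens, [3, 80]))

# Larger first entry: the buffer keeps accumulating for more prosody context.
print(simulate_chunks(tokens, [50, 80]))
```

In this model a small first entry lets "Hi." go out immediately, while a first entry of 50 holds everything until the end-of-turn flush, which is the latency versus prosody trade-off the schedule controls.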

Per-segment latency

For real-time voice agents, per-segment latency (time from sentence boundary to first audio of the next sentence) matters as much as initial TTFA. Two parameters let you trade audio quality for speed:
Parameter | Type | Effect
--- | --- | ---
optimize_streaming_latency / optimizeStreamingLatency | bool | Halve the default diffusion steps for faster per-segment audio. Default: false
num_diffusion_steps / numDiffusionSteps | int | Explicit override for diffusion denoising steps (1-50). Lower = faster but lower quality.

# Fastest per-segment latency (voice agents, real-time conversations)
async with client.tts.streaming_session(
    voice_id=1071,
    auto_mode=True,
    optimize_streaming_latency=True,
) as session:
    async for token in llm_stream:
        async for chunk in session.send(token):
            play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

# Fine-tuned control: explicit diffusion steps
async with client.tts.streaming_session(
    voice_id=1071,
    auto_mode=True,
    num_diffusion_steps=5,  # fewer steps = lower latency
) as session:
    ...
optimize_streaming_latency typically reduces per-segment latency by ~40-50% with a modest quality trade-off that is acceptable for real-time voice conversations. For maximum quality (narration, podcasts), leave it disabled.

Handle backpressure

If audio arrives faster than you can play it, bound your buffer instead of letting it grow:
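import asyncio

async def stream_with_backpressure(text, model, max_buffered_chunks=10):
    # Bounded queue: once it is full, put() blocks the producer, so we stop
    # pulling from the stream until playback catches up.
    buffer = asyncio.Queue(maxsize=max_buffered_chunks)

    async def producer():
        async for chunk in client.tts.stream_async(text=text, model=model):
            audio = getattr(chunk, "audio", None)
            if audio:  # skip events that carry no audio
                await buffer.put(audio)
        await buffer.put(None)  # sentinel: end of stream

    async def consumer():
        while True:
            audio = await buffer.get()
            if audio is None:
                break
            play_audio(audio)
            # Simulated playback time, assuming 16-bit mono PCM at 24 kHz
            # (2 bytes per sample). Remove if play_audio already blocks.
            await asyncio.sleep(len(audio) / 2 / 24000)

    await asyncio.gather(producer(), consumer())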

Common mistakes

  • Per-segment flush=true. Every flush forces a hard segment boundary and a fresh model prefill, which pays the full model TTFA. If you flush after every sentence, you pay it N times per turn instead of once.
  • One session per sentence. That is a new WebSocket handshake plus a fresh model prefill for every sentence. Keep the same session open for the whole assistant turn and only end it when the turn ends (see Turn lifecycle).
  • Client-side sentence buffering before send. This is unnecessary: the server already buffers tokens and chunks at sentence boundaries. Pre-buffering on the client just adds latency.
  • Calling send(text, flush=true) per word “for lower latency.” It does the opposite: each flush triggers a separate model prefill, so word-granular flushing produces the worst possible TTFA.
If you’re migrating from ElevenLabs, the flush semantics are the biggest behavioral difference — see the ElevenLabs migration guide.

Next steps

  • Latency. The numbers: what to expect, and how to measure TTFA correctly
  • Turn lifecycle. Flush semantics, the 5 s idle auto-flush, session reuse