Chunking & per-segment latency

/ws/tts/stream is one logical TTS request per turn, regardless of how many send calls you make. The server’s text buffer accumulates tokens and hands a complete chunk to the model the moment it sees a natural boundary (sentence punctuation, or the configured chunk_length_schedule threshold). Inside a single turn, model state (KV cache, voice conditioning) is preserved across chunks so prosody stays natural. Calling flush=true mid-turn breaks that flow: the server treats the flush as a hard segment boundary, runs another full model prefill on whatever has been buffered, and only then emits audio. The cost of that prefill is the full model time-to-first-audio (see Latency) — the same cost you pay on the very first chunk of a turn. Do it on every word and you pay model TTFA on every word.
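To make the cost of mid-turn flushing concrete, here is a minimal toy model of the buffering behavior described above. It is a sketch, not the server's implementation: the class `ToyTextBuffer`, the punctuation regex, and the rule that a flush turns the whole buffer into one cold segment are all assumptions made for illustration, and the `chunk_length_schedule` threshold is ignored. It only counts how many full prefills a given sending pattern would trigger.

```python
import re
from dataclasses import dataclass, field

# Assumed boundary rule: ., ! or ? followed by whitespace or end of buffer.
SENTENCE_END = re.compile(r"[.!?](?:\s|$)")


@dataclass
class ToyTextBuffer:
    """Toy stand-in for the server-side text buffer (hypothetical)."""
    pending: str = ""
    prefills: int = 0   # full model prefills paid so far
    warm: bool = False  # True while model state can be reused
    chunks: list = field(default_factory=list)

    def _emit(self, text: str) -> None:
        text = text.strip()
        if not text:
            return
        if not self.warm:
            self.prefills += 1  # cold segment: pays full model TTFA
            self.warm = True
        self.chunks.append(text)

    def send(self, text: str, flush: bool = False) -> None:
        self.pending += text
        if flush:
            # Hard segment boundary: the buffered text starts cold.
            self.warm = False
            self._emit(self.pending)
            self.pending = ""
            return
        # No flush: hand over each complete sentence, keep the remainder.
        while (m := SENTENCE_END.search(self.pending)):
            self._emit(self.pending[: m.end()])
            self.pending = self.pending[m.end():]


def count_prefills(tokens, flush_each: bool) -> int:
    buf = ToyTextBuffer()
    for token in tokens:
        buf.send(token, flush=flush_each)
    buf.send("", flush=True)  # single end-of-turn flush
    return buf.prefills


words = ["Hello ", "there. ", "How ", "are ", "you?"]
print(count_prefills(words, flush_each=False))  # one prefill for the turn
print(count_prefills(words, flush_each=True))   # one prefill per word
```

In this toy model, the unflushed stream pays a single prefill for the whole turn, while flushing after each of the five words pays five.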

Chunk-size ordering — pick the largest you can

If you’re driving the session from a layer above raw LLM tokens (for example, a translation pipeline that emits clauses, or a router that batches output before sending), use the largest chunks you can. The ordering, from best to worst time-to-first-audio per emitted segment, is:

Chunk granularity	Verdict
Full turn in one `send`	Best possible. Use when the full text is available before TTS starts.
Sentence-level chunks	Recommended for streamed LLM output.
≥20-character chunks	Acceptable fallback when sentence boundaries aren’t yet available.
Clause-level (comma/semicolon)	Avoid. Each chunk pays model TTFA.
Word-level or sub-word	Don’t. Each chunk pays model TTFA — by far the most expensive shape.

Two important nuances:

Raw LLM tokens are fine as long as you send them without flush=true — the server’s text buffer reassembles them and only hands sentence-sized work to the model. The “word-level is bad” row above applies when you flush after each word, not when you send one word at a time without flushing.
We deliberately don’t publish exact ms figures here — they depend on region, voice, and deployment. The ordering is stable; the absolute numbers aren’t. To reproduce the comparison for your own deployment, run TTFABench.chunkingStrategyBench against your endpoint — see Measuring TTFA correctly.
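If you want a quick sanity check before running the full benchmark, a generic timing helper along these lines may be enough. This is not the TTFABench tool: `time_to_first_chunk` and `fake_stream` are hypothetical names, and the fake stream only stands in for whatever async iterator of audio chunks your client returns.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def time_to_first_chunk(stream: AsyncIterator[bytes]) -> Optional[float]:
    """Return seconds until the first item arrives, or None if the stream is empty.

    The rest of the stream is drained so the underlying request completes.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    async for _ in stream:
        if first is None:
            first = time.perf_counter() - start
    return first


async def fake_stream(delay_s: float, n_chunks: int = 3):
    """Hypothetical stand-in for an audio stream; replace with real output."""
    await asyncio.sleep(delay_s)  # pretend model time-to-first-audio
    for _ in range(n_chunks):
        yield b"\x00" * 480


if __name__ == "__main__":
    ttfa = asyncio.run(time_to_first_chunk(fake_stream(0.05)))
    print(f"time to first chunk: {ttfa * 1000:.0f} ms")
```

Run each chunking strategy several times and compare medians rather than single samples, since network jitter can easily exceed the differences you are trying to measure.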

Tuning auto-chunking

You rarely need this, but two config parameters let you trade prosody context for lower first-chunk latency, without any client-side flushing:

Parameter	Type	Default	Effect
`chunk_length_schedule` / `chunkLengthSchedule`	`list[int]`	`[5, 80, 150, 250]`	Minimum chars buffered before each successive chunk is emitted. Entry `i` applies to chunk `i`; the last value repeats. Smaller = faster TTFA; larger = better prosody.
`auto_mode` / `autoMode`	`bool`	`false`	Start generating at the first clean sentence boundary instead of waiting for the schedule threshold (equivalent to ElevenLabs `auto_mode=true`). Lowest TTFA.

Use the defaults unless you’ve measured a problem.
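The "entry `i` applies to chunk `i`; the last value repeats" rule can be sketched as follows. This is an illustration only: the function names are hypothetical, chunk indices are assumed to be 0-based, and the toy splitter cuts purely on buffered length, ignoring the sentence-boundary logic the server also applies.

```python
from typing import List


def threshold_for_chunk(schedule: List[int], chunk_index: int) -> int:
    """Minimum characters to buffer before chunk `chunk_index` (assumed 0-based)."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    # Past the end of the schedule, the last value repeats.
    return schedule[min(chunk_index, len(schedule) - 1)]


def split_by_schedule(tokens: List[str], schedule: List[int]) -> List[str]:
    """Toy splitter: emit a chunk once the buffer reaches the current threshold."""
    chunks: List[str] = []
    buf = ""
    for token in tokens:
        buf += token
        if len(buf) >= threshold_for_chunk(schedule, len(chunks)):
            chunks.append(buf)
            buf = ""
    if buf:
        chunks.append(buf)  # leftover text, as an end-of-turn flush would emit
    return chunks


print([threshold_for_chunk([5, 80, 150, 250], i) for i in range(6)])
# → [5, 80, 150, 250, 250, 250]
```

A small first entry lets the first chunk start early, while the larger later entries give the model more context per chunk once audio is already playing.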

# Low-latency preset (voice assistants, chatbots)
async with client.tts.streaming_session(
    voice_id=1071,
    auto_mode=True,
    chunk_length_schedule=[50, 100, 150, 250],
) as session:
    async for token in llm_stream:
        async for chunk in session.send(token):
            play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

# High-quality preset (narration, long-form)
async with client.tts.streaming_session(
    voice_id=1071,
    chunk_length_schedule=[120, 200, 300],
) as session:
    ...

Per-segment latency

For real-time voice agents, per-segment latency (time from sentence boundary to first audio of the next sentence) matters as much as initial TTFA. Two parameters let you trade audio quality for speed:

Parameter	Type	Effect
`optimize_streaming_latency` / `optimizeStreamingLatency`	`bool`	Halve the default diffusion steps for faster per-segment audio. Default: `false`
`num_diffusion_steps` / `numDiffusionSteps`	`int`	Explicit override for diffusion denoising steps (1-50). Lower = faster but lower quality.

# Fastest per-segment latency (voice agents, real-time conversations)
async with client.tts.streaming_session(
    voice_id=1071,
    auto_mode=True,
    optimize_streaming_latency=True,
) as session:
    async for token in llm_stream:
        async for chunk in session.send(token):
            play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

# Fine-tuned control: explicit diffusion steps
async with client.tts.streaming_session(
    voice_id=1071,
    auto_mode=True,
    num_diffusion_steps=5,  # fewer steps = lower latency
) as session:
    ...

optimize_streaming_latency typically reduces per-segment latency by ~40-50% with a modest quality trade-off that is acceptable for real-time voice conversations. For maximum quality (narration, podcasts), leave it disabled.
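One way to see why halving the steps yields somewhat less than a 50% reduction is a simple linear cost model: a fixed per-segment overhead plus a cost per diffusion step. The sketch below is an assumption for illustration, not a description of the model's actual timing, and every number in it is hypothetical.

```python
def segment_latency_ms(steps: int, per_step_ms: float, fixed_ms: float) -> float:
    """Assumed linear model: fixed overhead plus a constant cost per step."""
    return fixed_ms + steps * per_step_ms


def relative_reduction(default_steps: int, per_step_ms: float, fixed_ms: float) -> float:
    """Fraction of per-segment latency saved by halving the step count."""
    full = segment_latency_ms(default_steps, per_step_ms, fixed_ms)
    halved = segment_latency_ms(default_steps // 2, per_step_ms, fixed_ms)
    return 1.0 - halved / full


# Hypothetical numbers: 20 default steps at 10 ms each.
print(round(relative_reduction(20, 10.0, fixed_ms=0.0), 3))   # no fixed overhead
print(round(relative_reduction(20, 10.0, fixed_ms=30.0), 3))  # 30 ms fixed overhead
```

Under this model the saving is exactly half only when there is no fixed overhead; any per-segment cost that does not scale with the step count pulls the figure below 50%.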

Handle backpressure

If audio arrives faster than you can play it, bound your buffer instead of letting it grow:

import asyncio

async def stream_with_backpressure(client, text, model):
    # Bounded queue: put() suspends the producer once 10 chunks are waiting
    buffer = asyncio.Queue(maxsize=10)

    async def producer():
        try:
            async for chunk in client.tts.stream_async(text=text, model=model):
                if hasattr(chunk, 'audio'):
                    await buffer.put(chunk.audio)
        finally:
            # Always signal end, even on error, so the consumer can exit
            await buffer.put(None)

    async def consumer():
        while True:
            audio = await buffer.get()
            if audio is None:
                break
            play_audio(audio)
            # Simulate playback time (assumes 16-bit mono PCM at 24 kHz)
            await asyncio.sleep(len(audio) / 2 / 24000)

    await asyncio.gather(producer(), consumer())

Common mistakes

Per-segment flush=true. Every flush is a hard segment boundary that forces a fresh model prefill and pays the full model TTFA. If you flush after every sentence, you pay it N times per turn instead of once.
One session per sentence. A new WebSocket handshake plus a fresh model prefill, every sentence. Keep the same session open for the whole assistant turn; only end it when the turn ends — see Turn lifecycle.
Client-side sentence buffering before send. Unnecessary — the server already buffers tokens and chunks at sentence boundaries. Pre-buffering on the client just adds latency.
Calling send(text, flush=true) per word “for lower latency.” It does the opposite: each flush is a separate model call. Word-granular flushing produces the worst possible TTFA.

If you’re migrating from ElevenLabs, the flush semantics are the biggest behavioral difference — see the ElevenLabs migration guide.

Next steps

Latency

Turn lifecycle
