How server-side chunking works, why client-side flushing destroys time-to-first-audio (TTFA), and the knobs that tune both.
/ws/tts/stream is one logical TTS request per turn, regardless of how many
send calls you make. The server’s text buffer accumulates tokens and hands a
complete chunk to the model the moment it sees a natural boundary (sentence
punctuation, or the configured chunk_length_schedule threshold). Inside a
single turn, model state (KV cache, voice conditioning) is preserved across
chunks so prosody stays natural.

Calling flush=true mid-turn breaks that flow: the server treats the flush as
a hard segment boundary, runs another full model prefill on whatever has been
buffered, and only then emits audio. The cost of that prefill is the full
model time-to-first-audio (see Latency) — the same cost you pay on
the very first chunk of a turn. Do it on every word and you pay model TTFA on
every word.
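
To see why per-word flushing multiplies model calls, here is a toy model of the buffering behavior described above. It is a sketch under stated assumptions, not the server's implementation: the class ToyTextBuffer, its boundary regex, and the rule "emit at a sentence boundary once the scheduled minimum is buffered" are hypothetical simplifications.

```python
import re
from dataclasses import dataclass, field

# Assumption: a "natural boundary" is buffered text ending in ., ! or ?.
# The real server's boundary detection is not specified in this guide.
SENTENCE_END = re.compile(r"[.!?]\s*$")


@dataclass
class ToyTextBuffer:
    """Hypothetical stand-in for the server-side text buffer."""

    schedule: list[int] = field(default_factory=lambda: [5, 80, 150, 250])
    buffer: str = ""
    chunks: list[str] = field(default_factory=list)

    def _min_chars(self) -> int:
        # Entry i applies to chunk i; the last value repeats.
        return self.schedule[min(len(self.chunks), len(self.schedule) - 1)]

    def send(self, text: str, flush: bool = False) -> None:
        self.buffer += text
        at_boundary = SENTENCE_END.search(self.buffer) is not None
        long_enough = len(self.buffer.strip()) >= self._min_chars()
        # A flush forces a segment boundary regardless of what is buffered.
        if flush or (at_boundary and long_enough):
            self._emit()

    def _emit(self) -> None:
        text = self.buffer.strip()
        if text:
            self.chunks.append(text)
        self.buffer = ""


def segments_sent_to_model(tokens: list[str], flush_each: bool) -> int:
    """Count how many segments reach the model for one turn."""
    buf = ToyTextBuffer()
    for token in tokens:
        buf.send(token, flush=flush_each)
    buf.send("", flush=True)  # single end-of-turn flush
    return len(buf.chunks)


TOKENS = ["Hello", " there.", " How", " are", " you", " today?", " I", " am", " fine."]
print(segments_sent_to_model(TOKENS, flush_each=False))  # 2
print(segments_sent_to_model(TOKENS, flush_each=True))   # 9
```

In this toy run, the unflushed stream produces two chunks inside one turn, while flushing after every token produces nine hard segment boundaries, each of which would pay the full model TTFA.
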
If you’re driving the session from a layer above raw LLM tokens (for example,
a translation pipeline that emits clauses, or a router that batches output
before sending), use the largest chunks you can. The ordering, from best to
worst time-to-first-audio per emitted segment, is:

| Chunk granularity | Verdict |
| --- | --- |
| Full turn in one send | Best possible. Use when the full text is available before TTS starts. |
| Sentence-level chunks | Recommended for streamed LLM output. |
| ≥20-character chunks | Acceptable fallback when sentence boundaries aren’t yet available. |
| Clause-level (comma/semicolon) | Avoid. Each chunk pays model TTFA. |
| Word-level or sub-word | Don’t. Each chunk pays model TTFA, making this by far the most expensive shape. |

Two important nuances:
- Raw LLM tokens are fine as long as you send them without flush=true:
  the server’s text buffer reassembles them and hands only sentence-sized
  work to the model. The “word-level is bad” row above applies when you
  flush after each word, not when you send one word at a time without
  flushing.
- We deliberately don’t publish exact ms figures here because they depend
  on region, voice, and deployment. The ordering is stable; the absolute
  numbers aren’t. To reproduce the comparison for your own deployment, run
  TTFABench.chunkingStrategyBench against your endpoint; see
  Measuring TTFA correctly.
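
The first nuance can be sketched as a small driver loop: forward every token as it arrives, never flush per token, and flush once when the turn ends. The send/flush call shape mirrors the session examples in this guide; FakeSession, AudioChunk, and fake_llm_stream are hypothetical stand-ins so the sketch runs without a server, and the fake's sentence detection is an assumption.

```python
import asyncio
from collections.abc import AsyncIterator, Callable
from dataclasses import dataclass


@dataclass
class AudioChunk:
    audio: bytes


class FakeSession:
    """Hypothetical stand-in for a streaming session (no network involved)."""

    def __init__(self) -> None:
        self.buffer = ""
        self.flushes = 0

    async def send(self, text: str) -> AsyncIterator[AudioChunk]:
        # Buffer the text; emit "audio" only once a sentence is complete.
        self.buffer += text
        if self.buffer.rstrip().endswith((".", "!", "?")):
            yield self._synthesize()

    async def flush(self) -> AsyncIterator[AudioChunk]:
        # Force out whatever text is still buffered.
        self.flushes += 1
        if self.buffer.strip():
            yield self._synthesize()

    def _synthesize(self) -> AudioChunk:
        chunk = AudioChunk(audio=self.buffer.strip().encode())
        self.buffer = ""
        return chunk


async def speak_turn(session, tokens: AsyncIterator[str],
                     play: Callable[[bytes], None]) -> None:
    async for token in tokens:
        async for chunk in session.send(token):  # no flush per token
            play(chunk.audio)
    async for chunk in session.flush():          # one flush, at end of turn
        play(chunk.audio)


async def fake_llm_stream() -> AsyncIterator[str]:
    for token in ["Sure", ",", " I", " can", " help", ".", " What", " next"]:
        yield token


played: list[bytes] = []
session = FakeSession()
asyncio.run(speak_turn(session, fake_llm_stream(), played.append))
print(played, session.flushes)
```

Eight tokens go in, but the fake session is flushed exactly once, and the trailing fragment without punctuation is only released by that end-of-turn flush.
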
You rarely need this, but two config parameters let you trade prosody context
for lower first-chunk latency, without any client-side flushing:

| Parameter | Type | Default | Effect |
| --- | --- | --- | --- |
| chunk_length_schedule / chunkLengthSchedule | list[int] | [5, 80, 150, 250] | Minimum chars buffered before each successive chunk is emitted. Entry i applies to chunk i; the last value repeats. Smaller = faster TTFA; larger = better prosody. |
| auto_mode / autoMode | bool | false | Start at the very first clean sentence boundary (equivalent to ElevenLabs auto_mode=true). Lowest TTFA. |

Use the defaults unless you’ve measured a problem.

    # Low-latency preset (voice assistants, chatbots)
    async with client.tts.streaming_session(
        voice_id=1071,
        auto_mode=True,
        chunk_length_schedule=[50, 100, 150, 250],
    ) as session:
        async for token in llm_stream:
            async for chunk in session.send(token):
                play_audio(chunk.audio)
        # Flush once, after the last token of the turn has been sent.
        async for chunk in session.flush():
            play_audio(chunk.audio)

    # High-quality preset (narration, long-form)
    async with client.tts.streaming_session(
        voice_id=1071,
        chunk_length_schedule=[120, 200, 300],
    ) as session:
        ...

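
The schedule lookup rule from the parameter description (entry i applies to chunk i; the last value repeats) can be expanded by hand to compare schedules. The helper name min_chars_per_chunk is hypothetical and exists only for illustration; it is not part of any SDK.

```python
def min_chars_per_chunk(schedule: list[int], n_chunks: int) -> list[int]:
    """Return the minimum buffered characters for chunks 0..n_chunks-1."""
    if not schedule:
        raise ValueError("schedule must contain at least one entry")
    # Past the end of the schedule, the last entry keeps applying.
    return [schedule[min(i, len(schedule) - 1)] for i in range(n_chunks)]


print(min_chars_per_chunk([5, 80, 150, 250], 6))  # [5, 80, 150, 250, 250, 250]
print(min_chars_per_chunk([120, 200, 300], 6))    # [120, 200, 300, 300, 300, 300]
```

Only the first entry affects first-chunk latency; later entries shape how much prosody context each subsequent chunk carries.
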
For real-time voice agents, per-segment latency (time from sentence boundary
to first audio of the next sentence) matters as much as initial TTFA. Two
parameters let you trade audio quality for speed:

| Parameter | Type | Effect |
| --- | --- | --- |
| optimize_streaming_latency | bool | Halve the default diffusion steps for faster per-segment audio. Default: false. |
| num_diffusion_steps / numDiffusionSteps | int | Explicit override for diffusion denoising steps (1-50). Lower = faster but lower quality. |


    # Fastest per-segment latency (voice agents, real-time conversations)
    async with client.tts.streaming_session(
        voice_id=1071,
        auto_mode=True,
        optimize_streaming_latency=True,
    ) as session:
        async for token in llm_stream:
            async for chunk in session.send(token):
                play_audio(chunk.audio)
        async for chunk in session.flush():
            play_audio(chunk.audio)

    # Fine-tuned control: explicit diffusion steps
    async with client.tts.streaming_session(
        voice_id=1071,
        auto_mode=True,
        num_diffusion_steps=5,  # fewer steps = lower latency
    ) as session:
        ...

optimize_streaming_latency typically reduces per-segment latency by
~40-50% with a modest quality trade-off that is acceptable for real-time
voice conversations. For maximum quality (narration, podcasts), leave it
disabled.
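
If these options come from user or environment configuration, a small client-side check can catch out-of-range values before a session is opened. This is a sketch only: latency_kwargs is a hypothetical helper, it treats the documented 1-50 range as inclusive (an assumption), and it says nothing about how the server resolves the case where both options are set.

```python
from typing import Optional

MIN_STEPS, MAX_STEPS = 1, 50  # documented range; assumed inclusive


def latency_kwargs(optimize_streaming_latency: bool = False,
                   num_diffusion_steps: Optional[int] = None) -> dict:
    """Build session keyword arguments, failing fast on invalid values."""
    kwargs: dict = {}
    if optimize_streaming_latency:
        kwargs["optimize_streaming_latency"] = True
    if num_diffusion_steps is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(num_diffusion_steps, bool) or not isinstance(num_diffusion_steps, int):
            raise TypeError("num_diffusion_steps must be an int")
        if not MIN_STEPS <= num_diffusion_steps <= MAX_STEPS:
            raise ValueError(
                f"num_diffusion_steps must be between {MIN_STEPS} and {MAX_STEPS}"
            )
        kwargs["num_diffusion_steps"] = num_diffusion_steps
    return kwargs


print(latency_kwargs(num_diffusion_steps=5))           # {'num_diffusion_steps': 5}
print(latency_kwargs(optimize_streaming_latency=True))  # {'optimize_streaming_latency': True}
```

The resulting dictionary could then be unpacked into the session call, for example streaming_session(voice_id=..., **latency_kwargs(...)).
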

Patterns to avoid:

- Per-segment flush=true. Every flush is a fresh TTS request that pays
  the full model TTFA. If you flush after every sentence, you pay it N times
  per turn instead of once.
- One session per sentence. This costs a new WebSocket handshake plus a
  fresh model prefill for every sentence. Keep the same session open for the
  whole assistant turn and end it only when the turn ends; see
  Turn lifecycle.
- Client-side sentence buffering before send. This is unnecessary: the
  server already buffers tokens and chunks at sentence boundaries.
  Pre-buffering on the client just adds latency.
- Calling send(text, flush=true) per word “for lower latency.” It does
  the opposite: each flush is a separate model call. Word-granular flushing
  produces the worst possible TTFA.

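
The cost of each pattern can be counted in units of "one WebSocket handshake" and "one full model TTFA" rather than milliseconds. The sketch below only restates the counting argument from this guide; TurnCost, turn_cost, and the strategy names are hypothetical labels, and no real latency figures are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnCost:
    handshakes: int      # WebSocket connections opened for the turn
    model_prefills: int  # times the full model TTFA is paid


def turn_cost(words_per_sentence: list[int], strategy: str) -> TurnCost:
    """Count handshakes and full prefills for one assistant turn."""
    sentences = len(words_per_sentence)
    words = sum(words_per_sentence)
    if strategy == "one_session_flush_at_end":
        return TurnCost(handshakes=1, model_prefills=1)
    if strategy == "flush_per_sentence":
        return TurnCost(handshakes=1, model_prefills=sentences)
    if strategy == "flush_per_word":
        return TurnCost(handshakes=1, model_prefills=words)
    if strategy == "session_per_sentence":
        return TurnCost(handshakes=sentences, model_prefills=sentences)
    raise ValueError(f"unknown strategy: {strategy}")


turn = [8, 12, 5]  # a three-sentence turn, 25 words in total
for name in ("one_session_flush_at_end", "flush_per_sentence",
             "flush_per_word", "session_per_sentence"):
    print(name, turn_cost(turn, name))
```

For this three-sentence turn, the recommended shape pays one prefill, per-sentence flushing pays three, per-word flushing pays twenty-five, and one session per sentence adds three handshakes on top of three prefills.
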
If you’re migrating from ElevenLabs, the flush semantics are the biggest
behavioral difference — see the
ElevenLabs migration guide.