On /ws/tts/stream (and via every SDK streaming session), one turn = one
assistant utterance = one backend voice session. The WebSocket connection
itself persists across turns; the turn is the unit that begins, produces
audio, ends with session_closed, and gets billed.
Getting the turn boundary right is the difference between a smooth agent and
one that pauses mid-sentence or splits a reply in two.
What a turn is
| Concept | Lifetime | Created by | Ended by |
|---|---|---|---|
| WebSocket connection | Minutes–hours; reused | connect() / first request | {"close_socket": true} or disconnect |
| Turn (session) | One assistant utterance | First text after connect or after the previous session_closed | flush (recommended), idle auto-flush, or close |
Config (voice_id, sample rate, etc.) is sticky from the previous turn unless you
send a new config message.
What triggers synthesis
You do not control when audio generation starts; the server does. It buffers the text you send and hands a chunk to the model at natural sentence boundaries (or when the configured chunk_length_schedule
threshold is reached). Inside one turn, model state (KV cache, voice
conditioning) carries across chunks, so prosody stays natural.
This is why you send LLM tokens directly without flushing: the buffer
reassembles them and the model only ever sees sentence-sized work.
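To make the buffering idea concrete, here is a toy model of sentence-boundary chunking. It is only an illustration of the principle, not the server's actual algorithm: the `SentenceBuffer` class and its regex are hypothetical, and the real server also honours chunk_length_schedule, which this sketch ignores.

```python
import re

# Toy rule (an assumption): a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]+\s")


class SentenceBuffer:
    """Accumulates streamed text fragments and releases whole sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, text: str) -> list[str]:
        """Add a fragment; return any sentences that are now complete."""
        self._pending += text
        chunks = []
        while True:
            match = _SENTENCE_END.search(self._pending)
            if match is None:
                break
            end = match.end()
            chunks.append(self._pending[:end].strip())
            self._pending = self._pending[end:]
        return chunks

    def flush(self) -> list[str]:
        """Release whatever is left, as happens when a turn ends."""
        rest, self._pending = self._pending.strip(), ""
        return [rest] if rest else []
```

Feeding token-sized fragments such as "Hel", "lo the", "re. How " releases "Hello there." only once the sentence is complete, which is why client-side sentence buffering adds nothing.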
How a turn ends
A turn ends in exactly one of three ways:
1. flush: recommended
Send {"flush": true}. The server synthesizes any remaining buffered text, then sends
{"final": true} (the end-of-audio signal),
followed by session_closed. The socket stays
open for the next turn. End every turn with an explicit flush: it is the
lowest-latency path and the only deterministic one.
2. Idle auto-flush: the 5-second timeout
If you stream text but never flush, the server ends the turn for you after ~5 seconds without new text. You’ll receive:
- a warning frame: "Turn ended after 5s of inactivity. Send {\"flush\": true} to end a turn explicitly — it lowers latency and avoids this auto-flush."
- the remaining buffered text, synthesized and streamed,
- session_closed.
3. close: graceful end
Behaves like flush, but is explicit about intent: flush buffered text, drain audio, end
the session. {"end_session": true} is an alias. The WebSocket stays open.
(To also close the connection, send {"close_socket": true}.)
To abandon a turn instead of finishing it — user interrupted, drop the
buffered text — that’s barge-in, not close.
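The recommended ending (stream text as it arrives, then send exactly one flush) can be sketched as a small generator of outbound frames. The {"flush": true} frame is the documented one; the {"text": ...} shape and the `turn_messages` name are assumptions for illustration, so check the Stream Input API reference for the real text-frame format.

```python
import json
from typing import Iterable, Iterator


def turn_messages(tokens: Iterable[str]) -> Iterator[str]:
    """Yield the JSON frames for one turn: text as it arrives, then one flush."""
    sent_text = False
    for token in tokens:
        if not token:
            continue  # nothing to synthesize in an empty token
        # Hypothetical text-frame shape (assumption, not confirmed by the docs).
        yield json.dumps({"text": token})
        sent_text = True
    if sent_text:
        # Documented end-of-turn signal, sent once per turn.
        yield json.dumps({"flush": True})
```

Each yielded string would be sent over the already-open WebSocket; no flush is emitted if no text was sent, since no turn was started.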
final vs session_closed
Every gracefully ended turn closes with two frames, in order:
- final: end of audio (the ElevenLabs isFinal equivalent). Sent right after the turn’s last audio frame. Once you receive it, no further audio for the turn will arrive; key your “audio is done” logic (stop waiting, end playback, hang up the call) on this frame.
- session_closed: end of turn, carrying the usage block for billing.
final is sent on every graceful turn end (flush, idle auto-flush, or
close) but not after a cancel — a barge-in
acknowledges with interrupted instead. Note that on the single-request
/ws/tts endpoint, final is the request-complete message and carries the
usage itself — see the
Stream Speech reference. Full
streaming message reference:
Stream Input API.
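A client-side tracker for these frames might look like the sketch below. {"final": true} and {"warning": "..."} are shapes given in this guide; treating session_closed and interrupted as boolean keys, and reading usage from a "usage" key, are assumptions made by analogy. `TurnState` and `apply_frame` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnState:
    audio_done: bool = False    # no more audio will arrive for this turn
    turn_done: bool = False     # the turn is over (gracefully or by barge-in)
    interrupted: bool = False   # the turn was cancelled, so no final frame
    usage: Optional[dict] = None
    warnings: list = field(default_factory=list)


def apply_frame(state: TurnState, raw: str) -> TurnState:
    """Update the turn state from one JSON text frame."""
    frame = json.loads(raw)
    if "warning" in frame:
        state.warnings.append(frame["warning"])  # non-fatal; log it
    if frame.get("final"):
        state.audio_done = True
    if frame.get("session_closed"):   # assumed key name
        state.turn_done = True
        state.usage = frame.get("usage")
    if frame.get("interrupted"):      # assumed key name
        state.audio_done = True
        state.turn_done = True
        state.interrupted = True
    return state
```

Handling interrupted alongside final is the point of the design: a loop that waits only for final never terminates on the barge-in path.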
The warning frame
{"warning": "..."} is a non-fatal advisory; the socket stays open. Today it
is emitted in one case: the turn was auto-ended by the idle timeout because no
flush was sent. Log it — in a correctly built integration it should never
appear.
Who is responsible for what
| You are responsible for | The server is responsible for |
|---|---|
| Sending text as it becomes available (no client-side sentence buffering) | Buffering text and choosing sentence-boundary chunks |
| Sending flush exactly once, at the end of each turn | Starting synthesis and preserving prosody across chunks within the turn |
| Not leaving a turn idle >5 s mid-utterance | Ending idle turns (after a warning) so sessions can’t leak |
| Keying “audio done” on final and billing on session_closed | Emitting final after the last audio frame, then session_closed with usage, on every graceful turn end |
| Playback, buffering, and barge-in decisions | Cancelling generation when you barge in |
Session reuse
WebSocket connections are reused across turns, so consecutive turns skip the connection handshake entirely (see Latency for what that saves). After session_closed, just send the next turn’s text;
the config (voice_id, audio format, etc.) is sticky. Send a new config
message only to change something (e.g. a different voice).
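One way to honour the sticky config on the client is to remember what was last sent and emit a config message only when something differs. This is a minimal sketch: the {"config": {...}} wire shape, the `StickyConfig` class, and the voice IDs are all hypothetical, and whether the server accepts a full or partial config is not stated here.

```python
import json
from typing import Optional


class StickyConfig:
    """Tracks the last config sent on this connection."""

    def __init__(self) -> None:
        self._sent: dict = {}

    def message_for(self, desired: dict) -> Optional[str]:
        """Return a config frame if `desired` differs from what was sent, else None."""
        if desired == self._sent:
            return None  # config is sticky; nothing to send
        self._sent = dict(desired)
        # Hypothetical frame shape (assumption); see the Stream Input API reference.
        return json.dumps({"config": desired})
```

Call `message_for` before the first text of each turn and send the result only when it is not None.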
Per-session usage
Every session_closed frame carries a usage object summarising what the
turn consumed and what it was charged — useful for per-conversation billing
of your own customers:
| Field | Description |
|---|---|
| audio_seconds | Total audio generated this turn (the unit we bill on) |
| characters | Total input characters submitted this turn |
| cost_cents | Actual amount charged, in EUR cents. null if the charge could not be determined (see below) |
| currency | Currency of cost_cents ("eur"); present only when cost_cents is set |
| model_id | Model that produced the audio |
If the charge could not be determined, cost_cents is null and the object carries "cost_unavailable": true
instead of a misleading 0. audio_seconds is always reported, so you can
still reconcile usage from the audio you received.
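For per-conversation billing, the usage objects can be summed turn by turn. The sketch below uses only the fields described above; the `ConversationBill` class is a hypothetical helper, and it counts turns without a known cost separately rather than treating them as zero.

```python
from dataclasses import dataclass


@dataclass
class ConversationBill:
    audio_seconds: float = 0.0
    cost_cents: float = 0.0
    turns_without_cost: int = 0  # turns where the charge could not be determined

    def add(self, usage: dict) -> None:
        """Fold one turn's usage object into the running totals."""
        # audio_seconds is always reported, so it is safe to index directly.
        self.audio_seconds += usage["audio_seconds"]
        cost = usage.get("cost_cents")
        if cost is None or usage.get("cost_unavailable"):
            self.turns_without_cost += 1
        else:
            self.cost_cents += cost
```

A non-zero `turns_without_cost` signals that the cost total is incomplete and should be reconciled from `audio_seconds`.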
On /ws/tts/multi, usage is reported per context — the usage object is
attached to each context_closed frame (each context is its own
conversation), carrying that context’s audio_seconds, cost_cents, and
currency. The multi session_closed frame is just a session-end signal and
does not carry a usage block. (Per-context characters is not reported;
derive it client-side if needed.) See
Multi-context streaming.
The single-request /ws/tts endpoint carries the same usage object on its
final message (per request rather than per turn) — see the
Stream Speech API reference.
Troubleshooting
The reply stops early, at a sentence boundary
Almost always the idle auto-flush. Look for the
warning frame in your
message log — if it’s there, your client streamed part of the text, then
went >5 s without sending more or flushing, so the server ended the turn.
The most common causes: a slow LLM/tool call mid-turn, or a missing final
flush. Note the turn split also splits billing: you’ll see two
session_closed frames, each with its own usage.
My client hangs waiting for the end of the stream
Check what you’re waiting for: a graceful turn ends with
final (end of
audio) followed by session_closed (turn end + usage). A turn ended by
cancel (barge-in) emits neither — it acknowledges with
interrupted. If you only handle final, a barge-in path will hang.
I send pings but the turn still ends after 5 s
By design: ping/keep-alive frames don’t reset the idle timer — only new
text or
flush does. The timer exists to reap abandoned turns; a ping
proves the socket is alive, not that the turn is still being written.
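The timer rule can be checked against a client-side send log with a small simulation. This is a sketch of the behaviour as described here, not server code: the `auto_flush_time` function and the event tuples are hypothetical, the threshold is the documented approximate 5 seconds, and only a single turn is modelled.

```python
IDLE_TIMEOUT_S = 5.0  # approximate; the server timeout is described as ~5 seconds


def auto_flush_time(events):
    """Return when the idle auto-flush would end the turn, or None.

    `events` is a time-ordered list of (timestamp_seconds, kind) tuples,
    where kind is "text", "ping", or "flush". Returns None if the turn was
    flushed explicitly or never started.
    """
    last_text = None
    for ts, kind in events:
        if last_text is not None and ts - last_text > IDLE_TIMEOUT_S:
            return last_text + IDLE_TIMEOUT_S  # the timer fired before this event
        if kind == "flush":
            return None  # the turn ended explicitly
        if kind == "text":
            last_text = ts  # only new text restarts the idle timer
        # "ping" is deliberately ignored: it does not reset the timer
    # No flush was ever sent: the turn is auto-ended after the last text.
    return None if last_text is None else last_text + IDLE_TIMEOUT_S
```

Running it over a log with pings at 2 s and 4 s but no text between 0 s and 6 s reports an auto-flush at 5 s, matching the symptom described above.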