On /ws/tts/stream (and via every SDK streaming session), one turn = one assistant utterance = one backend voice session. The WebSocket connection itself persists across turns; the turn is the unit that begins, produces audio, ends with session_closed, and gets billed. Getting the turn boundary right is the difference between a smooth agent and one that pauses mid-sentence or splits a reply in two.

What a turn is

| Concept | Lifetime | Created by | Ended by |
| --- | --- | --- | --- |
| WebSocket connection | Minutes–hours; reused | connect() / first request | {"close_socket": true} or disconnect |
| Turn (session) | One assistant utterance | First text after connect or after the previous session_closed | flush (recommended), idle auto-flush, or close |
The first text message after the connection opens (or after the previous turn ends) starts a new turn on a fresh backend voice session. The configuration (voice_id, sample rate, etc.) is sticky from the previous turn unless you send a new config message.

What triggers synthesis

You do not control when audio generation starts — the server does. It buffers the text you send and hands a chunk to the model at natural sentence boundaries (or when the configured chunk_length_schedule threshold is reached). Inside one turn, model state (KV cache, voice conditioning) carries across chunks, so prosody stays natural. This is why you send LLM tokens directly without flushing: the buffer reassembles them and the model only ever sees sentence-sized work.
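To make the buffering idea concrete, here is a minimal client-side sketch of sentence-boundary chunking. It is an illustration only, not the server's actual implementation: the class name SentenceBuffer and the boundary rule (., ! or ? followed by whitespace) are assumptions, and chunk_length_schedule is not modelled.

```python
import re
from typing import List

# Assumed boundary rule for illustration: ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


class SentenceBuffer:
    """Hypothetical buffer: accumulates tokens, releases complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def push(self, token: str) -> List[str]:
        """Add a token and return any sentences that are now complete."""
        self._pending += token
        sentences: List[str] = []
        while True:
            match = _SENTENCE_END.search(self._pending)
            if match is None:
                break
            end = match.end()
            sentences.append(self._pending[:end].strip())
            self._pending = self._pending[end:]
        return sentences

    def flush(self) -> str:
        """Return trailing text that never reached a sentence boundary."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

Token fragments such as "Hel" and "lo there. " come out as the single chunk "Hello there.", which is why token-by-token sending does not hurt what the model sees.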

How a turn ends

A turn ends in exactly one of three ways:

1. flush (recommended)

{"flush": true}
Emits any trailing buffered text and streams its audio. Once the turn’s last audio frame has been delivered, the server sends {"final": true} (the end-of-audio signal), followed by session_closed. The socket stays open for the next turn. End every turn with an explicit flush: it is the lowest-latency path and the only deterministic one.
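As a rough sketch of the client side of this, the helper below builds the frames for one turn: each token as it arrives, then a single flush. The function name turn_frames and the "text" field name for text messages are assumptions (check the Stream Input API reference for the exact message shape); only {"flush": true} is taken from this page.

```python
import json
from typing import Iterable, Iterator


def turn_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Yield JSON frames for one turn: text as it arrives, then one flush."""
    sent_any = False
    for token in tokens:
        if not token:
            continue  # empty tokens carry no text, so skip them
        # "text" is an assumed field name for text messages.
        yield json.dumps({"text": token})
        sent_any = True
    if sent_any:
        # Exactly one flush, at the end of the turn.
        yield json.dumps({"flush": True})
```

Each yielded string would be sent as one WebSocket message. No flush is emitted when no text was sent, since no turn was started.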

2. Idle auto-flush — the 5-second timeout

If you stream text but never flush, the server ends the turn for you after ~5 seconds without new text. You’ll receive:
  1. a warning frame — "Turn ended after 5s of inactivity. Send {\"flush\": true} to end a turn explicitly — it lowers latency and avoids this auto-flush."
  2. the remaining buffered text synthesized and streamed,
  3. session_closed.
WebSocket ping/keep-alive frames do NOT reset the 5 s timer. Only new text or a flush does. If your LLM stalls for more than ~5 s mid-turn (slow tool call, long reasoning pause) and you’ve already sent partial text, the turn ends at that point — and the rest of the reply becomes a second turn (billed as a separate session). If you can’t avoid long mid-turn gaps, hold the text client-side until you’re ready to stream it.
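One way to notice an approaching auto-flush is to track the time since the last text was sent. The sketch below is a hypothetical client-side helper, not part of any SDK: the IdleWatch name, the one-second safety margin, and the injected clock are all assumptions, and the 5-second limit is the approximate value stated above.

```python
import time
from typing import Callable, Optional

IDLE_LIMIT_S = 5.0  # approximate server idle timeout described above


class IdleWatch:
    """Hypothetical helper: tracks time since the last text in a turn."""

    def __init__(
        self,
        margin_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._clock = clock
        self._budget = IDLE_LIMIT_S - margin_s
        self._last_text: Optional[float] = None  # None means no open turn

    def on_text_sent(self) -> None:
        """Call after every text message; pings must NOT call this."""
        self._last_text = self._clock()

    def on_turn_ended(self) -> None:
        """Call after sending flush/close or receiving session_closed."""
        self._last_text = None

    def at_risk(self) -> bool:
        """True when an open turn is close to the idle timeout."""
        if self._last_text is None:
            return False
        return self._clock() - self._last_text >= self._budget
```

When at_risk() returns True, the caller can log it or send the flush itself rather than wait for the server to end the turn.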

3. close — graceful end

{"close": true}
Like flush, but explicit about intent: flush buffered text, drain audio, end the session. {"end_session": true} is an alias. The WebSocket stays open. (To also close the connection, send {"close_socket": true}.) To abandon a turn instead of finishing it — user interrupted, drop the buffered text — that’s barge-in, not close.

final vs session_closed

Every gracefully ended turn closes with two frames, in order:
  1. final — end of audio (the ElevenLabs isFinal equivalent). Sent right after the turn’s last audio frame. Once you receive it, no further audio for the turn will arrive — key your “audio is done” logic (stop waiting, end playback, hang up the call) on this frame.
{
  "final": true,
  "total_audio_seconds": 5.4,
  "total_text_chunks": 3,
  "total_audio_chunks": 15
}
  2. session_closed: end of turn, carrying the usage block for billing:
{
  "session_closed": true,
  "total_audio_seconds": 5.4,
  "total_text_chunks": 3,
  "total_audio_chunks": 15,
  "usage": { "audio_seconds": 5.4, "characters": 142, "cost_cents": 0.49, "currency": "eur", "model_id": "kugel-3" }
}
final is sent on every graceful turn end (flush, idle auto-flush, or close) but not after a cancel — a barge-in acknowledges with interrupted instead. Note that on the single-request /ws/tts endpoint, final is the request-complete message and carries the usage itself — see the Stream Speech reference. Full streaming message reference: Stream Input API.
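The frame ordering above can be handled with a small dispatcher. This is a minimal sketch under stated assumptions: TurnState and handle_frame are hypothetical names, and the shape of the interrupted frame is not shown on this page, so the code only assumes it contains an "interrupted" key.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnState:
    """Hypothetical per-turn bookkeeping on the client."""
    audio_done: bool = False      # set by final
    closed: bool = False          # set by session_closed
    interrupted: bool = False     # set by a barge-in acknowledgement
    usage: Optional[dict] = None  # usage block from session_closed
    warnings: List[str] = field(default_factory=list)


def handle_frame(state: TurnState, raw: str) -> TurnState:
    """Update the turn state from one JSON control frame."""
    frame = json.loads(raw)
    if frame.get("final"):
        state.audio_done = True   # no more audio for this turn
    elif frame.get("session_closed"):
        state.closed = True       # turn is over; bill from usage
        state.usage = frame.get("usage")
    elif "interrupted" in frame:  # assumed key for the barge-in ack
        state.interrupted = True
    elif "warning" in frame:
        state.warnings.append(frame["warning"])
    return state
```

A turn is finished for playback purposes when either audio_done or interrupted is set, so waiting on final alone is not enough on a barge-in path.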

The warning frame

{"warning": "..."} is a non-fatal advisory; the socket stays open. Today it is emitted in one case: the turn was auto-ended by the idle timeout because no flush was sent. Log it — in a correctly built integration it should never appear.

Who is responsible for what

| You are responsible for | The server is responsible for |
| --- | --- |
| Sending text as it becomes available (no client-side sentence buffering) | Buffering text and choosing sentence-boundary chunks |
| Sending flush exactly once, at the end of each turn | Starting synthesis and preserving prosody across chunks within the turn |
| Not leaving a turn idle >5 s mid-utterance | Ending idle turns (after a warning) so sessions can’t leak |
| Keying “audio done” on final and billing on session_closed | Emitting final after the last audio frame, then session_closed with usage, on every graceful turn end |
| Playback, buffering, and barge-in decisions | Cancelling generation when you barge in |

Session reuse

WebSocket connections are reused across turns, so consecutive turns skip the connection handshake entirely (see Latency for what that saves). After session_closed, just send the next turn’s text — the config (voice_id, audio format, etc.) is sticky. Send a new config message only to change something (e.g. a different voice):
async with client.tts.streaming_session(voice_id=1071) as session:
    # Turn 1
    async for chunk in session.send("Hello from voice one."):
        play_audio(chunk)
    await session.flush()

    # End the turn but keep the connection alive
    await session.end_session()

    # Change voice for turn 2 (no new WebSocket needed)
    session.update_config(voice_id=1072)
    async for chunk in session.send("Hello from voice two."):
        play_audio(chunk)

# close() is called by the context manager — closes session + WebSocket

Per-session usage

Every session_closed frame carries a usage object summarising what the turn consumed and what it was charged — useful for per-conversation billing of your own customers:
| Field | Description |
| --- | --- |
| audio_seconds | Total audio generated this turn (the unit we bill on) |
| characters | Total input characters submitted this turn |
| cost_cents | Actual amount charged, in EUR cents. null if the charge could not be determined (see below) |
| currency | Currency of cost_cents ("eur"); present only when cost_cents is set |
| model_id | Model that produced the audio |
If the charge cannot be determined at turn end (e.g. a transient billing error), cost_cents is null and the object carries "cost_unavailable": true instead of a misleading 0. audio_seconds is always reported, so you can still reconcile usage from the audio you received.

On /ws/tts/multi, usage is reported per context: the usage object is attached to each context_closed frame (each context is its own conversation), carrying that context’s audio_seconds, cost_cents, and currency. The multi session_closed frame is just a session-end signal and does not carry a usage block. (Per-context characters is not reported; derive it client-side if needed.) See Multi-context streaming.

The single-request /ws/tts endpoint carries the same usage object on its final message (per request rather than per turn); see the Stream Speech API reference.
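For per-conversation billing, the usage objects can be summed as they arrive. The sketch below is one possible approach, with hypothetical names (UsageTotals, add_usage); it counts turns whose cost_cents is null separately instead of treating them as zero, and tolerates a missing characters field as on /ws/tts/multi.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Hypothetical running totals for one conversation."""
    audio_seconds: float = 0.0
    characters: int = 0
    cost_cents: float = 0.0
    turns_without_cost: int = 0  # turns where cost_cents was null


def add_usage(totals: UsageTotals, usage: dict) -> UsageTotals:
    """Fold one usage object into the running totals."""
    # audio_seconds is always reported, so index it directly.
    totals.audio_seconds += usage["audio_seconds"]
    # characters is absent on per-context usage, so default to 0.
    totals.characters += usage.get("characters", 0)
    cost = usage.get("cost_cents")
    if cost is None:
        # Charge unknown: record the gap rather than adding a misleading 0.
        totals.turns_without_cost += 1
    else:
        totals.cost_cents += cost
    return totals
```

A non-zero turns_without_cost tells you the cost total is incomplete and should be reconciled from audio_seconds.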

Troubleshooting

A reply that pauses mid-sentence or splits into two turns is almost always caused by the idle auto-flush. Look for the warning frame in your message log: if it’s there, your client streamed part of the text, then went >5 s without sending more or flushing, so the server ended the turn. The most common causes are a slow LLM or tool call mid-turn, or a missing final flush. The turn split also splits billing: you’ll see two session_closed frames, each with its own usage.
If your client hangs waiting for a turn to finish, check what you’re waiting for: a graceful turn ends with final (end of audio) followed by session_closed (turn end + usage). A turn ended by cancel (barge-in) emits neither; it acknowledges with interrupted. If you only handle final, a barge-in path will hang.
If a turn times out even though you send pings, that is by design: ping/keep-alive frames don’t reset the idle timer; only new text or a flush does. The timer exists to reap abandoned turns, and a ping proves the socket is alive, not that the turn is still being written.