WebSocket
Connection
Protocol
- Send config (once): Initial configuration message. voice_id, audio format, and the other settings are sticky for the connection, so you do not re-send them on later turns.
- Send text: Text chunks for the current turn as they arrive.
- Send flush: Ends the turn: emits any trailing buffered text, streams its audio, then closes the turn’s session (session_closed). The socket stays open.
- Next turn: Send the next turn’s text (a fresh config is optional). Repeat. To end the whole connection, send close_socket.
- Receive audio: Audio chunks as they’re generated.
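
The send order in the list above can be sketched as a small frame builder. This is a minimal sketch under stated assumptions, not the service's client: the `text` key and the standalone `{"flush": true}` shape are guesses (this page only shows `flush: true` as a field and `{"close_socket": true}` verbatim), and `turn_frames` / `close_frame` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator, Optional


def turn_frames(chunks: Iterable[str], config: Optional[dict] = None) -> Iterator[str]:
    """Yield the JSON text frames for one turn, in the order they are sent."""
    if config is not None:
        # Config is only required once per connection; settings are sticky.
        yield json.dumps(config)
    for chunk in chunks:
        # ASSUMPTION: the text message carries its payload under "text".
        yield json.dumps({"text": chunk})
    # ASSUMPTION: a standalone flush message looks like {"flush": true}.
    yield json.dumps({"flush": True})


def close_frame() -> str:
    """Frame that ends the session and closes the WebSocket."""
    return json.dumps({"close_socket": True})


# First turn sends config; later turns on the same socket can omit it.
first_turn = list(turn_frames(["Hel", "lo."], config={"voice_id": "example-voice"}))
next_turn = list(turn_frames(["Second turn."]))
```

Each yielded string would be sent as one WebSocket text frame; receiving audio runs concurrently and is not shown here.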
One turn = one backend session. A turn ends when you send
flush (or after
a short idle gap — see below); each turn runs on its own freshly-prefilled
voice session. A text WebSocket frame is not a hard sentence boundary by
itself. For token streams, send raw tokens and flush once at the end of the
turn. If your application sends already-complete phrases without terminal
punctuation, include flush: true on that message or send a separate flush
message.

Idle turns auto-end after 5 seconds. If you stream text but never flush,
the server auto-flushes the buffered text after ~5 s of no new text, emits a
warning frame, and ends the turn. WebSocket ping/keep-alive frames
do not reset this timer; only sending flush (or new text) does. End each turn
with an explicit flush for the lowest latency and to avoid the auto-flush.
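
A client can mirror the idle window locally to decide when it should flush on its own. The following is a rough sketch only: `IdleWatch` is a hypothetical helper, the 5.0 s limit is the approximate figure quoted above rather than a guaranteed value, and the server's own clock remains authoritative.

```python
import time
from typing import Optional

IDLE_LIMIT_S = 5.0  # approximate server idle window quoted in the text


class IdleWatch:
    """Client-side estimate of time left before the server auto-ends a turn."""

    def __init__(self, limit_s: float = IDLE_LIMIT_S) -> None:
        self.limit_s = limit_s
        self._last_text: Optional[float] = None  # None means no open turn

    def on_text_sent(self, now: Optional[float] = None) -> None:
        # New text restarts the idle window.
        self._last_text = time.monotonic() if now is None else now

    def on_flush_sent(self) -> None:
        # An explicit flush ends the turn, so nothing is left to time out.
        self._last_text = None

    def on_ping_sent(self) -> None:
        # Ping/keep-alive frames do not reset the idle window.
        pass

    def seconds_left(self, now: Optional[float] = None) -> Optional[float]:
        if self._last_text is None:
            return None
        now = time.monotonic() if now is None else now
        return max(0.0, self.limit_s - (now - self._last_text))
```

Calling `seconds_left()` before each send lets the application issue its own flush while there is still margin, instead of relying on the auto-flush and its warning frame.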
Full lifecycle: Turn lifecycle.
Messages
Config Message
| Field | Type | Default | Description |
|---|---|---|---|
| temperature | number | 0.4 | Sampling variance (0.0–1.0). 0 = most stable, 1 = most variance. |
| flush_timeout_ms | integer | 500 | Auto-flush buffered text after this many ms of no new input. |
| max_buffer_length | integer | 1000 | Maximum characters buffered before a forced flush. |
| chunk_length_schedule | list[int] | [5, 80, 150, 250] | Minimum buffer size (chars) before each successive chunk auto-emits. Entry i applies to chunk i; the last value repeats. Smaller = lower TTFA; larger = better prosody. |
| auto_mode | boolean | false | Start generating at the first clean sentence boundary, ignoring chunk_length_schedule (equivalent to ElevenLabs auto_mode=true). Lowest TTFA. |
| dictionary_ids | integer[] | omitted | Per-request dictionary selection, sticky for the session. Omitted = all active dictionaries (language-filtered); [] = none; a list = exactly those (including inactive ones), bypassing the language filter. |
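
As an illustration of the table, a config message could be assembled as below. This is a sketch, not the official client: `build_config` and `min_chars_for_chunk` are hypothetical helpers, the defaults are copied from the table, and zero-based chunk indexing for `chunk_length_schedule` is an assumption.

```python
import json
from typing import Any, List

# Defaults copied from the table; dictionary_ids is left out on purpose,
# because omitting it selects all active dictionaries.
DEFAULTS = {
    "temperature": 0.4,
    "flush_timeout_ms": 500,
    "max_buffer_length": 1000,
    "chunk_length_schedule": [5, 80, 150, 250],
    "auto_mode": False,
}


def build_config(voice_id: str, **overrides: Any) -> str:
    """Return a JSON config message with table defaults plus overrides."""
    unknown = set(overrides) - set(DEFAULTS) - {"dictionary_ids"}
    if unknown:
        raise ValueError(f"unknown config fields: {sorted(unknown)}")
    cfg = {"voice_id": voice_id, **DEFAULTS, **overrides}
    if not 0.0 <= cfg["temperature"] <= 1.0:
        raise ValueError("temperature must be within 0.0 and 1.0")
    return json.dumps(cfg)


def min_chars_for_chunk(schedule: List[int], i: int) -> int:
    """Minimum buffered chars before chunk i emits; the last entry repeats.

    ASSUMPTION: chunk indices are zero-based.
    """
    return schedule[min(i, len(schedule) - 1)]
```

With the default schedule, the first chunk can emit after 5 characters, which keeps time to first audio low, while every chunk from the fourth onward waits for 250.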
Text Message
Flush Message
Close Message
End the current session; the WebSocket stays open and the server starts a fresh session on the next config / text message. {"end_session": true} is accepted as an alias. To end the session and close
the WebSocket connection, send {"close_socket": true} instead.
Cancel Message (barge-in)
Send {"cancel": true} to cancel the current turn. The server acknowledges with {"interrupted": true};
the socket stays open for the next turn. See Barge-in.
Response Messages
Generation Started
Audio Chunk
Word Timestamps (when word_timestamps: true)
Chunk Complete
Interrupted
Sent only in response to {"cancel": true}: the turn was cancelled and the
session is ready for the next turn.
Warning
Non-fatal advisory; the socket stays open. Currently emitted when a turn is auto-ended after the idle timeout because no flush was sent.
Final (End of Audio)
Sent after the last audio frame of every gracefully completed turn (explicit flush, close, or idle auto-flush), right before
session_closed. Once you receive it, no further audio for the turn will
arrive; it is the equivalent of ElevenLabs’ isFinal. It is not sent after a
cancel (barge-in); that path acknowledges with interrupted instead.
Use final to stop waiting for audio (e.g. to end playback or hang up a
call); use the session_closed frame that follows for usage/billing data.
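
The division of labour between final, interrupted, warning, and session_closed can be sketched as a small receive-side state tracker. Several details here are assumptions made for illustration: that every server frame is JSON with a `type` discriminator using these names, and that session_closed nests its billing data under `usage`. `TurnState` and `handle_frame` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnState:
    audio_done: bool = False   # no more audio will arrive for this turn
    closed: bool = False       # session_closed received
    interrupted: bool = False  # turn ended via cancel (barge-in)
    warnings: List[dict] = field(default_factory=list)
    usage: Optional[dict] = None


def handle_frame(state: TurnState, raw: str) -> TurnState:
    """Update the turn state from one server text frame."""
    msg = json.loads(raw)
    kind = msg.get("type")  # ASSUMPTION: frames carry a "type" field
    if kind == "final":
        # Graceful end of audio: safe to stop playback or hang up.
        state.audio_done = True
    elif kind == "interrupted":
        # Cancel path: no final frame is sent, so treat this as end of audio.
        state.interrupted = True
        state.audio_done = True
    elif kind == "warning":
        # Non-fatal advisory, e.g. an idle auto-flush; keep it for logging.
        state.warnings.append(msg)
    elif kind == "session_closed":
        state.closed = True
        state.usage = msg.get("usage")  # ASSUMPTION: billing data lives here
    return state
```

A playback loop would stop waiting once `audio_done` is set and read billing data only after `closed` becomes true.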
Session Closed
Sent at the end of every turn (on flush, idle auto-flush, or close). The
socket stays open for the next turn.
The usage object reports the session’s consumed audio time and the actual
amount charged (EUR cents) so you can bill per conversation; it has the same
fields as the /ws/tts final message.
cost_cents is null with cost_unavailable: true if the charge can’t be
determined (never a silent 0).
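
When summing charges per conversation, the null case should be kept separate from a real zero. A minimal sketch, assuming each usage object is a dict holding the `cost_cents` and `cost_unavailable` fields named above; `charged_cents` and `total_cents` are hypothetical helpers.

```python
from typing import Iterable, Optional, Tuple


def charged_cents(usage: dict) -> Optional[float]:
    """Return the charge in EUR cents, or None if it could not be determined."""
    if usage.get("cost_unavailable") or usage.get("cost_cents") is None:
        return None
    return usage["cost_cents"]


def total_cents(usages: Iterable[dict]) -> Tuple[float, int]:
    """Sum the known charges and count the turns whose charge is unknown."""
    total = 0
    unknown = 0
    for usage in usages:
        cents = charged_cents(usage)
        if cents is None:
            unknown += 1  # never fold an unknown charge into the total as 0
        else:
            total += cents
    return total, unknown
```

Reporting the unknown count alongside the total lets a billing job retry or flag those turns rather than undercharging silently.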