WebSocket
Connection
Client → Server Messages
| Message | Description |
|---|---|
| `{"text": " ", "context_id": "ctx1", "voice_settings": {"voice_id": 1071}}` | Initialize context with voice |
| `{"text": "Hello", "context_id": "ctx1"}` | Send text to context |
| `{"text": "...", "context_id": "ctx1", "flush": true}` | Send text and flush buffer |
| `{"flush": true, "context_id": "ctx1"}` | Flush context buffer |
| `{"text": "", "context_id": "ctx1"}` | Keep-alive: an empty-text frame resets the context’s inactivity timeout without generating audio |
| `{"close_context": true, "context_id": "ctx1"}` | Close a context, letting queued sentences finish first |
| `{"close_context": true, "context_id": "ctx1", "immediate": true}` | Barge-in: cancel the context’s in-flight generation immediately and drop buffered text (see Barge-in) |
| `{"close_socket": true}` | Close all contexts and the connection |
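The client frames in the table above can be assembled with a few small helpers. This is a minimal sketch rather than an official client: the function names (`init_context`, `send_text`, `keep_alive`, `close_context`) are hypothetical, and only the JSON fields are taken from the table.

```python
import json


def init_context(context_id, voice_id):
    """First frame for a context: a single space plus nested voice_settings."""
    return {"text": " ", "context_id": context_id,
            "voice_settings": {"voice_id": voice_id}}


def send_text(context_id, text, flush=False):
    """Text frame; flush=True also asks the server to flush the buffer."""
    msg = {"text": text, "context_id": context_id}
    if flush:
        msg["flush"] = True
    return msg


def keep_alive(context_id):
    """Empty-text frame: resets the inactivity timeout, generates no audio."""
    return {"text": "", "context_id": context_id}


def close_context(context_id, immediate=False):
    """Graceful close by default; immediate=True is the barge-in variant."""
    msg = {"close_context": True, "context_id": context_id}
    if immediate:
        msg["immediate"] = True
    return msg


# Frames for one short turn, serialized as they would go over the socket.
frames = [
    init_context("ctx1", 1071),
    send_text("ctx1", "Hello"),
    send_text("ctx1", "How are you?", flush=True),
    close_context("ctx1"),
]
wire = [json.dumps(frame) for frame in frames]
```

Each string in `wire` would be sent as one WebSocket text frame; the transport itself is left out so the sketch stays independent of any particular WebSocket library.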
Server → Client Messages
| Message | Description |
|---|---|
| `{"context_created": true, "context_id": "ctx1"}` | Context created |
| `{"generation_started": true, "context_id": "ctx1", "chunk_id": 0, "text": "..."}` | Generation started |
| `{"audio": "base64...", "enc": "pcm_s16le", "context_id": "ctx1", "idx": 0, "sr": 24000, "samples": 4800, "chunk_id": 0}` | Audio chunk (field reference) |
| `{"chunk_complete": true, "context_id": "ctx1", "chunk_id": 0, "audio_seconds": 1.2, "gen_ms": 150}` | Chunk complete |
| `{"word_timestamps": [...], "context_id": "ctx1", "chunk_id": 0}` | Word-level time alignments (when enabled) |
| `{"final": true, "context_id": "ctx1"}` | End of audio for a flush (ElevenLabs `is_final` equivalent): every audio frame for text sent before your `{"flush": true}` has been delivered. Also sent right before `context_closed` on a graceful close. Not sent on an immediate (barge-in) close |
| `{"context_closed": true, "context_id": "ctx1", "usage": {"audio_seconds": 4.1, "cost_cents": 0.37, "currency": "eur", "model_id": "kugel-3"}}` | Context closed (terminal: all audio sent). `usage` carries this conversation’s audio time and the amount charged (EUR cents; `null` plus `cost_unavailable` if undetermined) |
| `{"session_closed": true, "total_audio_seconds": 5.4}` | Session ended (all contexts). Per-conversation usage is on each `context_closed`, not here |
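A receive loop typically dispatches on which key is present in each server frame. The sketch below is one possible way to do that, under stated assumptions: `handle_server_message`, `pcm` and `events` are hypothetical names, the audio is assumed to be mono when computing a duration, and the remaining frame types (`context_created`, `generation_started`, `chunk_complete`, `word_timestamps`) are simply ignored.

```python
import base64
import json


def handle_server_message(raw, pcm_buffers, events):
    """Route one server frame; returns True once the session has ended."""
    msg = json.loads(raw)
    ctx = msg.get("context_id")
    if "audio" in msg:
        # The payload is base64; with enc == "pcm_s16le" each sample is 2 bytes.
        chunk = base64.b64decode(msg["audio"])
        pcm_buffers.setdefault(ctx, bytearray()).extend(chunk)
    elif msg.get("final"):
        events.append(("final", ctx))
    elif msg.get("context_closed"):
        events.append(("closed", ctx, msg.get("usage")))
    elif msg.get("session_closed"):
        events.append(("session_closed", msg.get("total_audio_seconds")))
        return True
    return False


# Feed a few hand-built frames through the handler (no network involved).
pcm, events = {}, []
silence = base64.b64encode(bytes(9600)).decode("ascii")  # 9600 zero bytes
handle_server_message(json.dumps({
    "audio": silence, "enc": "pcm_s16le", "context_id": "ctx1",
    "idx": 0, "sr": 24000, "samples": 4800, "chunk_id": 0}), pcm, events)
handle_server_message('{"final": true, "context_id": "ctx1"}', pcm, events)
done = handle_server_message(
    '{"session_closed": true, "total_audio_seconds": 5.4}', pcm, events)

# Duration of the buffered audio, assuming mono 16-bit samples at 24000 Hz.
seconds = len(pcm["ctx1"]) / 2 / 24000
```

Keeping one buffer per `context_id` matters once several contexts share a connection, because their audio frames can then arrive interleaved.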
Voice Settings
When creating a context, pass voice settings as a nested voice_settings object, as in the context-initialization message under Client → Server Messages.
Session-Level Config
These options can be set on any message and apply to the entire session:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_id` | string | `kugel-3` | Model to use for generation. Use `kugel-3` for new integrations. |
| `sample_rate` | integer | 24000 | Output sample rate in Hz. Options: 8000, 16000, 22050, 24000 |
| `output_format` | string | - | Combined codec + rate token (e.g. `ulaw_8000`); see Audio formats. Set-once per session; may be sent top-level or inside `voice_settings`. |
| `normalize` | boolean | true | Enable text normalization |
| `language` | string | - | ISO 639-1 language code for normalization |
| `word_timestamps` | boolean | false | Enable word-level timestamp alignment |
| `dictionary_ids` | integer[] | omitted | Per-session dictionary selection. Omitted = all active dictionaries (language-filtered); `[]` = none; a list = exactly those (including inactive ones), bypassing the language filter |
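As an illustration of the table above, the following sketch attaches session-level options to an outgoing message. It assumes the options travel as top-level fields of the frame; the helper name `with_session_config` and the client-side sample-rate check are hypothetical conveniences, not part of the API.

```python
ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 24000}


def with_session_config(message, *, model_id="kugel-3", sample_rate=24000,
                        normalize=True, language=None,
                        word_timestamps=False, dictionary_ids=None):
    """Return a copy of `message` with session-level options added."""
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    out = dict(message)
    out.update(model_id=model_id, sample_rate=sample_rate,
               normalize=normalize, word_timestamps=word_timestamps)
    if language is not None:
        out["language"] = language
    # None leaves the field out (all active dictionaries, language-filtered);
    # an empty list is sent as [] and selects no dictionaries at all.
    if dictionary_ids is not None:
        out["dictionary_ids"] = list(dictionary_ids)
    return out


first = with_session_config(
    {"text": " ", "context_id": "ctx1", "voice_settings": {"voice_id": 1071}},
    sample_rate=16000, language="en", word_timestamps=True)
no_dictionaries = with_session_config({"text": "", "context_id": "ctx1"},
                                      dictionary_ids=[])
```

Distinguishing `None` from `[]` in the helper mirrors the difference between omitting `dictionary_ids` and sending an empty list.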
Reuse the same context_id across turns to keep one context alive
(recommended for a single conversation), or open new ids for parallel
speakers:
Example
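The two patterns, one context reused across turns versus one context per speaker, could be sketched as follows. The helper names and the `narrator`/`guest` context ids are hypothetical; voice id 1071 is reused for both speakers only because it is the one id that appears in this reference, and a real application would pass its own ids.

```python
def conversation_frames(context_id, voice_id, turns):
    """One context, initialized once and reused for every turn."""
    frames = [{"text": " ", "context_id": context_id,
               "voice_settings": {"voice_id": voice_id}}]
    for text in turns:
        frames.append({"text": text, "context_id": context_id, "flush": True})
    return frames


def parallel_frames(speakers):
    """One context per speaker; maps context_id -> (voice_id, text)."""
    frames = []
    for context_id, (voice_id, text) in speakers.items():
        frames.append({"text": " ", "context_id": context_id,
                       "voice_settings": {"voice_id": voice_id}})
        frames.append({"text": text, "context_id": context_id, "flush": True})
    return frames


single = conversation_frames("ctx1", 1071, ["Hi there.", "How can I help?"])
parallel = parallel_frames({
    "narrator": (1071, "Once upon a time."),
    "guest": (1071, "Tell me more."),
})
```

In the single-conversation case only the first frame carries `voice_settings`; every later turn refers to the context by id alone.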
Limits
- Maximum 20 concurrent contexts per connection
- Contexts auto-close after 20 seconds of inactivity (send the empty-text keep-alive to reset)
- Opening a context beyond the limit returns a per-context error (error_code: "TOO_MANY_CONTEXTS", code: 429) without closing the connection. Close an existing context, or wait for an idle one to be released, then retry.
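A client can guard against both limits with two small checks. This is a sketch under stated assumptions: the 5-second safety margin, the helper names, and the exact shape of the error frame beyond `error_code` and `code` (for example the presence of `context_id`) are hypothetical.

```python
INACTIVITY_TIMEOUT_S = 20.0  # contexts auto-close after this much inactivity
MAX_CONTEXTS = 20            # concurrent contexts per connection


def keep_alive_due(last_activity_at, now, margin_s=5.0):
    """True when an empty-text keep-alive should be sent for a context."""
    return now - last_activity_at >= INACTIVITY_TIMEOUT_S - margin_s


def is_context_limit_error(msg):
    """True for the per-context error returned beyond the context limit."""
    return msg.get("error_code") == "TOO_MANY_CONTEXTS" and msg.get("code") == 429


def can_open_context(open_context_ids):
    """Client-side check before opening another context."""
    return len(open_context_ids) < MAX_CONTEXTS


# Times are plain seconds here; a real loop might use time.monotonic().
too_early = keep_alive_due(last_activity_at=0.0, now=14.9)
due = keep_alive_due(last_activity_at=0.0, now=15.0)
limit_hit = is_context_limit_error(
    {"error_code": "TOO_MANY_CONTEXTS", "code": 429, "context_id": "ctx21"})
```

Because the error does not close the connection, a client that sees it can close or wait out an idle context and then retry opening the new one.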