For advanced use cases like multi-speaker conversations or pre-buffering audio, use the multi-context WebSocket endpoint (/ws/tts/multi). This allows managing up to 20 independent audio streams over a single connection.

Use cases

  • Multi-speaker conversations: Generate audio for different speakers concurrently
  • Pre-buffering: Start generating the next response while the current one plays
  • Interleaved audio: Dynamically switch between speakers in real time

Example

import asyncio
import websockets
import json
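import base64

def play_audio(context_id, audio_bytes):
    # Placeholder so the example runs as-is: replace with your own
    # playback or buffering logic for each context.
    print(f"[{context_id}] received {len(audio_bytes)} bytes of audio")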

async def multi_speaker_demo():
    async with websockets.connect(
        "wss://api.kugelaudio.com/ws/tts/multi?api_key=YOUR_API_KEY"
    ) as ws:
        # Initialize narrator context
        await ws.send(json.dumps({
            "text": " ",
            "context_id": "narrator",
            "voice_settings": {"voice_id": 1071},
        }))

        # Create character context
        await ws.send(json.dumps({
            "text": " ",
            "context_id": "character",
            "voice_settings": {"voice_id": 1072},
        }))

        # Send text to different speakers
        await ws.send(json.dumps({
            "text": "The story begins.",
            "context_id": "narrator",
            "flush": True,
        }))

        await ws.send(json.dumps({
            "text": "Hello, I'm the main character!",
            "context_id": "character",
            "flush": True,
        }))

        # Receive audio from both contexts
        async for message in ws:
            data = json.loads(message)

            if "audio" in data:
                context_id = data["context_id"]
                audio_bytes = base64.b64decode(data["audio"])
                play_audio(context_id, audio_bytes)

            if data.get("context_closed"):
                # Per-context usage for this conversation: audio time + charge
                print(f"[{data['context_id']}] usage: {data.get('usage')}")

            if data.get("session_closed"):
                break

        # Close when done
        await ws.send(json.dumps({"close_socket": True}))

asyncio.run(multi_speaker_demo())

Protocol summary

Each message you send carries a context_id; the first message for a new ID (typically {"text": " ", "context_id": "...", "voice_settings": {...}}) creates the context. Per context you can send text, flush, and close_context (with "immediate": true for barge-in); {"close_socket": true} ends everything.

The server tags every response frame (audio, chunk_complete, word_timestamps, final, context_closed, session_closed) with the originating context_id. After each flush, a final frame (ElevenLabs is_final equivalent) signals that all audio for the flushed text has been delivered.

The full message tables, covering every field of every client→server and server→client frame, live in the Text-to-Speech API reference. Usage is billed per context: each context_closed frame carries that context’s usage block. See Per-session usage.
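
As a rough illustration of the protocol described above, the sketch below builds the client messages as plain dictionaries and routes incoming frames by context_id. The helper names (open_context, send_text, close_context, close_socket, AudioDemux) are hypothetical and not part of any SDK. The exact shape of the close_context message, {"close_context": true, "context_id": "...", "immediate": true}, is an assumption inferred from the summary; check the Text-to-Speech API reference for the authoritative fields.

```python
import base64
import json
from collections import defaultdict


def open_context(context_id, voice_id):
    """First message for a new context_id; this creates the context."""
    return {
        "text": " ",
        "context_id": context_id,
        "voice_settings": {"voice_id": voice_id},
    }


def send_text(context_id, text, flush=False):
    """Text for an existing context; flush=True requests audio for it now."""
    msg = {"text": text, "context_id": context_id}
    if flush:
        msg["flush"] = True
    return msg


def close_context(context_id, immediate=False):
    """Close one context. ASSUMED shape; immediate=True is for barge-in."""
    msg = {"close_context": True, "context_id": context_id}
    if immediate:
        msg["immediate"] = True
    return msg


def close_socket():
    """End the whole session."""
    return {"close_socket": True}


class AudioDemux:
    """Collects decoded audio per context_id from raw server frames."""

    def __init__(self):
        self.buffers = defaultdict(bytearray)

    def feed(self, raw):
        """Parse one JSON frame, buffer its audio (if any), return it."""
        frame = json.loads(raw)
        context_id = frame.get("context_id")
        if "audio" in frame:
            self.buffers[context_id] += base64.b64decode(frame["audio"])
        return context_id, frame
```

With a live connection, each builder result would be passed through json.dumps and ws.send, and every received message handed to AudioDemux.feed. Keeping the builders free of I/O makes them easy to test without a server.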

Limits

  • Maximum 20 concurrent contexts per connection
  • Contexts auto-close after 20 seconds of inactivity
  • Send empty text {"text": "", "context_id": "..."} to reset the per-context inactivity timeout
  • Opening a context beyond the limit returns a per-context error (error_code: "TOO_MANY_CONTEXTS", code: 429) without closing the connection — close an existing context, or wait for an idle one to be released, then retry.
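
The limits above suggest two small client-side helpers: a keep-alive loop that sends empty text before the 20-second inactivity timeout expires, and a check for the TOO_MANY_CONTEXTS error frame. The following is a minimal sketch, not an official client. The names keepalive_message, is_too_many_contexts, and keep_alive are hypothetical, the 10-second interval is an arbitrary choice below the timeout, and send stands in for any async callable such as ws.send.

```python
import asyncio
import json

MAX_CONTEXTS = 20          # concurrent contexts per connection (from the limits)
INACTIVITY_TIMEOUT_S = 20  # contexts auto-close after this much idle time


def keepalive_message(context_id):
    """Empty text resets the inactivity timeout for one context."""
    return {"text": "", "context_id": context_id}


def is_too_many_contexts(frame):
    """True if a parsed server frame reports the per-context limit error."""
    return frame.get("error_code") == "TOO_MANY_CONTEXTS"


async def keep_alive(send, context_ids, interval_s=10.0, rounds=None):
    """Send a keep-alive for every context each interval_s seconds.

    send:        async callable that takes a JSON string (e.g. ws.send)
    context_ids: iterable of open context IDs, re-read on every round
    rounds:      stop after this many rounds; None runs until cancelled
    """
    if interval_s >= INACTIVITY_TIMEOUT_S:
        raise ValueError("interval_s must be below the inactivity timeout")
    done = 0
    while rounds is None or done < rounds:
        await asyncio.sleep(interval_s)
        for context_id in list(context_ids):
            await send(json.dumps(keepalive_message(context_id)))
        done += 1
```

In a real client, keep_alive would typically run as a background task (asyncio.create_task) alongside the receive loop and be cancelled before the socket is closed. When is_too_many_contexts returns True, the connection is still usable, so the client can close an existing context and retry.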