Stream text input token-by-token for LLM integration. This is the endpoint behind every SDK streaming session; the conceptual guide is Streaming overview and the turn semantics are on Turn lifecycle.
WebSocket

Connection

wss://api.kugelaudio.com/ws/tts/stream?api_key=YOUR_API_KEY

Protocol

  1. Send config (once): Initial configuration message. voice_id, audio format, and the other settings are sticky for the connection — you do not re-send them on later turns.
  2. Send text: Text chunks for the current turn as they arrive
  3. Send flush: Ends the turn — emits any trailing buffered text, streams its audio, then closes the turn’s session (session_closed). The socket stays open.
  4. Next turn: Send the next turn’s text (a fresh config is optional). Repeat. To end the whole connection, send close_socket.
  5. Receive audio: Audio chunks arrive as they’re generated, in parallel with steps 2–4 rather than after them.
One turn = one backend session. A turn ends when you send flush (or after a short idle gap — see below); each turn runs on its own freshly-prefilled voice session. A text WebSocket frame is not a hard sentence boundary by itself. For token streams, send raw tokens and flush once at the end of the turn. If your application sends already-complete phrases without terminal punctuation, include flush: true on that message or send a separate flush message.
Idle turns auto-end after 5 seconds. If you stream text but never flush, the server auto-flushes the buffered text after ~5 s of no new text, emits a warning frame, and ends the turn. WebSocket ping/keep-alive frames do not reset this — only sending flush (or new text) does. End each turn with an explicit flush for the lowest latency and to avoid the auto-flush. Full lifecycle: Turn lifecycle.
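The turn rules above can be sketched as a small message builder. This is a minimal sketch rather than SDK code: `turn_messages` and `phrase_message` are hypothetical helper names, and only the `text` and `flush` fields come from the protocol described here.

```python
import json
from typing import Iterable, Iterator


def turn_messages(tokens: Iterable[str]) -> Iterator[str]:
    """Yield the JSON frames for one turn: raw tokens, then a single flush."""
    for token in tokens:
        yield json.dumps({"text": token})
    # One explicit flush ends the turn and avoids the idle auto-flush.
    yield json.dumps({"flush": True})


def phrase_message(phrase: str) -> str:
    """Frame for an already-complete phrase: text and flush in one message."""
    return json.dumps({"text": phrase, "flush": True})
```

Each string from `turn_messages` would be passed to the socket's send call in order; `phrase_message` covers the case of complete phrases that lack terminal punctuation.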

Messages

Config Message

{
  "voice_id": 1071,
  "model_id": "kugel-3",
  "cfg_scale": 2.0,
  "temperature": 0.4,
  "sample_rate": 24000,
  "normalize": true,
  "language": "en",
  "word_timestamps": false,
  "flush_timeout_ms": 500,
  "max_buffer_length": 1000,
  "chunk_length_schedule": [5, 80, 150, 250],
  "auto_mode": false,
  "speed": 1.0
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | number | 0.4 | Sampling variance (0.0–1.0). 0 = most stable, 1 = most variance. |
| flush_timeout_ms | integer | 500 | Auto-flush buffered text after this many ms of no new input. |
| max_buffer_length | integer | 1000 | Maximum characters buffered before a forced flush. |
| chunk_length_schedule | list[int] | [5, 80, 150, 250] | Minimum buffer size (chars) before each successive chunk auto-emits. Entry i applies to chunk i; the last value repeats. Smaller = lower TTFA; larger = better prosody. |
| auto_mode | boolean | false | Start generating at the first clean sentence boundary, ignoring chunk_length_schedule (equivalent to ElevenLabs auto_mode=true). Lowest TTFA. |
| dictionary_ids | integer[] | omitted | Per-request dictionary selection, sticky for the session. Omitted = all active dictionaries (language-filtered); [] = none; a list = exactly those (including inactive ones), bypassing the language filter. |
All other fields share the meaning and defaults of the Generate Speech parameters.
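To make the `chunk_length_schedule` rule concrete ("entry i applies to chunk i; the last value repeats"), here is a minimal sketch of the lookup. `min_chars_for_chunk` is a hypothetical helper for illustration; the server's actual implementation is not shown in this reference.

```python
from typing import Sequence


def min_chars_for_chunk(schedule: Sequence[int], chunk_index: int) -> int:
    """Minimum buffered characters before chunk `chunk_index` auto-emits."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    if chunk_index < 0:
        raise ValueError("chunk_index must be non-negative")
    # Entry i applies to chunk i; past the end, the last value repeats.
    return schedule[min(chunk_index, len(schedule) - 1)]


schedule = [5, 80, 150, 250]  # the documented default
thresholds = [min_chars_for_chunk(schedule, i) for i in range(6)]
print(thresholds)  # [5, 80, 150, 250, 250, 250]
```

With the default schedule, the first chunk can start after only 5 characters (low TTFA), while later chunks wait for more text (better prosody).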

Text Message

{
  "text": "chunk of text"
}

Flush Message

{
  "flush": true
}

Close Message

End the current session; the WebSocket stays open and the server starts a fresh session on the next config / text message:
{
  "close": true
}
{"end_session": true} is accepted as an alias. To end the session and close the WebSocket connection, send {"close_socket": true} instead.

Cancel Message (barge-in)

{
  "cancel": true
}
Abandons the current turn immediately: in-flight generation is cancelled and buffered text dropped. The server acknowledges with {"interrupted": true}; the socket stays open for the next turn. See Barge-in.
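A barge-in handler might look like the sketch below. `cancel_turn` is a hypothetical helper; it assumes that frames from the cancelled turn can still arrive before the `interrupted` acknowledgement and should be discarded, which this reference does not state explicitly. The 2-second timeout is also an illustrative choice.

```python
import asyncio
import json


async def cancel_turn(ws, timeout: float = 2.0) -> int:
    """Send a cancel and wait for the {"interrupted": true} acknowledgement.

    `ws` is any object with async `send(str)` and `recv()` methods, such as
    a `websockets` connection. Returns the number of frames discarded while
    waiting (assumed to be stale frames from the cancelled turn).
    """
    await ws.send(json.dumps({"cancel": True}))
    discarded = 0
    while True:
        # Raises a timeout error if the acknowledgement never arrives.
        raw = await asyncio.wait_for(ws.recv(), timeout)
        frame = json.loads(raw)
        if frame.get("interrupted"):
            return discarded
        discarded += 1
```

After `cancel_turn` returns, the socket is ready for the next turn's text.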

Response Messages

Generation Started

{
  "generation_started": true,
  "chunk_id": 0,
  "text": "Hello, this is streaming."
}

Audio Chunk

{
  "audio": "base64_encoded_pcm16_data",
  "enc": "pcm_s16le",
  "idx": 0,
  "sr": 24000,
  "samples": 4800,
  "chunk_id": 0
}
Field-by-field reference: Audio formats.
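As an illustration, an audio frame like the one above can be decoded with the standard library alone. This is a sketch under two assumptions: `decode_audio_frame` is a hypothetical helper, and `samples` is taken to count samples at rate `sr`, so the duration is `samples / sr`.

```python
import array
import base64
import sys


def decode_audio_frame(frame: dict) -> tuple[array.array, float]:
    """Decode an audio frame into 16-bit samples and its duration in seconds."""
    if frame.get("enc") != "pcm_s16le":
        raise ValueError(f"unsupported encoding: {frame.get('enc')!r}")
    pcm = array.array("h")  # signed 16-bit integers
    pcm.frombytes(base64.b64decode(frame["audio"]))
    if sys.byteorder == "big":
        pcm.byteswap()  # the payload is little-endian
    # Assumption: duration = samples / sr.
    return pcm, frame["samples"] / frame["sr"]
```

For the sample frame above, 4800 samples at 24000 Hz would correspond to 0.2 seconds of audio.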

Word Timestamps (when word_timestamps: true)

{
  "word_timestamps": [
    {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98}
  ],
  "chunk_id": 0
}

Chunk Complete

{
  "chunk_complete": true,
  "chunk_id": 0,
  "audio_seconds": 1.2,
  "gen_ms": 150
}

Interrupted

Sent only in response to {"cancel": true} — the turn was cancelled and the session is ready for the next turn:
{
  "interrupted": true
}

Warning

Non-fatal advisory; the socket stays open. Currently emitted when a turn is auto-ended after the idle timeout because no flush was sent:
{
  "warning": "Turn ended after 5s of inactivity. Send {\"flush\": true} to end a turn explicitly — it lowers latency and avoids this auto-flush."
}

Final (End of Audio)

Sent after the last audio frame of every gracefully completed turn (explicit flush, close, or idle auto-flush), right before session_closed. Once you receive it, no further audio for the turn will arrive — the equivalent of ElevenLabs’ isFinal. It is not sent after a cancel (barge-in); that path acknowledges with interrupted instead.
{
  "final": true,
  "total_audio_seconds": 5.4,
  "total_text_chunks": 3,
  "total_audio_chunks": 15
}
Use final to stop waiting for audio (e.g. to end playback or hang up a call); use the session_closed frame that follows for usage/billing data.
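The end-of-turn signals described so far (`final`, `interrupted`, `warning`, `session_closed`) can be tracked with a small state object. This is a minimal sketch: `TurnState` and `apply_frame` are hypothetical names, and treating `interrupted` as "no more audio for this turn" is an assumption based on the cancel description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnState:
    audio_done: bool = False   # no more audio will arrive for this turn
    interrupted: bool = False  # turn ended via cancel, not gracefully
    closed: bool = False       # session_closed seen
    warnings: list = field(default_factory=list)
    usage: Optional[dict] = None


def apply_frame(state: TurnState, frame: dict) -> TurnState:
    """Update the turn state from one decoded server frame."""
    if frame.get("final"):
        state.audio_done = True          # safe to stop waiting for audio
    elif frame.get("interrupted"):
        state.audio_done = True          # cancel path: no final frame is sent
        state.interrupted = True
    elif "warning" in frame:
        state.warnings.append(frame["warning"])
    elif frame.get("session_closed"):
        state.closed = True
        state.usage = frame.get("usage")  # usage/billing data lives here
    return state
```

A receive loop would call `apply_frame` on every decoded frame, stop playback once `audio_done` is set, and read `usage` once `closed` is set.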

Session Closed

Sent at the end of every turn (on flush, idle auto-flush, or close). The socket stays open for the next turn.
{
  "session_closed": true,
  "total_audio_seconds": 5.4,
  "total_text_chunks": 3,
  "total_audio_chunks": 15,
  "usage": {
    "audio_seconds": 5.4,
    "characters": 142,
    "cost_cents": 0.49,
    "currency": "eur",
    "model_id": "kugel-3"
  }
}
The usage object reports the session’s consumed audio time and the actual amount charged (EUR cents) so you can bill per conversation — same fields as the /ws/tts final message. cost_cents is null with cost_unavailable: true if the charge can’t be determined (never a silent 0).
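For per-conversation billing, the `usage` objects from each turn's `session_closed` frame can be summed. The sketch below uses a hypothetical helper, `conversation_cost_cents`; it treats a null `cost_cents` as unknown rather than zero and counts those turns separately.

```python
from typing import Iterable, Tuple


def conversation_cost_cents(session_closed_frames: Iterable[dict]) -> Tuple[float, int]:
    """Sum known charges (EUR cents) and count turns with an unavailable cost."""
    total_cents = 0.0
    unknown_turns = 0
    for frame in session_closed_frames:
        usage = frame.get("usage") or {}
        cents = usage.get("cost_cents")
        if cents is None:
            # Documented case: cost_cents is null when the charge can't be determined.
            unknown_turns += 1
        else:
            total_cents += cents
    return total_cents, unknown_turns
```

Returning the unknown count alongside the total keeps an undetermined charge from being reported as a silent 0.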

Example

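import asyncio
import base64
import json

import websockets


def play_audio(pcm_bytes):
    """Placeholder sink: replace with your own audio output (sound device, file, etc.)."""
    pass
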
async def stream_from_llm(llm_tokens):
    uri = "wss://api.kugelaudio.com/ws/tts/stream?api_key=YOUR_API_KEY"

    async with websockets.connect(uri) as ws:
        # Send config
        await ws.send(json.dumps({
            "voice_id": 1071,
            "model_id": "kugel-3",
            "cfg_scale": 2.0,
        }))

        # Stream tokens
        for token in llm_tokens:
            await ws.send(json.dumps({"text": token}))

            # Poll for audio, waiting at most 10 ms so token sending is not held up
            try:
                message = await asyncio.wait_for(ws.recv(), timeout=0.01)
                data = json.loads(message)
                if "audio" in data:
                    audio_bytes = base64.b64decode(data["audio"])
                    play_audio(audio_bytes)
            except asyncio.TimeoutError:
                pass

        # Flush ends the turn (emits session_closed); close_socket ends the connection.
        # For a multi-turn conversation, skip close_socket and just send the next
        # turn's text after session_closed — the config above stays in effect.
        await ws.send(json.dumps({"flush": True}))
        await ws.send(json.dumps({"close_socket": True}))

        # Receive remaining audio
        async for message in ws:
            data = json.loads(message)
            if "audio" in data:
                audio_bytes = base64.b64decode(data["audio"])
                play_audio(audio_bytes)
            if data.get("session_closed"):
                usage = data.get("usage", {})
                # Per-session usage: audio time + actual charge (EUR cents)
                print(f"Usage: {usage.get('audio_seconds')}s, {usage.get('cost_cents')} ct")
                break

# Example usage
tokens = ["Hello, ", "this ", "is ", "streaming ", "from ", "an ", "LLM."]
asyncio.run(stream_from_llm(tokens))

Errors

See Error Codes for the full TTS error lookup table, including HTTP status codes, WebSocket close codes, and rate-limit behavior.