Stream Input - KugelAudio

Stream text input token-by-token for LLM integration. This is the endpoint behind every SDK streaming session; the conceptual guide is Streaming overview and the turn semantics are on Turn lifecycle.

WebSocket

Connection

wss://api.kugelaudio.com/ws/tts/stream?api_key=YOUR_API_KEY

Protocol

Send config (once): Initial configuration message. voice_id, audio format, and the other settings are sticky for the connection — you do not re-send them on later turns.
Send text: Text chunks for the current turn as they arrive
Send flush: Ends the turn — emits any trailing buffered text, streams its audio, then closes the turn’s session (session_closed). The socket stays open.
Next turn: Send the next turn’s text (a fresh config is optional). Repeat. To end the whole connection, send close_socket.
Receive audio: Audio chunks arrive as they are generated, interleaved with the messages you send

One turn = one backend session. A turn ends when you send flush (or after a short idle gap — see below); each turn runs on its own freshly-prefilled voice session. A text WebSocket frame is not a hard sentence boundary by itself. For token streams, send raw tokens and flush once at the end of the turn. If your application sends already-complete phrases without terminal punctuation, include flush: true on that message or send a separate flush message.

Idle turns auto-end after 5 seconds. If you stream text but never flush, the server auto-flushes the buffered text after ~5 s of no new text, emits a warning frame, and ends the turn. WebSocket ping/keep-alive frames do not reset this — only sending flush (or new text) does. End each turn with an explicit flush for the lowest latency and to avoid the auto-flush. Full lifecycle: Turn lifecycle.
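The turn structure described above can be sketched as a small generator that yields the JSON frames for one turn. This is a minimal illustration, not SDK code: `turn_frames` is a hypothetical helper name, and the only message shapes it uses are the config, text, and flush messages documented on this page.

```python
import json
from typing import Iterable, Iterator, Optional


def turn_frames(tokens: Iterable[str], config: Optional[dict] = None) -> Iterator[str]:
    """Yield the JSON text frames for one turn: optional config, text chunks, flush."""
    if config is not None:
        # Config is sticky for the connection, so later turns can omit it.
        yield json.dumps(config)
    for token in tokens:
        # Raw tokens are sent as they arrive; a text frame is not a sentence boundary.
        yield json.dumps({"text": token})
    # An explicit flush ends the turn instead of waiting for the idle auto-flush.
    yield json.dumps({"flush": True})


# First turn carries the config, the second turn reuses it.
first_turn = list(turn_frames(["Hello, ", "world."], config={"voice_id": 1071}))
second_turn = list(turn_frames(["Next ", "turn."]))
```

Each yielded string would be passed to the WebSocket's send call in order; the flush frame is always last so the turn ends explicitly.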

Messages

Config Message

{
  "voice_id": 1071,
  "model_id": "kugel-3",
  "cfg_scale": 2.0,
  "temperature": 0.4,
  "sample_rate": 24000,
  "normalize": true,
  "language": "en",
  "word_timestamps": false,
  "flush_timeout_ms": 500,
  "max_buffer_length": 1000,
  "chunk_length_schedule": [5, 80, 150, 250],
  "auto_mode": false,
  "speed": 1.0
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `temperature` | number | `0.4` | Sampling variance (0.0–1.0). 0 = most stable, 1 = most variance. |
| `flush_timeout_ms` | integer | `500` | Auto-flush buffered text after this many ms of no new input. |
| `max_buffer_length` | integer | `1000` | Maximum characters buffered before a forced flush. |
| `chunk_length_schedule` | `list[int]` | `[5, 80, 150, 250]` | Minimum buffer size (chars) before each successive chunk auto-emits. Entry `i` applies to chunk `i`; the last value repeats. Smaller = lower TTFA; larger = better prosody. |
| `auto_mode` | boolean | `false` | Start generating at the first clean sentence boundary, ignoring `chunk_length_schedule` (equivalent to ElevenLabs `auto_mode=true`). Lowest TTFA. |
| `dictionary_ids` | `integer[]` | omitted | Per-request dictionary selection, sticky for the session. Omitted = all active dictionaries (language-filtered); `[]` = none; a list = exactly those (including inactive ones), bypassing the language filter. |

All other fields share the meaning and defaults of the Generate Speech parameters.
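To make the `chunk_length_schedule` rule concrete ("entry `i` applies to chunk `i`; the last value repeats"), here is a small sketch of the threshold lookup. `min_chars_for_chunk` is a hypothetical helper, and it models only that one documented rule; the server's actual buffering also depends on `flush_timeout_ms`, `max_buffer_length`, and `auto_mode`.

```python
from typing import Sequence


def min_chars_for_chunk(schedule: Sequence[int], chunk_index: int) -> int:
    """Return the minimum buffer size for chunk `chunk_index`; the last entry repeats."""
    if not schedule:
        raise ValueError("schedule must contain at least one entry")
    if chunk_index < 0:
        raise ValueError("chunk_index must be non-negative")
    # Clamp the index so chunks past the end of the schedule reuse the last value.
    return schedule[min(chunk_index, len(schedule) - 1)]


schedule = [5, 80, 150, 250]
thresholds = [min_chars_for_chunk(schedule, i) for i in range(6)]
print(thresholds)  # [5, 80, 150, 250, 250, 250]
```

With the default schedule, the first chunk can start after only 5 buffered characters, while every chunk from the fourth onward waits for 250.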

Text Message

{
  "text": "chunk of text"
}

Flush Message

{
  "flush": true
}

Close Message

End the current session; the WebSocket stays open and the server starts a fresh session on the next config / text message:

{
  "close": true
}

{"end_session": true} is accepted as an alias. To end the session and close the WebSocket connection, send {"close_socket": true} instead.

Cancel Message (barge-in)

{
  "cancel": true
}

Abandons the current turn immediately: in-flight generation is cancelled and buffered text dropped. The server acknowledges with {"interrupted": true}; the socket stays open for the next turn. See Barge-in.
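A barge-in handler might look like the sketch below: send the cancel message, then read until the `interrupted` acknowledgement arrives. It assumes that frames from the cancelled turn can still arrive before the acknowledgement, which this page does not state, so the loop simply discards them. `barge_in` and `FakeSocket` are hypothetical names; `FakeSocket` is an in-memory stand-in so the example runs without a network connection.

```python
import asyncio
import json


async def barge_in(ws) -> int:
    """Send cancel, then drop frames until the server replies with interrupted.

    `ws` is any object with async send(str) and recv() methods.
    Returns the number of frames discarded before the acknowledgement.
    """
    await ws.send(json.dumps({"cancel": True}))
    discarded = 0
    while True:
        data = json.loads(await ws.recv())
        if data.get("interrupted"):
            return discarded
        # Assumption: stale frames from the cancelled turn may precede the ack.
        discarded += 1


class FakeSocket:
    """In-memory stand-in for a WebSocket connection, for illustration only."""

    def __init__(self, incoming):
        self.sent = []
        self._incoming = list(incoming)

    async def send(self, message):
        self.sent.append(message)

    async def recv(self):
        return self._incoming.pop(0)


fake = FakeSocket([
    json.dumps({"audio": "AAAA", "chunk_id": 0}),
    json.dumps({"interrupted": True}),
])
discarded = asyncio.run(barge_in(fake))
print(discarded)  # 1
```

In a real client, local playback would also be stopped at the moment the cancel is sent, since the user has already started speaking.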

Response Messages

Generation Started

{
  "generation_started": true,
  "chunk_id": 0,
  "text": "Hello, this is streaming."
}

Audio Chunk

{
  "audio": "base64_encoded_pcm16_data",
  "enc": "pcm_s16le",
  "idx": 0,
  "sr": 24000,
  "samples": 4800,
  "chunk_id": 0
}

Field-by-field reference: Audio formats.
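As an illustration of consuming an audio chunk, the sketch below decodes the base64 payload into signed 16-bit samples and derives the chunk duration from `sr`. It assumes mono audio, which this page does not state; `decode_audio_frame` is a hypothetical helper name.

```python
import base64
import struct


def decode_audio_frame(frame: dict) -> tuple[list[int], float]:
    """Decode a pcm_s16le audio frame into samples and its duration in seconds."""
    if frame.get("enc") != "pcm_s16le":
        raise ValueError(f"unsupported encoding: {frame.get('enc')!r}")
    raw = base64.b64decode(frame["audio"])
    count = len(raw) // 2  # two bytes per little-endian 16-bit sample
    samples = list(struct.unpack(f"<{count}h", raw[: count * 2]))
    # Assumption: one channel, so duration is simply sample count / sample rate.
    return samples, count / frame["sr"]


# Build a tiny synthetic frame in the documented shape to exercise the decoder.
pcm = struct.pack("<4h", 0, 1000, -1000, 32767)
frame = {
    "audio": base64.b64encode(pcm).decode("ascii"),
    "enc": "pcm_s16le",
    "idx": 0,
    "sr": 24000,
    "samples": 4,
    "chunk_id": 0,
}
samples, seconds = decode_audio_frame(frame)
print(samples)  # [0, 1000, -1000, 32767]
```

For the example frame in this section (`samples: 4800`, `sr: 24000`), the same arithmetic gives 0.2 seconds of audio.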

Word Timestamps (when `word_timestamps: true`)

{
  "word_timestamps": [
    {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98}
  ],
  "chunk_id": 0
}
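One possible use of this frame is highlighting the word currently being spoken. The sketch below looks up the word for a playback position, assuming `start_ms` and `end_ms` are offsets from the start of that chunk's audio (the page does not specify the reference point). `word_at` is a hypothetical helper, and the second entry in the sample data is made up for illustration.

```python
from typing import Optional


def word_at(timestamps: list[dict], position_ms: float) -> Optional[str]:
    """Return the word whose [start_ms, end_ms) interval contains position_ms."""
    for entry in timestamps:
        if entry["start_ms"] <= position_ms < entry["end_ms"]:
            return entry["word"]
    return None  # position falls in a gap between words or past the end


timestamps = [
    {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98},
    # Illustrative second entry, not taken from the API documentation.
    {"word": "world", "start_ms": 360, "end_ms": 700, "char_start": 7, "char_end": 12, "score": 0.95},
]
print(word_at(timestamps, 100))  # Hello
print(word_at(timestamps, 340))  # None
```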

Chunk Complete

{
  "chunk_complete": true,
  "chunk_id": 0,
  "audio_seconds": 1.2,
  "gen_ms": 150
}
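The two numeric fields can be combined into a rough real-time factor per chunk. This sketch assumes `gen_ms` is the server-side generation time in milliseconds for the chunk, which the page implies but does not define; `real_time_factor` is a hypothetical helper name.

```python
def real_time_factor(chunk_complete: dict) -> float:
    """Return generation time divided by audio duration (below 1.0 = faster than real time)."""
    audio_ms = chunk_complete["audio_seconds"] * 1000.0
    if audio_ms <= 0:
        raise ValueError("chunk has no audio")
    return chunk_complete["gen_ms"] / audio_ms


rtf = real_time_factor(
    {"chunk_complete": True, "chunk_id": 0, "audio_seconds": 1.2, "gen_ms": 150}
)
print(f"{rtf:.3f}")  # 0.125
```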

Interrupted

Sent only in response to {"cancel": true} — the turn was cancelled and the session is ready for the next turn:

{
  "interrupted": true
}

Warning

Non-fatal advisory; the socket stays open. Currently emitted when a turn is auto-ended after the idle timeout because no flush was sent:

{
  "warning": "Turn ended after 5s of inactivity. Send {\"flush\": true} to end a turn explicitly — it lowers latency and avoids this auto-flush."
}

Final (End of Audio)

Sent after the last audio frame of every gracefully completed turn (explicit flush, close, or idle auto-flush), right before session_closed. Once you receive it, no further audio for the turn will arrive — the equivalent of ElevenLabs’ isFinal. It is not sent after a cancel (barge-in); that path acknowledges with interrupted instead.

{
  "final": true,
  "total_audio_seconds": 5.4,
  "total_text_chunks": 3,
  "total_audio_chunks": 15
}

Use final to stop waiting for audio (e.g. to end playback or hang up a call); use the session_closed frame that follows for usage/billing data.
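Since each response frame in this section carries a distinguishing key, a receive loop can dispatch on that key. The sketch below is one way to classify frames; `frame_kind` and `FRAME_KEYS` are hypothetical names, and the key list covers only the frames documented on this page.

```python
FRAME_KEYS = (
    "audio",
    "generation_started",
    "word_timestamps",
    "chunk_complete",
    "interrupted",
    "warning",
    "final",
    "session_closed",
)


def frame_kind(data: dict) -> str:
    """Return the documented marker key present in a frame, or 'unknown'."""
    for key in FRAME_KEYS:
        if key in data:
            return key
    return "unknown"


incoming = [
    {"generation_started": True, "chunk_id": 0, "text": "Hello"},
    {"audio": "AAAA", "enc": "pcm_s16le", "idx": 0, "sr": 24000, "samples": 2, "chunk_id": 0},
    {"final": True, "total_audio_seconds": 5.4},
    {"session_closed": True, "total_audio_seconds": 5.4, "usage": {}},
]
kinds = [frame_kind(frame) for frame in incoming]
print(kinds)  # ['generation_started', 'audio', 'final', 'session_closed']
```

A client would typically stop playback bookkeeping when it sees `final` and read usage data from the `session_closed` frame that follows.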

Session Closed

Sent at the end of every turn (on flush, idle auto-flush, or close). The socket stays open for the next turn.

{
  "session_closed": true,
  "total_audio_seconds": 5.4,
  "total_text_chunks": 3,
  "total_audio_chunks": 15,
  "usage": {
    "audio_seconds": 5.4,
    "characters": 142,
    "cost_cents": 0.49,
    "currency": "eur",
    "model_id": "kugel-3"
  }
}

The usage object reports the session’s consumed audio time and the actual amount charged (EUR cents) so you can bill per conversation — same fields as the /ws/tts final message. cost_cents is null with cost_unavailable: true if the charge can’t be determined (never a silent 0).
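When recording usage per conversation, the null-cost case is worth handling explicitly rather than treating it as zero. A minimal sketch, with `describe_usage` as a hypothetical helper name and the output format chosen only for illustration:

```python
def describe_usage(usage: dict) -> str:
    """Summarise a usage object, keeping an unknown cost distinct from a zero cost."""
    seconds = usage.get("audio_seconds")
    cost = usage.get("cost_cents")
    if cost is None or usage.get("cost_unavailable"):
        # The charge could not be determined; do not report it as 0.
        return f"{seconds}s audio, cost unavailable"
    return f"{seconds}s audio, {cost} ct ({usage.get('currency')})"


known = describe_usage(
    {"audio_seconds": 5.4, "characters": 142, "cost_cents": 0.49, "currency": "eur", "model_id": "kugel-3"}
)
unknown = describe_usage({"audio_seconds": 5.4, "cost_cents": None, "cost_unavailable": True})
print(known)    # 5.4s audio, 0.49 ct (eur)
print(unknown)  # 5.4s audio, cost unavailable
```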

Example

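import asyncio
import base64
import json

import websockets  # third-party: pip install websockets


def play_audio(pcm_bytes):
    # Placeholder (not part of the API): hand the raw PCM16 bytes to your audio output.
    pass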

async def stream_from_llm(llm_tokens):
    uri = "wss://api.kugelaudio.com/ws/tts/stream?api_key=YOUR_API_KEY"

    async with websockets.connect(uri) as ws:
        # Send config
        await ws.send(json.dumps({
            "voice_id": 1071,
            "model_id": "kugel-3",
            "cfg_scale": 2.0,
        }))

        # Stream tokens
        for token in llm_tokens:
            await ws.send(json.dumps({"text": token}))

            # Poll briefly for audio (waits up to 10 ms); non-audio frames are skipped here
            try:
                message = await asyncio.wait_for(ws.recv(), timeout=0.01)
                data = json.loads(message)
                if "audio" in data:
                    audio_bytes = base64.b64decode(data["audio"])
                    play_audio(audio_bytes)
            except asyncio.TimeoutError:
                pass

        # Flush ends the turn (emits session_closed); close_socket ends the connection.
        # For a multi-turn conversation, skip close_socket and just send the next
        # turn's text after session_closed — the config above stays in effect.
        await ws.send(json.dumps({"flush": True}))
        await ws.send(json.dumps({"close_socket": True}))

        # Receive remaining audio
        async for message in ws:
            data = json.loads(message)
            if "audio" in data:
                audio_bytes = base64.b64decode(data["audio"])
                play_audio(audio_bytes)
            if data.get("session_closed"):
                usage = data.get("usage", {})
                # Per-session usage: audio time + actual charge (EUR cents)
                print(f"Usage: {usage.get('audio_seconds')}s, {usage.get('cost_cents')} ct")
                break

# Example usage
tokens = ["Hello, ", "this ", "is ", "streaming ", "from ", "an ", "LLM."]
asyncio.run(stream_from_llm(tokens))

Errors

See Error Codes for the full TTS error lookup table, including HTTP status codes, WebSocket close codes, and rate-limit behavior.
