Stream audio chunks as they are generated for lower latency. Each connection cycle handles one request; for token-by-token text input and multi-turn sessions, use Stream Input.
WebSocket

Connection

Connect with your API key:
wss://api.kugelaudio.com/ws/tts?api_key=YOUR_API_KEY
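If the key is read from configuration rather than pasted into the URL, it is safer to URL-encode it. A minimal sketch: the helper name `build_ws_url` is hypothetical, and only the endpoint and the `api_key` query parameter come from this page.

```python
from urllib.parse import urlencode

WS_TTS_URL = "wss://api.kugelaudio.com/ws/tts"


def build_ws_url(api_key: str) -> str:
    """Return the connection URL with the API key URL-encoded as a query parameter."""
    return f"{WS_TTS_URL}?{urlencode({'api_key': api_key})}"
```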

Request Message

Send a JSON message to start generation. Fields share the meaning and defaults of the Generate Speech parameters:
{
  "text": "Hello, this is streaming audio.",
  "model_id": "kugel-3",
  "voice_id": 1071,
  "cfg_scale": 2.0,
  "normalize": true,
  "language": "en",
  "speed": 1.0
}
word_timestamps (boolean, default: false)
Enable word-level timestamp alignment. When enabled, a word_timestamps message is sent after the audio chunks with per-word timing data.
speed (number, default: 1.0)
Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster). Uses pitch-preserving WSOLA.
dictionary_ids (integer[])
Per-request dictionary selection. Omitted = all active dictionaries (language-filtered); [] = none; a list applies exactly those dictionaries (including inactive ones), bypassing the language filter. Also accepted in the config of /ws/tts/stream and /ws/tts/multi, where it is sticky for the session.
speaker_prefix (boolean, default: true)
Prepend an internal speaker prefix to the text for better voice consistency.
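The optional fields above can be assembled client-side before sending. The sketch below is one possible approach, not part of the API: `build_request` is a hypothetical helper, and the client-side range check simply mirrors the documented 0.8 to 1.2 speed range (the server may validate differently). Optional fields are left out of the payload unless set, so the server defaults apply, and an empty `dictionary_ids` list is kept distinct from an omitted one.

```python
import json
from typing import Optional, Sequence

SPEED_RANGE = (0.8, 1.2)  # documented range for the speed multiplier


def build_request(
    text: str,
    voice_id: int,
    model_id: str = "kugel-3",
    *,
    language: Optional[str] = None,
    speed: Optional[float] = None,
    word_timestamps: Optional[bool] = None,
    dictionary_ids: Optional[Sequence[int]] = None,
) -> str:
    """Serialize a request message, including only the optional fields that were set."""
    payload = {"text": text, "model_id": model_id, "voice_id": voice_id}
    if speed is not None:
        low, high = SPEED_RANGE
        if not low <= speed <= high:
            raise ValueError(f"speed must be between {low} and {high}, got {speed}")
        payload["speed"] = speed
    if language is not None:
        payload["language"] = language
    if word_timestamps is not None:
        payload["word_timestamps"] = word_timestamps
    if dictionary_ids is not None:
        # [] means "no dictionaries"; omitting the key means "all active ones".
        payload["dictionary_ids"] = list(dictionary_ids)
    return json.dumps(payload)
```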
Text Normalization: Set normalize: true to convert numbers, dates, and symbols to spoken words. Always specify language to ensure correct normalization — auto-detection may produce incorrect results for short texts.
Spell Tags in Streaming: You can use <spell> tags even when streaming text token-by-token. The system automatically buffers text until spell tags are complete before generating audio. If a stream ends with an incomplete tag (e.g., connection drops), the tag is auto-closed.

Response Messages

Audio Chunk

{
  "audio": "base64_encoded_pcm16_data",
  "enc": "pcm_s16le",
  "idx": 0,
  "sr": 24000,
  "samples": 4800
}
Field-by-field reference: Audio formats.
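To illustrate how a chunk maps to samples and playback time, here is a hedged sketch that decodes one message. The helper name `decode_chunk` is hypothetical; it handles only the `pcm_s16le` encoding shown above and derives the duration from the `samples` and `sr` fields.

```python
import base64
import struct


def decode_chunk(msg: dict) -> tuple:
    """Return (16-bit sample values, chunk duration in ms) for one audio message."""
    if msg.get("enc") != "pcm_s16le":
        raise ValueError(f"unsupported encoding: {msg.get('enc')}")
    pcm = base64.b64decode(msg["audio"])
    # Each value is a little-endian signed 16-bit integer (2 bytes).
    values = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    duration_ms = 1000 * msg["samples"] / msg["sr"]
    return values, duration_ms
```

With the example message above, 4800 samples at 24000 Hz correspond to 200 ms of audio.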

Word Timestamps (when word_timestamps: true)

{
  "word_timestamps": [
    {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98}
  ]
}
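As a rough illustration of consuming this message, the sketch below turns each entry into a caption-style line. It assumes that `char_start` and `char_end` are offsets into the submitted text with an exclusive end, which matches the example above but is not stated on this page; `format_word_timings` is a hypothetical name.

```python
def format_word_timings(text: str, message: dict) -> list:
    """Return one 'start-end word' line per entry in a word_timestamps message."""
    lines = []
    for entry in message["word_timestamps"]:
        # Assumption: offsets index into the original request text, end exclusive.
        span = text[entry["char_start"]:entry["char_end"]]
        start_s = entry["start_ms"] / 1000
        end_s = entry["end_ms"] / 1000
        lines.append(f"{start_s:.2f}-{end_s:.2f}s {span}")
    return lines
```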

Final Message

On this endpoint, final is the request-complete message and carries the request’s stats and usage. (The streaming endpoints emit a lighter end-of-audio final without usage, followed by session_closed — see Turn lifecycle.)
{
  "final": true,
  "chunks": 10,
  "total_samples": 48000,
  "dur_ms": 2000,
  "gen_ms": 150,
  "rtf": 0.075,
  "usage": {
    "audio_seconds": 2.0,
    "characters": 31,
    "cost_cents": 0.18,
    "currency": "eur",
    "model_id": "kugel-3"
  }
}
| Field | Type | Description |
| --- | --- | --- |
| final | boolean | Indicates generation complete |
| chunks | integer | Number of chunks generated |
| total_samples | integer | Total audio samples generated |
| dur_ms | number | Total audio duration in ms |
| gen_ms | number | Total generation time in ms |
| rtf | number | Real-time factor (gen_ms / dur_ms) |
The usage object reports what this request consumed and what it was charged, so you can bill your own customers per request:
| Field | Description |
| --- | --- |
| audio_seconds | Audio generated for this request (the unit we bill on) |
| characters | Input characters submitted |
| cost_cents | Actual amount charged, in EUR cents. null (with cost_unavailable: true) if the charge could not be determined; never a misleading 0 |
| currency | Currency of cost_cents ("eur"); present only when cost_cents is set |
| model_id | Model that produced the audio |
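Because `cost_cents` can be null, per-request billing code should treat a missing charge separately from a zero charge. A minimal sketch, assuming the field names shown above; `summarize_final` is a hypothetical helper that also recomputes the real-time factor as gen_ms / dur_ms.

```python
def summarize_final(msg: dict) -> str:
    """Build a one-line summary of a final message, tolerating a missing charge."""
    usage = msg.get("usage") or {}
    cost = usage.get("cost_cents")
    if cost is None:
        # null cost means "could not be determined", not "free".
        cost_text = "cost unavailable"
    else:
        cost_text = f"{cost:.2f} {usage.get('currency', 'eur').upper()} cents"
    rtf = msg["gen_ms"] / msg["dur_ms"] if msg.get("dur_ms") else float("nan")
    return f"{usage.get('audio_seconds')}s audio, rtf {rtf:.3f}, {cost_text}"
```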

Example

import asyncio
import websockets
import json
import base64

async def stream_tts():
    uri = "wss://api.kugelaudio.com/ws/tts?api_key=YOUR_API_KEY"
    audio_chunks = []

    async with websockets.connect(uri) as ws:
        # Send request
        await ws.send(json.dumps({
            "text": "Hello, this is streaming audio.",
            "model_id": "kugel-3",
            "voice_id": 1071,
            "cfg_scale": 2.0,
        }))

        # Receive chunks
        async for message in ws:
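            data = json.loads(message)

            # Error frames carry "error", "error_code" and "code" (see Errors)
            if "error" in data:
                raise RuntimeError(f"{data.get('error_code')}: {data['error']}")
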
            if "audio" in data:
                audio_chunks.append(base64.b64decode(data["audio"]))
                print(f"Chunk {data['idx']}: {data['samples']} samples")

            if data.get("final"):
                print(f"Complete: {data['dur_ms']}ms audio in {data['gen_ms']}ms")
                usage = data.get("usage", {})
                # cost_cents is the actual charge (EUR cents); None if unavailable
                print(f"Usage: {usage.get('audio_seconds')}s, {usage.get('cost_cents')} ct")
                break

asyncio.run(stream_tts())
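The example collects raw PCM bytes in `audio_chunks` but does not save them. One way to write them to a playable file is Python's standard `wave` module, sketched below. The mono channel count is an assumption (this page does not state it), the 24000 Hz default is taken from the `sr` field of the example chunk, and `write_wav` is a hypothetical helper.

```python
import wave


def write_wav(path: str, chunks: list, sample_rate: int = 24000, channels: int = 1) -> None:
    """Write decoded pcm_s16le chunks to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)   # assumption: mono output
        wav.setsampwidth(2)          # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(chunks))
```

Calling `write_wav("output.wav", audio_chunks)` after the receive loop would save the full utterance.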

Errors

WebSocket error frames use the same JSON error shape as HTTP responses:
{
  "error": "Rate limit exceeded",
  "error_code": "RATE_LIMITED",
  "code": 429
}
WebSocket close codes are separate from the JSON code. See Error Codes for the full lookup table.
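A client can surface these frames as exceptions so that callers can branch on `error_code`. The sketch below is one possible shape: the class `KugelTTSError` and the function `raise_for_error` are hypothetical names, and only the three JSON fields come from the example above.

```python
from typing import Optional


class KugelTTSError(Exception):
    """Raised when the server sends a JSON error frame."""

    def __init__(self, message: str, error_code: Optional[str] = None, status: Optional[int] = None):
        super().__init__(message)
        self.error_code = error_code  # e.g. "RATE_LIMITED"
        self.status = status          # JSON "code" field, e.g. 429


def raise_for_error(data: dict) -> None:
    """Raise KugelTTSError if the decoded message is an error frame; otherwise do nothing."""
    if "error" in data:
        raise KugelTTSError(data["error"], data.get("error_code"), data.get("code"))
```

Calling `raise_for_error(data)` right after `json.loads(message)` in a receive loop keeps the audio and final-message handling free of error checks.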