Stream Speech - KugelAudio

Stream audio chunks as they are generated to reduce latency. This endpoint handles one request per connection cycle; for token-by-token text input and multi-turn sessions, use Stream Input.

WebSocket

Connection

Connect with your API key:

wss://api.kugelaudio.com/ws/tts?api_key=YOUR_API_KEY
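A minimal sketch of building this URL in code, so the key is URL-encoded and not hard-coded. The helper name `build_ws_url` and the environment variable `KUGELAUDIO_API_KEY` are assumptions made for this example; they are not part of the API.

```python
import os
from urllib.parse import urlencode

BASE_URL = "wss://api.kugelaudio.com/ws/tts"


def build_ws_url(api_key: str, base_url: str = BASE_URL) -> str:
    """Return the WebSocket URL with the API key as a query parameter."""
    if not api_key:
        raise ValueError("API key must not be empty")
    # urlencode escapes characters that are not safe inside a query string
    return f"{base_url}?{urlencode({'api_key': api_key})}"


if __name__ == "__main__":
    # KUGELAUDIO_API_KEY is a hypothetical variable name chosen for this sketch
    key = os.environ.get("KUGELAUDIO_API_KEY", "YOUR_API_KEY")
    print(build_ws_url(key))
```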

Request Message

Send a JSON message to start generation. Fields share the meaning and defaults of the Generate Speech parameters:

{
  "text": "Hello, this is streaming audio.",
  "model_id": "kugel-3",
  "voice_id": 1071,
  "cfg_scale": 2.0,
  "normalize": true,
  "language": "en",
  "speed": 1.0
}

Field	Type	Default	Description
`word_timestamps`	boolean	`false`	Enable word-level timestamp alignment. When enabled, a `word_timestamps` message is sent after the audio chunks with per-word timing data.
`speed`	number	`1.0`	Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster). Uses WSOLA time-stretching, which preserves pitch.
`dictionary_ids`	integer[]	(unset)	Per-request dictionary selection. If omitted, all active dictionaries apply (language-filtered). An empty list `[]` applies none. A non-empty list applies exactly those dictionaries, including inactive ones, and bypasses the language filter. Also accepted in the config of /ws/tts/stream and /ws/tts/multi, where it is sticky for the session.
`speaker_prefix`	boolean	`true`	Prepend an internal speaker prefix to the text for better voice consistency.
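The optional fields above can be assembled with a small helper. This is a sketch under assumptions: `build_request` is a hypothetical name, it treats `text`, `model_id`, and `voice_id` as required for illustration only, and it rejects out-of-range `speed` values on the client side (how the server handles them is not stated here). Optional fields are left out of the payload unless set, so the server-side defaults apply, and `dictionary_ids=None` (omitted) is kept distinct from `dictionary_ids=[]` (no dictionaries).

```python
import json
from typing import Optional, Sequence

SPEED_MIN, SPEED_MAX = 0.8, 1.2  # documented range for `speed`


def build_request(
    text: str,
    model_id: str,
    voice_id: int,
    *,
    language: Optional[str] = None,
    speed: Optional[float] = None,
    word_timestamps: Optional[bool] = None,
    dictionary_ids: Optional[Sequence[int]] = None,
) -> str:
    """Serialize a request message, omitting optional fields that were not set."""
    if speed is not None and not (SPEED_MIN <= speed <= SPEED_MAX):
        raise ValueError(f"speed must be within [{SPEED_MIN}, {SPEED_MAX}]")

    payload = {"text": text, "model_id": model_id, "voice_id": voice_id}
    optional = {
        "language": language,
        "speed": speed,
        "word_timestamps": word_timestamps,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})

    # None means "omit the field"; an empty list means "apply no dictionaries"
    if dictionary_ids is not None:
        payload["dictionary_ids"] = list(dictionary_ids)
    return json.dumps(payload)
```

For example, `build_request("Hi", "kugel-3", 1071, language="en", dictionary_ids=[])` produces a message that disables all dictionaries for that request.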

Text Normalization: Set normalize: true to convert numbers, dates, and symbols to spoken words. Always specify language so that normalization uses the correct rules; auto-detection may produce incorrect results for short texts.

Spell Tags in Streaming: You can use <spell> tags even when streaming text token-by-token. The system automatically buffers text until spell tags are complete before generating audio. If a stream ends with an incomplete tag (e.g., connection drops), the tag is auto-closed.

Response Messages

Audio Chunk

{
  "audio": "base64_encoded_pcm16_data",
  "enc": "pcm_s16le",
  "idx": 0,
  "sr": 24000,
  "samples": 4800
}

Field-by-field reference: Audio formats.
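A minimal sketch of handling one audio chunk with the standard library only. The helper names are hypothetical, and the WAV wrapper assumes mono audio, which this page does not state; check the Audio formats reference before relying on it.

```python
import base64
import io
import wave


def decode_chunk(message: dict) -> bytes:
    """Return the raw PCM bytes of one audio-chunk message."""
    if message.get("enc") != "pcm_s16le":
        raise ValueError(f"unexpected encoding: {message.get('enc')!r}")
    return base64.b64decode(message["audio"])


def chunk_duration_ms(message: dict) -> float:
    """Duration of the chunk derived from its sample count and sample rate."""
    return message["samples"] * 1000 / message["sr"]


def pcm_to_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap 16-bit little-endian PCM in a WAV container (assumes mono)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: single channel
        wav.setsampwidth(2)   # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

With the values in the example message, `chunk_duration_ms` gives 4800 * 1000 / 24000 = 200 ms per chunk.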

Word Timestamps (when `word_timestamps: true`)

{
  "word_timestamps": [
    {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98}
  ]
}
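One way to consume this message is to pair each word's timing with the slice of text it refers to. The sketch below assumes that `char_start` and `char_end` are offsets into the submitted `text` with an exclusive end (consistent with "Hello" spanning 0 to 5 in the example) and that `score` is an alignment confidence; neither is spelled out on this page. The name `word_spans` is hypothetical.

```python
from typing import List, Tuple


def word_spans(
    text: str, message: dict, min_score: float = 0.0
) -> List[Tuple[float, float, str, str]]:
    """Return (start_s, end_s, word, source_slice) for each aligned word."""
    rows = []
    for w in message["word_timestamps"]:
        # Skip low-scoring words; `score` is assumed to be a confidence value
        if w.get("score", 1.0) < min_score:
            continue
        source = text[w["char_start"]:w["char_end"]]
        rows.append((w["start_ms"] / 1000, w["end_ms"] / 1000, w["word"], source))
    return rows
```

If normalization rewrites the text before synthesis, the offsets may not line up with the original input, so compare `word` with the returned slice before using the offsets for highlighting.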

Final Message

On this endpoint, final is the request-complete message and carries the request’s stats and usage. (The streaming endpoints emit a lighter end-of-audio final without usage, followed by session_closed — see Turn lifecycle.)

{
  "final": true,
  "chunks": 10,
  "total_samples": 48000,
  "dur_ms": 2000,
  "gen_ms": 150,
  "rtf": 0.075,
  "usage": {
    "audio_seconds": 2.0,
    "characters": 31,
    "cost_cents": 0.18,
    "currency": "eur",
    "model_id": "kugel-3"
  }
}

Field	Type	Description
`final`	boolean	Indicates that generation is complete
`chunks`	integer	Number of chunks generated
`total_samples`	integer	Total audio samples generated
`dur_ms`	number	Total audio duration in ms
`gen_ms`	number	Total generation time in ms
`rtf`	number	Real-time factor (gen_ms / dur_ms)
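The fields above can be cross-checked against each other. The sketch below recomputes the duration from `total_samples` and the chunk sample rate `sr`, and the real-time factor from `gen_ms / dur_ms`. The duration relation is inferred from the example values (48000 samples at 24000 Hz is 2000 ms), not stated as a guarantee, and `check_final_stats` is a hypothetical helper.

```python
def check_final_stats(final: dict, sample_rate: int) -> dict:
    """Recompute duration and real-time factor from a final message."""
    dur_ms_from_samples = final["total_samples"] * 1000 / sample_rate
    rtf = final["gen_ms"] / final["dur_ms"]
    return {
        "dur_ms_from_samples": dur_ms_from_samples,
        "rtf": rtf,
        # rtf below 1.0 means audio was generated faster than it plays back
        "faster_than_realtime": rtf < 1.0,
    }
```

For the example message, 150 ms of generation for 2000 ms of audio gives an `rtf` of 0.075.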

The usage object reports what this request consumed and what it was charged, so you can bill your own customers per request:

Field	Description
`audio_seconds`	Audio generated for this request (the unit we bill on)
`characters`	Input characters submitted
`cost_cents`	Actual amount charged, in EUR cents. `null` (with `cost_unavailable: true`) if the charge could not be determined — never a misleading `0`
`currency`	Currency of `cost_cents` (`"eur"`); present only when `cost_cents` is set
`model_id`	Model that produced the audio
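When aggregating usage for per-request billing, a `null` cost has to be kept separate from a zero cost. A minimal sketch, assuming you collect the `usage` objects yourself; `summarize_usage` is a hypothetical helper, and `Decimal` is used only to avoid floating-point drift when summing cents.

```python
from decimal import Decimal
from typing import Iterable


def summarize_usage(usages: Iterable[dict]) -> dict:
    """Sum audio seconds and charged cents, counting requests without a known cost."""
    total_seconds = 0.0
    total_cents = Decimal("0")
    unpriced = 0
    for usage in usages:
        total_seconds += usage.get("audio_seconds", 0.0)
        cost = usage.get("cost_cents")
        if cost is None:
            # The charge could not be determined; do not treat it as free
            unpriced += 1
        else:
            total_cents += Decimal(str(cost))
    return {
        "audio_seconds": total_seconds,
        "cost_cents": total_cents,
        "unpriced_requests": unpriced,
    }
```

Requests counted under `unpriced_requests` can then be reconciled later instead of being billed at 0.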

Example

import asyncio
import websockets
import json
import base64

async def stream_tts():
    uri = "wss://api.kugelaudio.com/ws/tts?api_key=YOUR_API_KEY"
    audio_chunks = []

    async with websockets.connect(uri) as ws:
        # Send request
        await ws.send(json.dumps({
            "text": "Hello, this is streaming audio.",
            "model_id": "kugel-3",
            "voice_id": 1071,
            "cfg_scale": 2.0,
        }))

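        # Receive messages until the final frame arrives
        async for message in ws:
            data = json.loads(message)

            # Error frames carry "error", "error_code", and "code" (see Errors)
            if "error" in data:
                raise RuntimeError(f"{data.get('error_code')}: {data['error']}")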

            if "audio" in data:
                audio_chunks.append(base64.b64decode(data["audio"]))
                print(f"Chunk {data['idx']}: {data['samples']} samples")

            if data.get("final"):
                print(f"Complete: {data['dur_ms']}ms audio in {data['gen_ms']}ms")
                usage = data.get("usage", {})
                # cost_cents is the actual charge (EUR cents); None if unavailable
                print(f"Usage: {usage.get('audio_seconds')}s, {usage.get('cost_cents')} ct")
                break

asyncio.run(stream_tts())

Errors

WebSocket error frames use the same JSON error shape as HTTP responses:

{
  "error": "Rate limit exceeded",
  "error_code": "RATE_LIMITED",
  "code": 429
}

WebSocket close codes are separate from the JSON code. See Error Codes for the full lookup table.
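A minimal sketch of turning an error frame into a Python exception. `TTSError`, `raise_for_error`, and `is_rate_limited` are hypothetical names for this example, and only the `RATE_LIMITED` code shown above is used; consult Error Codes for the other values.

```python
from typing import Optional


class TTSError(Exception):
    """Raised when the server sends a JSON error frame."""

    def __init__(self, message: str, error_code: Optional[str], code: Optional[int]):
        super().__init__(f"[{error_code} / {code}] {message}")
        self.error_code = error_code
        self.code = code  # JSON "code" field, not the WebSocket close code


def raise_for_error(data: dict) -> None:
    """Raise TTSError if the decoded message is an error frame; otherwise do nothing."""
    if "error" in data:
        raise TTSError(data["error"], data.get("error_code"), data.get("code"))


def is_rate_limited(exc: TTSError) -> bool:
    """True for the RATE_LIMITED error shown in the example frame."""
    return exc.error_code == "RATE_LIMITED"
```

Calling `raise_for_error(data)` right after `json.loads(message)` keeps error handling in one place, and a caller can back off and retry when `is_rate_limited` returns true.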
