Generate Speech
Generate audio from text. Returns the complete audio after generation.

POST
Request Body

- **Text**: The text to convert to speech. Maximum 10,000 characters.
- **Model**: The model to use. Options: `kugel-1-turbo`, `kugel-1`.
- **Voice ID**: The voice ID to use. If not specified, the default voice is used.
- **CFG scale**: Classifier-free guidance scale. Range: 0.0-10.0. Higher values produce more expressive speech.
- **Max tokens**: Maximum tokens to generate. Range: 1-8192. Limits output length.
- **Sample rate**: Output sample rate in Hz. Options: 8000, 16000, 22050, 24000. Audio is generated natively at 24 kHz; lower rates use server-side resampling with minimal latency impact (~0.1 ms per chunk).
- **Normalize**: Enable text normalization (converts numbers, dates, etc. to spoken words). Always specify the `language` parameter to ensure correct normalization; auto-detection may produce incorrect results for short texts.
- **Language**: ISO 639-1 language code for text normalization (e.g., `de`, `en`, `fr`). Supported: de, en, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, el, uk, bg, tr, vi, ar, hi, zh, ja, ko, sk, sl, hr, sr, ru, he, fa, ur, bn, ta, yue, th, id, ms. If not provided and `normalize` is true, the language is auto-detected; auto-detection may produce incorrect normalizations for short texts or for languages that share similar vocabulary.
- **Speed**: Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster). Uses pitch-preserving time-stretching (WSOLA) so the voice pitch stays natural at any speed.
Inline `<prosody rate="slow|fast|...">` tags can be used for per-segment speed control within a single request.

Spell Tags
Use `<spell>` tags to spell out text letter by letter. This is useful for:
- Email addresses
- Acronyms and abbreviations
- Serial numbers or codes
- Any text that should be pronounced character by character
Spell tags require `normalize: true`. Special characters are translated to language-specific words:

- English: `@` → "at", `.` → "dot"
- German: `@` → "ät", `.` → "Punkt"
- French: `@` → "arobase", `.` → "point"
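As a sketch of how the tags combine with normalization, the snippet below builds a request payload containing a spelled-out email address. The JSON field names (`text`, `normalize`, `language`) are assumptions based on the parameters described above, not a confirmed schema.

```python
# Sketch of a request body using <spell> tags. The field names
# ("text", "normalize", "language") are assumptions, not a
# confirmed schema.
import json

def spell(fragment: str) -> str:
    """Wrap a fragment in <spell> tags so it is read letter by letter."""
    return f"<spell>{fragment}</spell>"

payload = {
    # The address is spelled out: "s-u-p-p-o-r-t at example dot com"
    "text": f"Please write to {spell('support@example.com')}.",
    "normalize": True,   # required for <spell> tags to work
    "language": "en",    # "@" -> "at", "." -> "dot"
}

print(json.dumps(payload, indent=2))
```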
Response
Returns raw PCM16 audio as a streaming binary response (audio/pcm).
AI watermark (EU AI Act Art. 50): All generated audio is automatically watermarked using AudioSeal, an imperceptible neural watermark. This is required under EU AI Act Article 50 for AI-generated audio content. The watermark is inaudible and does not affect audio quality.
| Header | Value | Description |
|---|---|---|
| Content-Type | audio/pcm | Raw PCM audio stream |
| X-Sample-Rate | 24000 | Sample rate of the audio |
| X-Audio-Format | pcm_s16le | Audio encoding format |
Example
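A minimal client sketch follows. The endpoint URL, auth header, and JSON field names are assumptions; substitute the real values for your account. Wrapping the raw `pcm_s16le` response in a WAV container matches the documented response format and headers.

```python
# Hypothetical client for the generate endpoint. The URL, the auth
# header, and the JSON field names are assumptions; replace them with
# the real values.
import io
import json
import wave

def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw mono pcm_s16le audio in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

def generate_speech(api_key: str, text: str) -> bytes:
    """POST the text and return the audio wrapped as WAV bytes."""
    import urllib.request
    req = urllib.request.Request(
        "https://api.example.com/v1/tts/generate",  # assumed URL
        data=json.dumps({"text": text, "normalize": True,
                         "language": "en"}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Respect the sample rate reported in the response headers.
        sr = int(resp.headers.get("X-Sample-Rate", "24000"))
        return pcm16_to_wav(resp.read(), sample_rate=sr)
```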
Stream Speech (WebSocket)
Stream audio chunks as they're generated for lower latency.

Audio generated via WebSocket endpoints is also watermarked (EU AI Act Art. 50). See the Generate Speech section for details.
WebSocket
Connection
Connect with your API key:

Request Message
Send a JSON message to start generation:

- **Word timestamps** (`word_timestamps`): Enable word-level timestamp alignment. When enabled, a `word_timestamps` message is sent after the audio chunks with per-word timing data.
- **Speed**: Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster). Uses pitch-preserving WSOLA.
- **Speaker prefix**: Prepend an internal speaker prefix to the text for better voice consistency.

Text Normalization: Set `normalize: true` to convert numbers, dates, and symbols to spoken words. Always specify `language` to ensure correct normalization; auto-detection may produce incorrect results for short texts.

Response Messages
Audio Chunk
Word Timestamps (when word_timestamps: true)
Final Message
Example
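A client sketch using the third-party `websockets` package. The endpoint URL, the `api_key` query parameter, and the request field names other than `word_timestamps` are assumptions based on the parameters above.

```python
# Streaming client sketch. The URL scheme, the api_key query
# parameter, and the "speed" field name are assumptions.
import asyncio
import base64
import json

def build_request(text: str) -> str:
    """Assemble the JSON message that starts generation."""
    return json.dumps({
        "text": text,
        "normalize": True,
        "language": "en",
        "word_timestamps": True,
        "speed": 1.0,  # assumed name for the playback-speed multiplier
    })

async def stream(url: str, api_key: str, text: str) -> bytes:
    """Collect base64 PCM chunks until the final message arrives."""
    import websockets  # pip install websockets
    pcm = bytearray()
    async with websockets.connect(f"{url}?api_key={api_key}") as ws:
        await ws.send(build_request(text))
        async for raw in ws:
            msg = json.loads(raw)
            if "audio" in msg:                 # audio chunk
                pcm.extend(base64.b64decode(msg["audio"]))
            elif msg.get("word_timestamps"):   # per-word timing data
                pass
            elif msg.get("final"):             # generation complete
                break
    return bytes(pcm)
```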
Stream Input (WebSocket)
Stream text input token-by-token for LLM integration.

WebSocket
Connection
Protocol
- Send config: Initial configuration message
- Send text: Text chunks as they arrive
- Send flush: Force generation of buffered text
- Send close: End the session
- Receive audio: Audio chunks as they’re generated
Messages
Config Message
Text Message
Flush Message
Close Message
Response Messages
Generation Started
Audio Chunk
Word Timestamps (when word_timestamps: true)
Chunk Complete
Session Closed
Example
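The protocol steps above can be sketched as a message sequence. The `{"type": ...}` envelope and field names are assumptions inferred from the message list (config, text, flush, close); only the message *kinds* come from the source.

```python
# Stream-input protocol sketch. The {"type": ...} envelope is an
# assumption; the message kinds (config/text/flush/close) come from
# the protocol list above.
import json

def config_msg(**options) -> str:
    return json.dumps({"type": "config", **options})

def text_msg(chunk: str) -> str:
    return json.dumps({"type": "text", "text": chunk})

def flush_msg() -> str:
    return json.dumps({"type": "flush"})

def close_msg() -> str:
    return json.dumps({"type": "close"})

# Typical session: configure, stream LLM tokens as they arrive,
# flush at sentence boundaries, then close.
session = [
    config_msg(model_id="kugel-1-turbo", normalize=True, language="en"),
    text_msg("Hello, "),
    text_msg("world."),
    flush_msg(),   # force generation of the buffered text
    close_msg(),   # end the session
]
```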
Multi-Context Streaming (WebSocket)
Manage up to 5 independent audio streams over a single WebSocket connection. Useful for multi-speaker conversations, pre-buffering, and interleaved audio.

WebSocket
Connection
Client → Server Messages
| Message | Description |
|---|---|
| `{"text": " ", "context_id": "ctx1", "voice_settings": {"voice_id": 123}}` | Initialize context with voice |
| `{"text": "Hello", "context_id": "ctx1"}` | Send text to context |
| `{"text": "...", "context_id": "ctx1", "flush": true}` | Send text and flush buffer |
| `{"flush": true, "context_id": "ctx1"}` | Flush context buffer |
| `{"close_context": true, "context_id": "ctx1"}` | Close specific context |
| `{"close_socket": true}` | Close all contexts and connection |
Server → Client Messages
| Message | Description |
|---|---|
| `{"context_created": true, "context_id": "ctx1"}` | Context created |
| `{"generation_started": true, "context_id": "ctx1", "chunk_id": 0, "text": "..."}` | Generation started |
| `{"audio": "base64...", "enc": "pcm_s16le", "context_id": "ctx1", "idx": 0, "sr": 24000, "samples": 4800, "chunk_id": 0}` | Audio chunk |
| `{"chunk_complete": true, "context_id": "ctx1", "chunk_id": 0, "audio_seconds": 1.2, "gen_ms": 150}` | Chunk complete |
| `{"word_timestamps": [...], "context_id": "ctx1", "chunk_id": 0}` | Word-level time alignments (when enabled) |
| `{"is_final": true, "context_id": "ctx1"}` | All generation complete for context |
| `{"context_closed": true, "context_id": "ctx1"}` | Context closed |
| `{"session_closed": true, "total_audio_seconds": 5.4}` | Session ended |
Voice Settings
When creating a context, pass voice settings as a nested object:

Session-Level Config
These options can be set on any message and apply to the entire session:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_id | string | kugel-1-turbo | Model to use for generation |
| sample_rate | integer | 24000 | Output sample rate in Hz. Options: 8000, 16000, 22050, 24000 |
| normalize | boolean | true | Enable text normalization |
| language | string | - | ISO 639-1 language code for normalization |
| word_timestamps | boolean | false | Enable word-level timestamp alignment |
Example
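A two-speaker sketch built from the message shapes in the tables above. The message fields (`text`, `context_id`, `voice_settings`, `flush`, `close_context`, `close_socket`) come directly from the client-to-server table; only the transport code around them is omitted, and the voice IDs are made up.

```python
# Multi-context message builders. Field names match the
# client-to-server table; the voice IDs are placeholders.
import json

def init_context(context_id: str, voice_id: int) -> str:
    """Initialize a context with a voice (note the single-space text)."""
    return json.dumps({"text": " ", "context_id": context_id,
                       "voice_settings": {"voice_id": voice_id}})

def send_text(context_id: str, text: str, flush: bool = False) -> str:
    msg = {"text": text, "context_id": context_id}
    if flush:
        msg["flush"] = True  # send text and flush buffer in one message
    return json.dumps(msg)

def close_context(context_id: str) -> str:
    return json.dumps({"close_context": True, "context_id": context_id})

# Two speakers interleaved over one connection (limit: 5 contexts):
outbound = [
    init_context("agent", voice_id=123),
    init_context("caller", voice_id=456),
    send_text("agent", "How can I help you today?", flush=True),
    send_text("caller", "I'd like to book a table.", flush=True),
    close_context("agent"),
    close_context("caller"),
    json.dumps({"close_socket": True}),  # end the whole session
]
```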
Limits
- Maximum 5 concurrent contexts per connection
- Contexts auto-close after 20 seconds of inactivity
Response Fields
Audio Chunk Fields (WebSocket)
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded PCM16 audio data |
| enc | string | Audio encoding (always pcm_s16le) |
| idx | integer | Chunk index (0-based) |
| sr | integer | Sample rate in Hz |
| samples | integer | Number of samples in this chunk |
| chunk_id | integer | Text chunk ID (present on /ws/tts/stream and /ws/tts/multi) |
| context_id | string | Context identifier (present on /ws/tts/multi) |
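To show how the chunk fields fit together, the snippet below decodes a synthetic audio-chunk message. The field names match the table above; the message itself is fabricated for illustration.

```python
# Decoding an audio-chunk message. Field names match the table above;
# the message is synthetic, built here only for illustration.
import base64
import json
import struct

raw_pcm = b"\x00\x00\x01\x00\xff\xff"   # three little-endian s16 samples
message = json.dumps({
    "audio": base64.b64encode(raw_pcm).decode(),
    "enc": "pcm_s16le",
    "idx": 0,
    "sr": 24000,
    "samples": 3,
})

chunk = json.loads(message)
pcm = base64.b64decode(chunk["audio"])
assert len(pcm) == chunk["samples"] * 2     # 2 bytes per s16 sample

samples = struct.unpack("<%dh" % chunk["samples"], pcm)
# samples == (0, 1, -1)
```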
Streaming Stats
| Field | Type | Description |
|---|---|---|
| final | boolean | Indicates generation complete |
| chunks | integer | Number of chunks generated |
| total_samples | integer | Total audio samples generated |
| dur_ms | number | Total audio duration in ms |
| gen_ms | number | Total generation time in ms |
| rtf | number | Real-time factor (gen_ms / dur_ms) |
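Worked example of the stats fields: the `rtf` formula (`gen_ms / dur_ms`) and the field names come from the table above; the numeric values are made up for illustration.

```python
# Computing the real-time factor from a final stats message.
# Field names and the rtf formula come from the table above;
# the numbers are illustrative.
stats = {"final": True, "chunks": 3, "total_samples": 120000,
         "dur_ms": 5000.0, "gen_ms": 750.0}

rtf = stats["gen_ms"] / stats["dur_ms"]
# rtf == 0.15: audio was generated well below real time

# dur_ms is consistent with total_samples at the native 24 kHz rate:
dur_check_ms = stats["total_samples"] / 24000 * 1000
# dur_check_ms == 5000.0
```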