Generate Speech
Generate audio from text. Returns the complete audio after generation.
POST
Request Body
The text to convert to speech. Maximum length depends on the model.
The model to use. Options: kugel-1-turbo, kugel-1.
The voice ID to use. If not specified, the default voice is used.
Classifier-free guidance scale. Range: 1.0-5.0. Higher values produce more expressive speech.
Maximum tokens to generate. Limits output length.
Output sample rate in Hz. Options: 8000, 16000, 22050, 24000.
Audio is generated natively at 24kHz. Lower rates use server-side resampling with minimal latency impact (~0.1ms per chunk).
Enable text normalization (converts numbers, dates, etc. to spoken words). For best performance, always specify the language parameter to skip auto-detection (~150ms latency).
ISO 639-1 language code for text normalization (e.g., ‘de’, ‘en’, ‘fr’). Supported: de, en, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, el, uk, bg, tr, vi, ar, hi, zh, ja, ko. If not provided and normalize is true, the language is auto-detected (adds ~150ms latency).
Spell Tags
Use <spell> tags to spell out text letter by letter. This is useful for:
- Email addresses
- Acronyms and abbreviations
- Serial numbers or codes
- Any text that should be pronounced character by character
Spell tags require normalize: true. Special characters are translated to language-specific words:
- English: @ → “at”, . → “dot”
- German: @ → “ät”, . → “Punkt”
- French: @ → “arobase”, . → “point”
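The spell-tag rules above can be sketched as a small request builder. This is only an illustration: the paired `<spell>...</spell>` closing form, the helper names `spell` and `build_payload`, and the `text` key are assumptions, while `normalize` and `language` come from the parameter descriptions above.

```python
import json


def spell(fragment: str) -> str:
    """Wrap a fragment so it is read character by character.

    Assumes paired <spell>...</spell> markup; the reference only names the tag.
    """
    return f"<spell>{fragment}</spell>"


def build_payload(text: str, language: str) -> dict:
    """Build a request body. The 'text' key is an assumed field name."""
    return {
        "text": text,
        "normalize": True,     # spell tags require normalization
        "language": language,  # set explicitly to skip auto-detection
    }


payload = build_payload(f"Write to {spell('help@example.com')} today.", "en")
print(json.dumps(payload, ensure_ascii=False))
```

Setting the language explicitly also determines which words are used for the special characters, for example “at” and “dot” for English.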
Response
Returns raw PCM16 audio as a streaming binary response (audio/pcm).
Response Headers:
| Header | Value | Description |
|---|---|---|
| Content-Type | audio/pcm | Raw PCM audio stream |
| X-Sample-Rate | 24000 | Sample rate of the audio |
| X-Audio-Format | pcm_s16le | Audio encoding format |
Example
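Because the response body is headerless PCM, most players need it wrapped in a container first. The following is a minimal sketch of doing that with Python's standard `wave` module, using the sample rate reported in X-Sample-Rate. Mono output is an assumption (the reference does not state the channel count), and the helper names are illustrative only.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw little-endian PCM16 samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)    # assumed mono
        wav.setsampwidth(2)           # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def duration_seconds(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> float:
    """Duration implied by the byte length of a PCM16 buffer."""
    return len(pcm) / (2 * channels * sample_rate)


# One second of silence stands in for a real response body.
silence = b"\x00\x00" * 24000
wav_bytes = pcm16_to_wav(silence)
print(len(wav_bytes), duration_seconds(silence))
```

If a different sample rate was requested, pass that value instead of the default so the playback speed stays correct.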
Stream Speech (WebSocket)
Stream audio chunks as they’re generated for lower latency.
WebSocket
Connection
Connect with your API key:
Request Message
Send a JSON message to start generation:
Text Normalization: Set normalize: true to convert numbers, dates, and symbols to spoken words. Always specify language to avoid ~150ms auto-detection latency.
Response Messages
Audio Chunk
Final Message
Example
Stream Input (WebSocket)
Stream text input token-by-token for LLM integration.
WebSocket
Connection
Protocol
- Send config: Initial configuration message
- Send text: Text chunks as they arrive
- Send flush: Force generation of buffered text
- Send close: End the session
- Receive audio: Audio chunks as they’re generated
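The ordering of the client-side steps above can be sketched as a generator that turns a token stream into a sequence of protocol steps. It models only the order of events, not the wire format of each message. Flushing at sentence-ending punctuation is an assumed policy chosen for illustration, and `plan_session` is a hypothetical helper.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str]  # (kind, payload); kinds mirror the protocol steps


def plan_session(tokens: Iterable[str], sentence_end: str = ".!?") -> Iterator[Event]:
    """Yield protocol steps in order: config, text chunks, flushes, close."""
    yield ("config", "")
    pending = False  # True while text has been sent but not yet flushed
    for token in tokens:
        yield ("text", token)
        pending = True
        # Assumed policy: flush whenever a token ends a sentence.
        if token.rstrip().endswith(tuple(sentence_end)):
            yield ("flush", "")
            pending = False
    if pending:
        yield ("flush", "")  # make sure trailing text is generated
    yield ("close", "")


events = list(plan_session(["Hello", " world.", " How", " are you"]))
print([kind for kind, _ in events])
```

Flushing on sentence boundaries trades a little latency for more natural prosody; flushing more often would start audio sooner at the cost of shorter generation units.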
Messages
Config Message
Text Message
Flush Message
Close Message
Example
Multi-Context Streaming (WebSocket)
Manage up to 5 independent audio streams over a single WebSocket connection. Useful for multi-speaker conversations, pre-buffering, and interleaved audio.
WebSocket
Connection
Client → Server Messages
| Message | Description |
|---|---|
| {"text": " ", "context_id": "ctx1", "voice_settings": {"voice_id": 123}} | Initialize context with voice |
| {"text": "Hello", "context_id": "ctx1"} | Send text to context |
| {"text": "...", "context_id": "ctx1", "flush": true} | Send text and flush buffer |
| {"flush": true, "context_id": "ctx1"} | Flush context buffer |
| {"close_context": true, "context_id": "ctx1"} | Close specific context |
| {"close_socket": true} | Close all contexts and connection |
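The client messages in the table can be produced by a few small helpers, sketched below. The JSON shapes follow the table rows; the function names are illustrative and not part of the API.

```python
import json


def init_context(context_id: str, voice_id: int) -> str:
    """Initialize a context with a voice (text is a single space, as in the table)."""
    return json.dumps({
        "text": " ",
        "context_id": context_id,
        "voice_settings": {"voice_id": voice_id},
    })


def send_text(context_id: str, text: str, flush: bool = False) -> str:
    """Send text to a context, optionally flushing its buffer in the same message."""
    msg = {"text": text, "context_id": context_id}
    if flush:
        msg["flush"] = True
    return json.dumps(msg)


def flush_context(context_id: str) -> str:
    return json.dumps({"flush": True, "context_id": context_id})


def close_context(context_id: str) -> str:
    return json.dumps({"close_context": True, "context_id": context_id})


def close_socket() -> str:
    return json.dumps({"close_socket": True})


for line in (init_context("ctx1", 123), send_text("ctx1", "Hello"),
             flush_context("ctx1"), close_context("ctx1"), close_socket()):
    print(line)
```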
Server → Client Messages
| Message | Description |
|---|---|
| {"context_created": true, "context_id": "ctx1"} | Context created |
| {"generation_started": true, "context_id": "ctx1", "chunk_id": 0, "text": "..."} | Generation started |
| {"audio": "base64...", "enc": "pcm_s16le", "context_id": "ctx1", "idx": 0, "sr": 24000} | Audio chunk |
| {"chunk_complete": true, "context_id": "ctx1", "chunk_id": 0, "audio_seconds": 1.2} | Chunk complete |
| {"is_final": true, "context_id": "ctx1"} | All generation complete for context |
| {"context_closed": true, "context_id": "ctx1"} | Context closed |
| {"session_closed": true, "total_audio_seconds": 5.4} | Session ended |
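Since several contexts share one connection, incoming messages have to be routed by context_id. Below is a minimal sketch of a dispatcher that classifies each server message by its distinguishing key and collects audio per context. The `Session` class is hypothetical, and it assumes every message carries exactly one of the keys shown in the table.

```python
import base64
import json
from collections import defaultdict


class Session:
    """Collects decoded audio per context and tracks completion state."""

    EVENT_KEYS = ("context_created", "generation_started",
                  "chunk_complete", "context_closed")

    def __init__(self) -> None:
        self.audio = defaultdict(bytearray)  # context_id -> raw PCM16 bytes
        self.finished = set()                # contexts that sent is_final
        self.closed = False

    def handle(self, raw: str) -> str:
        """Process one server message and return the kind that was recognized."""
        msg = json.loads(raw)
        ctx = msg.get("context_id")
        if "audio" in msg:
            self.audio[ctx] += base64.b64decode(msg["audio"])
            return "audio"
        if msg.get("is_final"):
            self.finished.add(ctx)
            return "is_final"
        if msg.get("session_closed"):
            self.closed = True
            return "session_closed"
        for kind in self.EVENT_KEYS:
            if msg.get(kind):
                return kind
        return "unknown"


session = Session()
chunk = base64.b64encode(b"\x01\x00\x02\x00").decode("ascii")
session.handle(json.dumps({"audio": chunk, "enc": "pcm_s16le",
                           "context_id": "ctx1", "idx": 0, "sr": 24000}))
session.handle('{"is_final": true, "context_id": "ctx1"}')
print(len(session.audio["ctx1"]), sorted(session.finished))
```

Keeping one buffer per context lets interleaved chunks from different speakers be played back or saved independently.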
Voice Settings
When creating a context, pass voice settings as a nested object:
Example
Limits
- Maximum 5 concurrent contexts per connection
- Contexts auto-close after 20 seconds of inactivity
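A client can mirror these limits locally to avoid opening a sixth context or writing to one the server has already closed. The sketch below is one way to do that. It assumes any client message counts as activity and that the 20-second timer is measured from the last such message, neither of which the reference specifies. `ContextTracker` is a hypothetical helper.

```python
import time

MAX_CONTEXTS = 5       # maximum concurrent contexts per connection
IDLE_TIMEOUT_S = 20.0  # contexts auto-close after this much inactivity


class ContextTracker:
    """Client-side bookkeeping that mirrors the server's context limits."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._last_used = {}  # context_id -> time of last activity

    def _expire(self) -> None:
        """Forget contexts the server is assumed to have auto-closed."""
        now = self._clock()
        stale = [c for c, t in self._last_used.items() if now - t >= IDLE_TIMEOUT_S]
        for ctx in stale:
            del self._last_used[ctx]

    def touch(self, context_id: str) -> bool:
        """Record activity; return False if a new context would exceed the limit."""
        self._expire()
        is_new = context_id not in self._last_used
        if is_new and len(self._last_used) >= MAX_CONTEXTS:
            return False
        self._last_used[context_id] = self._clock()
        return True

    def close(self, context_id: str) -> None:
        self._last_used.pop(context_id, None)

    def active(self) -> list:
        self._expire()
        return sorted(self._last_used)


tracker = ContextTracker()
print([tracker.touch(f"ctx{i}") for i in range(6)])
```

Injecting the clock makes the timeout logic easy to test without sleeping.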
Response Fields
Audio Chunk Fields (WebSocket)
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded PCM16 audio data |
| enc | string | Audio encoding (always pcm_s16le) |
| idx | integer | Chunk index (0-based) |
| sr | integer | Sample rate in Hz |
| samples | integer | Number of samples in this chunk |
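A chunk with these fields can be decoded into signed 16-bit samples using only the standard library, as sketched below. The sketch assumes `samples` counts individual 16-bit values of single-channel audio. The example chunk is constructed locally and is not real API output.

```python
import array
import base64
import struct
import sys


def decode_chunk(msg: dict) -> array.array:
    """Decode a chunk's base64 payload into signed 16-bit samples."""
    if msg["enc"] != "pcm_s16le":
        raise ValueError(f"unexpected encoding: {msg['enc']}")
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(base64.b64decode(msg["audio"]))
    if sys.byteorder == "big":
        samples.byteswap()  # payload is little-endian
    # Cross-check against the declared count when the field is present.
    if "samples" in msg and msg["samples"] != len(samples):
        raise ValueError("sample count does not match payload length")
    return samples


# Locally built stand-in for a server chunk (four samples).
raw = struct.pack("<4h", 0, 1000, -1000, 32767)
chunk = {
    "audio": base64.b64encode(raw).decode("ascii"),
    "enc": "pcm_s16le",
    "idx": 0,
    "sr": 24000,
    "samples": 4,
}
decoded = decode_chunk(chunk)
print(list(decoded), len(decoded) / chunk["sr"])
```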
Streaming Stats
| Field | Type | Description |
|---|---|---|
| final | boolean | Indicates generation complete |
| chunks | integer | Number of chunks generated |
| total_samples | integer | Total audio samples generated |
| dur_ms | number | Total audio duration in ms |
| gen_ms | number | Total generation time in ms |
| rtf | number | Real-time factor (gen_ms / dur_ms) |
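The relationship between these fields can be checked with a short calculation. The values below are made up for illustration, and deriving dur_ms from total_samples assumes single-channel audio at the reported sample rate.

```python
def duration_ms(total_samples: int, sample_rate: int) -> float:
    """Audio duration implied by a sample count (assumes mono)."""
    return total_samples / sample_rate * 1000.0


def real_time_factor(gen_ms: float, dur_ms: float) -> float:
    """rtf = gen_ms / dur_ms, as defined in the table above."""
    if dur_ms <= 0:
        raise ValueError("dur_ms must be positive")
    return gen_ms / dur_ms


# Illustrative stats message, not real API output.
stats = {"final": True, "chunks": 12, "total_samples": 48000,
         "dur_ms": 2000.0, "gen_ms": 500.0, "rtf": 0.25}

expected_dur = duration_ms(stats["total_samples"], 24000)
rtf = real_time_factor(stats["gen_ms"], stats["dur_ms"])
print(expected_dur, rtf)
```

An rtf below 1 means the audio was generated faster than it takes to play back, which is the condition for gap-free streaming playback.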