POST
Request Body
This is the canonical parameter reference for TTS generation. The WebSocket endpoints accept the same fields, plus their own session controls (see Stream Input).

The model to use. Use kugel-3 for new integrations. Legacy IDs such as kugel-2.5 and kugel-2-turbo remain accepted for backwards compatibility. Accepted model IDs are billed and shown in Dashboard usage as requested, even when they route through the current production model.

The voice ID to use. Required: there is no default voice. A request without a voice_id is rejected with 400 MISSING_VOICE_ID; a voice_id that doesn’t exist (or isn’t visible to your API key) returns 404 NOT_FOUND.

Classifier-free guidance scale. Range: 0.0–10.0. Higher values are more expressive.
Sampling variance (0.0–1.0). 0 = most stable, 1 = most variance. See
temperature guidance.
Maximum tokens to generate. Range: 1-8192. Limits output length.
Output sample rate in Hz. Options: 8000, 16000, 22050, 24000.

Audio is generated natively at 24 kHz. Lower rates use server-side resampling with negligible latency impact (see Latency).
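The documented ranges above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `temperature` and `sample_rate` are field names used on this page, while `cfg_scale` and `max_new_tokens` are hypothetical names chosen for illustration, because this page does not state the field names for the guidance scale and the token limit.

```python
# Sample rates listed as valid options on this page.
ALLOWED_SAMPLE_RATES = (8000, 16000, 22050, 24000)


def check_generation_params(cfg_scale=None, temperature=None,
                            max_new_tokens=None, sample_rate=None):
    """Return a list of problems; an empty list means every given value is in range.

    cfg_scale and max_new_tokens are hypothetical parameter names.
    Values left as None are not checked.
    """
    problems = []
    if cfg_scale is not None and not 0.0 <= cfg_scale <= 10.0:
        problems.append(f"cfg_scale {cfg_scale} is outside 0.0-10.0")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append(f"temperature {temperature} is outside 0.0-1.0")
    if max_new_tokens is not None and not 1 <= max_new_tokens <= 8192:
        problems.append(f"max_new_tokens {max_new_tokens} is outside 1-8192")
    if sample_rate is not None and sample_rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sample_rate {sample_rate} is not one of {ALLOWED_SAMPLE_RATES}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every out-of-range value at once.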
Combined codec + rate token (e.g. ulaw_8000) for non-PCM output such as G.711 telephony codecs. Opt-in; when set it is authoritative and must not contradict sample_rate. See Audio formats.

Enable text normalization (converts numbers, dates, etc. to spoken words).

Always specify the language parameter to ensure correct normalization; auto-detection may produce incorrect results for short texts.

ISO 639-1 language code for text normalization (e.g., ‘de’, ‘en’, ‘fr’).

Supported: de, en, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, el, uk, bg, tr, vi, ar, hi, zh, ja, ko, sk, sl, hr, sr, ru, he, fa, ur, bn, ta, yue, th, id, ms

If not provided and normalize is true, language will be auto-detected. Auto-detection may produce incorrect normalizations for short texts or languages that share similar vocabulary.

WebSocket endpoints only. Enable word-level timestamp alignment (see Word timestamps). Not accepted by this REST endpoint: requests are strictly validated, so sending it here returns 422 Unprocessable Entity. Use a WebSocket endpoint or an SDK instead.

WebSocket endpoints only. Prepend an internal speaker prefix to the text for better voice consistency. Not accepted by this REST endpoint (strict validation returns 422).

Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster).

Uses pitch-preserving time-stretching (WSOLA) so the voice pitch stays natural at any speed. Applies to the whole request; wrap text in <prosody rate="slow|medium|fast|0.8-1.2"> to override the rate for a span (see Speed).

Per-request dictionary selection.
- Omitted (default): all active dictionaries of the project apply, filtered by language.
- []: no dictionary applies to this request.
- [7, 9]: exactly those dictionaries apply, including inactive ones, bypassing the language filter.

400 before generation starts. Maximum 50 IDs.

Temperature guidance
temperature controls how much the sampler varies across regenerations of the
same text. Lower values are closer to greedy decoding (stable, repeatable
reads); higher values are more expressive but less consistent.
| Use case | Suggested range |
|---|---|
| E-learning, IVR prompts, compliance reads | 0.0 – 0.3 |
| General voiceover, conversational UX | 0.4 – 0.6 (default 0.4) |
| Expressive narration, ads, character voices | 0.7 – 1.0 |
The default of 0.4 matches the TTS Studio natural preset. It was lowered from 0.5 to reduce intermittent word drops on short trailing sentences with kugel-3.
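The table above can be encoded as a small lookup so that application code picks a temperature by use case instead of hard-coding numbers. This is only a sketch: the use-case keys (`stable`, `general`, `expressive`) and the helper name are made up for illustration, and falling back to the low end of each range is a choice of this example, not documented behavior.

```python
# Suggested ranges from the table above, keyed by hypothetical use-case labels.
SUGGESTED_TEMPERATURE = {
    "stable": (0.0, 0.3),      # e-learning, IVR prompts, compliance reads
    "general": (0.4, 0.6),     # general voiceover, conversational UX
    "expressive": (0.7, 1.0),  # expressive narration, ads, character voices
}


def pick_temperature(use_case, requested=None):
    """Clamp a requested temperature into the suggested range for a use case.

    With no requested value, return the low end of the range, which for
    "general" equals the documented default of 0.4.
    """
    low, high = SUGGESTED_TEMPERATURE[use_case]
    if requested is None:
        return low
    return min(max(requested, low), high)
```

For example, `pick_temperature("stable", 0.9)` pulls an overly high value back down to 0.3, the top of the range suggested for compliance-style reads.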
Spell Tags
Use <spell> tags to spell out text letter by letter. This is useful for:
- Email addresses
- Acronyms and abbreviations
- Serial numbers or codes
- Any text that should be pronounced character by character
Spell tags require normalize: true. Special characters are translated to language-specific words:
- English: @ → “at”, . → “dot”
- German: @ → “ät”, . → “Punkt”
- French: @ → “arobase”, . → “point”
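A request body that uses a spell tag might be assembled as below. This is a sketch under assumptions: `voice_id`, `normalize`, and `language` are field names from this page, while the `text` field name, the helper names, and the voice ID value are hypothetical. The helper does no escaping of the wrapped text, since this page does not say how reserved characters inside the tag are handled.

```python
def spell(fragment: str) -> str:
    """Wrap a fragment in <spell> tags so it is read character by character."""
    return f"<spell>{fragment}</spell>"


def build_spelled_payload(prefix: str, fragment: str, voice_id: str, language: str) -> dict:
    """Build a request body whose text ends with a spelled-out fragment.

    normalize is forced to True because spell tags require it; language is
    passed explicitly so that "@" and "." map to the right spoken words.
    The "text" key is a hypothetical field name.
    """
    return {
        "text": f"{prefix} {spell(fragment)}",
        "voice_id": voice_id,
        "normalize": True,
        "language": language,
    }


payload = build_spelled_payload(
    "You can reach us at", "info@example.com", voice_id="YOUR_VOICE_ID", language="en"
)
```

Setting the language explicitly matters here because the spoken form of `@` and `.` differs per language, as the list above shows.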
Response
Returns raw PCM16 audio as a streaming binary response (audio/pcm).
For encoding details, non-PCM output, and the watermark, see
Audio formats.
Response Headers:
| Header | Value | Description |
|---|---|---|
| Content-Type | audio/pcm | Raw PCM audio stream |
| X-Sample-Rate | 24000 | Sample rate of the audio |
| X-Audio-Format | pcm_s16le | Audio encoding format |
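Raw PCM has no container, so most players need a WAV header before they can open it. The sketch below wraps a PCM16 body using Python's standard `wave` module and the response headers listed above. It assumes mono audio, which this page does not state, and it expects the headers as a plain dict with the exact capitalization shown in the table; the function names are illustrative.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, headers: dict, channels: int = 1) -> bytes:
    """Wrap raw little-endian PCM16 bytes in a WAV container.

    The sample rate is read from X-Sample-Rate (falling back to 24000).
    channels=1 is an assumption; the page does not state the channel count.
    """
    audio_format = headers.get("X-Audio-Format", "pcm_s16le")
    if audio_format != "pcm_s16le":
        raise ValueError(f"expected pcm_s16le, got {audio_format}")
    sample_rate = int(headers.get("X-Sample-Rate", 24000))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes, sample_rate: int, channels: int = 1) -> float:
    """Length of a PCM16 buffer in seconds (2 bytes per sample per channel)."""
    return len(pcm) / (2 * channels * sample_rate)
```

Reading the rate from the header rather than hard-coding 24000 keeps the conversion correct when a lower `sample_rate` was requested.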
Example
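A request against this endpoint could look roughly like the following sketch, written with Python's standard library. Several details are assumptions and must be replaced with the real values from your account and the endpoint documentation: the URL `https://api.example.com/v1/tts/generate`, the `Authorization: Bearer` header scheme, and the `text` and `model_id` field names are all hypothetical. Only `voice_id`, `sample_rate`, `normalize`, `language`, and `temperature` are field names taken from this page.

```python
import json
import urllib.request

# Hypothetical endpoint URL; substitute the real one.
TTS_URL = "https://api.example.com/v1/tts/generate"


def build_tts_request(api_key: str, text: str, voice_id: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for TTS generation.

    voice_id is required by the API (a missing one yields 400 MISSING_VOICE_ID),
    so this helper refuses to build a request without it.
    "text" and "model_id" are assumed field names.
    """
    if not voice_id:
        raise ValueError("voice_id is required; there is no default voice")
    body = {"text": text, "voice_id": voice_id, "model_id": "kugel-3", **options}
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, path: str) -> None:
    """Send the request and stream the raw PCM16 response body to a file."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(4096):
            out.write(chunk)


request = build_tts_request(
    "YOUR_API_KEY", "Hello from the API.", "YOUR_VOICE_ID",
    temperature=0.4, sample_rate=24000, normalize=True, language="en",
)
# synthesize_to_file(request, "speech.pcm")  # performs the network call
```

The body is written in chunks because the response is a streaming binary body; the resulting file is headerless PCM and needs the sample rate from `X-Sample-Rate` to be played back.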
Errors
See Error Codes for the full TTS error lookup table, including HTTP status codes, WebSocket close codes, and rate-limit behavior.

Related endpoints
Stream Speech
Same request, audio chunks streamed over a WebSocket
Stream Input
Token-by-token text input for LLM agents