Generate audio from text. Returns complete audio after generation.
POST /v1/tts/generate

Request Body

This is the canonical parameter reference for TTS generation. The WebSocket endpoints accept the same fields (plus their own session controls — see Stream Input).
text
string
required
The text to convert to speech. Maximum 10,000 characters. Supports inline <break> and <spell> tags; all other tags are stripped (see Prompting).
model_id
string
default:"kugel-3"
The model to use. Use kugel-3 for new integrations. Legacy IDs such as kugel-2.5 and kugel-2-turbo remain accepted for backwards compatibility. Accepted model IDs are billed and shown in Dashboard usage as requested, even when they route through the current production model.
voice_id
integer
required
The voice ID to use. Required — there is no default voice. A request without a voice_id is rejected with 400 MISSING_VOICE_ID; a voice_id that doesn’t exist (or isn’t visible to your API key) returns 404 NOT_FOUND.
cfg_scale
number
default:"2.0"
Classifier-free guidance scale. Range: 0.0–10.0. Higher values produce more expressive output.
temperature
number
default:"0.4"
Sampling variance (0.0–1.0). 0 = most stable, 1 = most variance. See temperature guidance.
max_new_tokens
integer
default:"2048"
Maximum tokens to generate. Range: 1-8192. Limits output length.
sample_rate
integer
default:"24000"
Output sample rate in Hz. Options: 8000, 16000, 22050, 24000. Audio is generated natively at 24 kHz. Lower rates use server-side resampling with negligible latency impact (see Latency).
output_format
string
Combined codec + rate token (e.g. ulaw_8000) for non-PCM output such as G.711 telephony codecs. Opt-in; when set it is authoritative and must not contradict sample_rate. See Audio formats.
normalize
boolean
default:"true"
Enable text normalization (converts numbers, dates, etc. to spoken words). Always specify the language parameter to ensure correct normalization; auto-detection may produce incorrect results for short texts.
language
string
ISO 639-1 language code for text normalization (e.g., 'de', 'en', 'fr').
Supported: de, en, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, el, uk, bg, tr, vi, ar, hi, zh, ja, ko, sk, sl, hr, sr, ru, he, fa, ur, bn, ta, yue, th, id, ms
If not provided and normalize is true, the language is auto-detected. Auto-detection may produce incorrect normalizations for short texts or for languages that share similar vocabulary.
word_timestamps
boolean
default:"false"
WebSocket endpoints only. Enable word-level timestamp alignment — see Word timestamps. Not accepted by this REST endpoint: requests are strictly validated, so sending it here returns 422 Unprocessable Entity. Use a WebSocket endpoint or an SDK instead.
speaker_prefix
boolean
default:"true"
WebSocket endpoints only. Prepend an internal speaker prefix to the text for better voice consistency. Not accepted by this REST endpoint (strict validation returns 422).
speed
number
default:"1.0"
Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster). Uses pitch-preserving time-stretching (WSOLA) so the voice pitch stays natural at any speed. Applies to the whole request; wrap text in <prosody rate="slow|medium|fast|0.8-1.2"> to override the rate for a span (see Speed).
dictionary_ids
integer[]
Per-request dictionary selection.
  • Omitted — default: all active dictionaries of the project apply, filtered by language.
  • [] — no dictionary applies to this request.
  • [7, 9] — exactly those dictionaries apply, including inactive ones, bypassing the language filter.
IDs must belong to the request’s project; unknown IDs return a 400 before generation starts. Maximum 50 IDs.
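The documented ranges and the three dictionary_ids states can be checked client-side before a request is sent. The following is a minimal sketch under stated assumptions: `build_tts_payload` and the `_OMIT` sentinel are hypothetical names, not part of any KugelAudio SDK, and the server remains the authority on validation.

```python
from typing import Any

# Hypothetical sentinel: lets callers distinguish "omitted" from "[]".
_OMIT = object()

ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 24000}
# Documented as WebSocket-only; the REST endpoint answers these with 422.
WEBSOCKET_ONLY = {"word_timestamps", "speaker_prefix"}
# Inclusive ranges taken from the parameter reference above.
RANGES = {
    "cfg_scale": (0.0, 10.0),
    "temperature": (0.0, 1.0),
    "max_new_tokens": (1, 8192),
    "speed": (0.8, 1.2),
}


def build_tts_payload(text: str, voice_id: int, *,
                      dictionary_ids: Any = _OMIT, **options: Any) -> dict:
    """Build a REST request body, rejecting values outside documented limits."""
    if len(text) > 10_000:
        raise ValueError("text exceeds the 10,000 character limit")
    rejected = WEBSOCKET_ONLY.intersection(options)
    if rejected:
        raise ValueError(f"WebSocket-only fields: {sorted(rejected)}")
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if options.get("sample_rate", 24000) not in ALLOWED_SAMPLE_RATES:
        raise ValueError("sample_rate must be 8000, 16000, 22050, or 24000")

    payload = {"text": text, "voice_id": voice_id, **options}
    if dictionary_ids is not _OMIT:
        ids = list(dictionary_ids)
        if len(ids) > 50:
            raise ValueError("at most 50 dictionary IDs are allowed")
        # [] disables dictionaries; a non-empty list selects exactly those IDs.
        payload["dictionary_ids"] = ids
    return payload
```

Leaving `dictionary_ids` out of the call keeps the key out of the payload, so the project's active dictionaries apply; passing `[]` sends an explicit empty list.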

Temperature guidance

temperature controls how much the sampler varies across regenerations of the same text. Lower values are closer to greedy decoding (stable, repeatable reads); higher values are more expressive but less consistent.
| Use case | Suggested range |
| --- | --- |
| E-learning, IVR prompts, compliance reads | 0.0–0.3 |
| General voiceover, conversational UX | 0.4–0.6 (default 0.4) |
| Expressive narration, ads, character voices | 0.7–1.0 |
The default of 0.4 tracks the TTS Studio natural preset. It was lowered from 0.5 to reduce intermittent word drops on short trailing sentences with kugel-3.
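One way to apply the suggested ranges above in client code is a small preset lookup. This is only an illustrative sketch: the preset names (`stable`, `general`, `expressive`) and the `pick_temperature` helper are hypothetical and not part of the API.

```python
# Suggested ranges from this section; the preset keys are hypothetical labels.
TEMPERATURE_RANGES = {
    "stable": (0.0, 0.3),      # e-learning, IVR prompts, compliance reads
    "general": (0.4, 0.6),     # general voiceover, conversational UX
    "expressive": (0.7, 1.0),  # expressive narration, ads, character voices
}


def pick_temperature(preset: str, position: float = 0.0) -> float:
    """Return a temperature inside the preset's suggested range.

    position 0.0 selects the low end of the range, 1.0 the high end.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    low, high = TEMPERATURE_RANGES[preset]
    return round(low + (high - low) * position, 2)
```

With the default position, `pick_temperature("general")` yields the API default of 0.4.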

Spell Tags

Use <spell> tags to spell out text letter by letter. This is useful for:
  • Email addresses
  • Acronyms and abbreviations
  • Serial numbers or codes
  • Any text that should be pronounced character by character
{
  "text": "My email is <spell>kajo@kugelaudio.com</spell>",
  "normalize": true,
  "language": "en"
}
Output: “My email is K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M”
Spell tags require normalize: true. Special characters are translated to language-specific words:
  • English: @ → “at”, . → “dot”
  • German: @ → “ät”, . → “Punkt”
  • French: @ → “arobase”, . → “point”
Model recommendation: Use kugel-3 for the best current spelling, prosody, and break tag support.
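Spell tags can also be inserted programmatically before the text is sent. The sketch below wraps e-mail addresses and sets normalize to true, as the tags require. The helper names and the deliberately simple e-mail pattern are assumptions made for illustration, and the input is assumed not to contain spell tags already.

```python
import re

# Simplified e-mail pattern for illustration only; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def spell_emails(text: str) -> str:
    """Wrap each e-mail address in <spell> tags (input must not be pre-tagged)."""
    return EMAIL_RE.sub(lambda m: f"<spell>{m.group(0)}</spell>", text)


def spell_request_fields(text: str, language: str) -> dict:
    """Return the request fields relevant to spelling."""
    # Spell tags require normalize: true, so it is set explicitly, and the
    # language is passed so special characters map to the right words.
    return {"text": spell_emails(text), "normalize": True, "language": language}
```

These fields would be merged with the remaining request parameters, such as the required voice_id.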

Response

Returns raw PCM16 audio as a streaming binary response (audio/pcm). For encoding details, non-PCM output, and the watermark, see Audio formats.
Response Headers:
| Header | Value | Description |
| --- | --- | --- |
| Content-Type | audio/pcm | Raw PCM audio stream |
| X-Sample-Rate | 24000 | Sample rate of the audio |
| X-Audio-Format | pcm_s16le | Audio encoding format |
The response body is raw PCM 16-bit signed little-endian audio data streamed as binary chunks.
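Because the body is headerless PCM, most players need it wrapped in a container first. The sketch below uses Python's standard `wave` module to do that. It assumes mono audio, which this page does not state, and the sample rate should be taken from the X-Sample-Rate header rather than hard-coded.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)     # mono is an assumption, see lead-in
        wav.setsampwidth(2)            # 16-bit samples = 2 bytes each
        wav.setframerate(sample_rate)  # ideally read from X-Sample-Rate
        wav.writeframes(pcm)
    return buffer.getvalue()


def pcm16_duration_seconds(pcm: bytes, sample_rate: int = 24000,
                           channels: int = 1) -> float:
    """Duration implied by the byte count: bytes / (2 * channels * rate)."""
    return len(pcm) / (2 * channels * sample_rate)
```

For example, 48,000 bytes of mono audio at 24,000 Hz correspond to one second of speech.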

Example

Non-ASCII characters (umlauts, accents, CJK, etc.): When calling the API directly (without an SDK), make sure to:
  1. Set the header Content-Type: application/json; charset=utf-8
  2. Normalize your text to Unicode NFC before sending (e.g. unicodedata.normalize("NFC", text) in Python)
  3. Always set the language parameter explicitly (e.g. "de", "fr") — relying on auto-detection can produce incorrect normalizations, especially for short texts
Without these steps, characters like ä, ö, ü, ß or accented letters may be garbled or mispronounced. Our SDKs handle this automatically.
# Save the raw PCM response to output.pcm instead of printing binary to the terminal
curl -X POST "https://api.kugelaudio.com/v1/tts/generate" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json; charset=utf-8" \
  --output output.pcm \
  -d '{
    "text": "Hello, this is a test of the KugelAudio API.",
    "model_id": "kugel-3",
    "voice_id": 1071,
    "cfg_scale": 2.0
  }'
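For direct calls from Python, the three non-ASCII steps listed in this section can be sketched with the standard library alone. `build_request` is a hypothetical helper name. The URL and headers are taken from the curl example, and the function only constructs the request without sending it.

```python
import json
import unicodedata
import urllib.request

API_URL = "https://api.kugelaudio.com/v1/tts/generate"  # from the curl example


def build_request(text: str, voice_id: int, language: str,
                  api_key: str) -> urllib.request.Request:
    """Build a POST request with NFC-normalized text and a UTF-8 JSON body."""
    payload = {
        "text": unicodedata.normalize("NFC", text),  # step 2: Unicode NFC
        "voice_id": voice_id,
        "language": language,                        # step 3: explicit language
    }
    # ensure_ascii=False keeps umlauts as real characters in the UTF-8 body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json; charset=utf-8",  # step 1
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` and reading the response body would then yield the raw PCM audio described under Response.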

Errors

See Error Codes for the full TTS error lookup table, including HTTP status codes, WebSocket close codes, and rate-limit behavior.
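As a rough illustration, the status codes mentioned on this page can be mapped to troubleshooting hints on the client. The table below covers only the cases this page names. It is not the full error list, the response body format is not assumed, and `explain_status` is a hypothetical helper.

```python
# Hints limited to the status codes described in this reference page.
STATUS_HINTS = {
    400: "Bad request, for example a missing voice_id (MISSING_VOICE_ID) "
         "or an unknown dictionary ID.",
    404: "NOT_FOUND: the voice_id does not exist or is not visible to this API key.",
    422: "Unprocessable Entity: strict validation failed, for example because a "
         "WebSocket-only field (word_timestamps, speaker_prefix) was sent.",
}


def explain_status(status: int) -> str:
    """Return a troubleshooting hint for an HTTP status code."""
    return STATUS_HINTS.get(
        status, f"HTTP {status}: see Error Codes for the full lookup table."
    )
```

When using `urllib`, the status is available as the `code` attribute of the raised `urllib.error.HTTPError`.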
