Generate Speech - KugelAudio

Generate audio from text. Returns complete audio after generation.

POST /v1/tts/generate

Request Body

This is the canonical parameter reference for TTS generation. The WebSocket endpoints accept the same fields (plus their own session controls — see Stream Input).

text

string

required

The text to convert to speech. Maximum 10,000 characters. Supports inline <break> and <spell> tags; all other tags are stripped (see Prompting).

model_id

string

default:"kugel-3"

The model to use. Use kugel-3 for new integrations. Legacy IDs such as kugel-2.5 and kugel-2-turbo remain accepted for backwards compatibility. Accepted model IDs are billed and shown in Dashboard usage as requested, even when they route through the current production model.

voice_id

integer

required

The voice ID to use. Required — there is no default voice. A request without a voice_id is rejected with 400 MISSING_VOICE_ID; a voice_id that doesn’t exist (or isn’t visible to your API key) returns 404 NOT_FOUND.

cfg_scale

number

default:"2.0"

Classifier-free guidance scale. Range: 0.0-10.0. Higher values produce more expressive speech.

temperature

number

default:"0.4"

Sampling variance (0.0–1.0). 0 = most stable, 1 = most variance. See temperature guidance.

max_new_tokens

integer

default:"2048"

Maximum tokens to generate. Range: 1-8192. Limits output length.

sample_rate

integer

default:"24000"

Output sample rate in Hz. Options: 8000, 16000, 22050, 24000.

Audio is generated natively at 24 kHz. Lower rates use server-side resampling with negligible latency impact (see Latency).

output_format

string

Combined codec + rate token (e.g. ulaw_8000) for non-PCM output such as G.711 telephony codecs. Opt-in; when set it is authoritative and must not contradict sample_rate. See Audio formats.

normalize

boolean

default:"true"

Enable text normalization (converts numbers, dates, etc. to spoken words).

Always specify the language parameter to ensure correct normalization; auto-detection may produce incorrect results for short texts.

language

string

ISO 639-1 language code for text normalization (e.g., ‘de’, ‘en’, ‘fr’).

Supported: de, en, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, el, uk, bg, tr, vi, ar, hi, zh, ja, ko, sk, sl, hr, sr, ru, he, fa, ur, bn, ta, yue, th, id, ms

If not provided and normalize is true, the language is auto-detected. Auto-detection may produce incorrect normalizations for short texts or languages that share similar vocabulary.

word_timestamps

boolean

default:"false"

WebSocket endpoints only. Enable word-level timestamp alignment — see Word timestamps. Not accepted by this REST endpoint: requests are strictly validated, so sending it here returns 422 Unprocessable Entity. Use a WebSocket endpoint or an SDK instead.

speaker_prefix

boolean

default:"true"

WebSocket endpoints only. Prepend an internal speaker prefix to the text for better voice consistency. Not accepted by this REST endpoint (strict validation returns 422).

speed

number

default:"1.0"

Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster).

Uses pitch-preserving time-stretching (WSOLA) so the voice pitch stays natural at any speed. Applies to the whole request; wrap text in <prosody rate="slow|medium|fast|0.8-1.2"> to override the rate for a span (see Speed).

dictionary_ids

integer[]

Per-request dictionary selection.

Omitted — default: all active dictionaries of the project apply, filtered by language.
[] — no dictionary applies to this request.
[7, 9] — exactly those dictionaries apply, including inactive ones, bypassing the language filter.

IDs must belong to the request’s project; unknown IDs return a 400 before generation starts. Maximum 50 IDs.
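The documented limits can be checked client-side before a request is sent. The following is a minimal sketch, not part of any KugelAudio SDK: the function name `build_rest_body` and its error messages are hypothetical, and the server remains the authority on validation. It also shows the three states of dictionary_ids, where omitting the key and sending an empty list mean different things.

```python
from typing import Any, Optional

ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 24000}
# Fields the reference marks as "WebSocket endpoints only" (REST returns 422).
WEBSOCKET_ONLY_FIELDS = {"word_timestamps", "speaker_prefix"}

# Inclusive ranges copied from the parameter reference above.
NUMERIC_RANGES = {
    "cfg_scale": (0.0, 10.0),
    "temperature": (0.0, 1.0),
    "max_new_tokens": (1, 8192),
    "speed": (0.8, 1.2),
}


def build_rest_body(
    text: str,
    voice_id: int,
    dictionary_ids: Optional[list] = None,
    **options: Any,
) -> dict:
    """Hypothetical helper: check documented limits, return a request body."""
    if len(text) > 10_000:
        raise ValueError("text exceeds 10,000 characters")
    if not isinstance(voice_id, int):
        raise ValueError("voice_id is required and must be an integer")

    rejected = WEBSOCKET_ONLY_FIELDS.intersection(options)
    if rejected:
        raise ValueError(f"WebSocket-only fields: {sorted(rejected)}")

    for name, (low, high) in NUMERIC_RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

    if options.get("sample_rate", 24000) not in ALLOWED_SAMPLE_RATES:
        raise ValueError("sample_rate must be 8000, 16000, 22050, or 24000")

    body = {"text": text, "voice_id": voice_id, **options}
    # None means "omit the key", so the project's active dictionaries apply.
    # An empty list is sent as-is and disables dictionaries for this request.
    if dictionary_ids is not None:
        if len(dictionary_ids) > 50:
            raise ValueError("dictionary_ids accepts at most 50 IDs")
        body["dictionary_ids"] = list(dictionary_ids)
    return body
```

Calling `build_rest_body("Hi", 1071)` leaves dictionary_ids out of the body entirely, while `build_rest_body("Hi", 1071, dictionary_ids=[])` sends an explicit empty list.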

Temperature guidance

temperature controls how much the sampler varies across regenerations of the same text. Lower values are closer to greedy decoding (stable, repeatable reads); higher values are more expressive but less consistent.

Use case	Suggested range
E-learning, IVR prompts, compliance reads	`0.0` – `0.3`
General voiceover, conversational UX	`0.4` – `0.6` (default `0.4`)
Expressive narration, ads, character voices	`0.7` – `1.0`

The default of 0.4 tracks the TTS Studio natural preset. It was lowered from 0.5 to reduce intermittent word-drop on short trailing sentences with kugel-3.
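One way to apply the table in code is to keep a requested temperature inside the suggested range for a use case. This is an illustrative sketch only: the category keys (`stable`, `general`, `expressive`) and the function name are made up here, and the ranges are the suggestions from the table, not limits enforced by the API.

```python
# Suggested ranges from the table above; the keys are hypothetical labels.
TEMPERATURE_RANGES = {
    "stable": (0.0, 0.3),      # e-learning, IVR prompts, compliance reads
    "general": (0.4, 0.6),     # general voiceover, conversational UX
    "expressive": (0.7, 1.0),  # expressive narration, ads, character voices
}


def clamp_temperature(value: float, use_case: str) -> float:
    """Pull a temperature into the suggested range for the given use case."""
    low, high = TEMPERATURE_RANGES[use_case]
    return min(max(value, low), high)
```

For example, `clamp_temperature(0.9, "stable")` returns 0.3, the upper end of the range suggested for IVR-style reads.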

Spell Tags

Use <spell> tags to spell out text letter by letter. This is useful for:

Email addresses
Acronyms and abbreviations
Serial numbers or codes
Any text that should be pronounced character by character

{
  "text": "My email is <spell>kajo@kugelaudio.com</spell>",
  "voice_id": 1071,
  "normalize": true,
  "language": "en"
}

Output: “My email is K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M”

Spell tags require normalize: true. Special characters are translated to language-specific words:

English: @ → “at”, . → “dot”
German: @ → “ät”, . → “Punkt”
French: @ → “arobase”, . → “point”

Model recommendation: Use kugel-3 for the best current spelling, prosody, and break tag support.
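To see roughly what a spell tag will produce before sending it, the expansion can be approximated locally. The sketch below is an assumption-laden preview, not the server's normalizer: it knows only the three languages and two special characters listed above, and both function names are hypothetical.

```python
# Special-character words taken from the list above; anything else is
# simply upper-cased, which may differ from the server's real output.
SPECIAL_WORDS = {
    "en": {"@": "at", ".": "dot"},
    "de": {"@": "ät", ".": "Punkt"},
    "fr": {"@": "arobase", ".": "point"},
}


def wrap_spell(text: str) -> str:
    """Wrap text in a <spell> tag for use in the request's text field."""
    return f"<spell>{text}</spell>"


def preview_spelling(text: str, language: str = "en") -> str:
    """Local approximation of how spelled-out text is read."""
    words = SPECIAL_WORDS[language]
    return ", ".join(words.get(ch, ch.upper()) for ch in text)
```

With the email address from the example, `preview_spelling("kajo@kugelaudio.com", "en")` reproduces the documented output string.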

Response

Returns raw PCM16 audio as a streaming binary response (audio/pcm). For encoding details, non-PCM output, and the watermark, see Audio formats.

Response headers:

Header	Value	Description
`Content-Type`	`audio/pcm`	Raw PCM audio stream
`X-Sample-Rate`	`24000`	Sample rate of the audio
`X-Audio-Format`	`pcm_s16le`	Audio encoding format

The response body is raw PCM 16-bit signed little-endian audio data streamed as binary chunks.
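Raw PCM has no container, so most players need it wrapped before playback. The following sketch uses Python's standard `wave` module to add a WAV header. It assumes mono audio, which this page does not state; check Audio formats before relying on that, and take the sample rate from the X-Sample-Rate header rather than hard-coding it.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a WAV container.

    channels=1 is an assumption; the reference above does not state it.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def pcm16_duration_seconds(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> float:
    """Duration implied by the byte count: bytes / (2 * channels * rate)."""
    return len(pcm) / (2 * channels * sample_rate)
```

At the default 24000 Hz, one second of mono PCM16 is 48,000 bytes, which is a quick sanity check on a downloaded response.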

Example

Non-ASCII characters (umlauts, accents, CJK, etc.): When calling the API directly (without an SDK), make sure to:

Set the header Content-Type: application/json; charset=utf-8
Normalize your text to Unicode NFC before sending (e.g. unicodedata.normalize("NFC", text) in Python)
Always set the language parameter explicitly (e.g. "de", "fr") — relying on auto-detection can produce incorrect normalizations, especially for short texts

Without these steps, characters like ä, ö, ü, ß or accented letters may be garbled or mispronounced. Our SDKs handle this automatically.
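The three steps above can be sketched with the Python standard library. The function name `encode_body` is hypothetical; the point is that the text is NFC-normalized, the JSON is serialized without ASCII escaping, and the bytes are encoded as UTF-8 to match the declared charset.

```python
import json
import unicodedata

# Header value from the guidance above.
HEADERS = {"Content-Type": "application/json; charset=utf-8"}


def encode_body(text: str, voice_id: int, language: str) -> bytes:
    """Hypothetical helper: NFC-normalize the text and encode the body as UTF-8."""
    body = {
        "text": unicodedata.normalize("NFC", text),
        "voice_id": voice_id,
        "normalize": True,
        "language": language,  # set explicitly instead of relying on detection
    }
    # ensure_ascii=False keeps characters such as ü as real UTF-8 bytes
    # rather than \uXXXX escapes.
    return json.dumps(body, ensure_ascii=False).encode("utf-8")
```

A decomposed input such as "u" followed by a combining diaeresis (U+0308) comes out as the single precomposed character ü (U+00FC) in the encoded body.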

curl -X POST "https://api.kugelaudio.com/v1/tts/generate" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{
    "text": "Hello, this is a test of the KugelAudio API.",
    "model_id": "kugel-3",
    "voice_id": 1071,
    "cfg_scale": 2.0
  }'
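A rough Python equivalent of the curl call, using only the standard library, might look like the sketch below. The function names and the chunk size are illustrative choices, and error handling is omitted; non-2xx responses raise `urllib.error.HTTPError`, which a real client would catch and map to the codes listed under Errors.

```python
import json
import urllib.request

API_URL = "https://api.kugelaudio.com/v1/tts/generate"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build the POST request shown in the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def generate_to_file(api_key: str, body: dict, path: str, chunk_size: int = 8192) -> int:
    """Stream the PCM response to a file and return the reported sample rate."""
    request = build_request(api_key, body)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        sample_rate = int(response.headers.get("X-Sample-Rate", "24000"))
        while chunk := response.read(chunk_size):
            out.write(chunk)
    return sample_rate


# Usage (performs a network call, so it is left commented out):
# rate = generate_to_file(
#     "YOUR_API_KEY",
#     {"text": "Hello, this is a test of the KugelAudio API.",
#      "model_id": "kugel-3", "voice_id": 1071, "cfg_scale": 2.0},
#     "output.pcm",
# )
```

The file written this way contains headerless PCM, so the returned sample rate is needed to play or convert it.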

Errors

See Error Codes for the full TTS error lookup table, including HTTP status codes, WebSocket close codes, and rate-limit behavior.

Related endpoints

Stream Speech

Same request, audio chunks streamed over a WebSocket.

Stream Input

Token-by-token text input for LLM agents.