KugelAudio generates speech directly from text. There is no voice direction layer — instead you shape the output by how you write the input. This page covers every mechanism available to control pronunciation, pacing, and emphasis.
| Tag | Purpose | Requires normalize |
|---|---|---|
| <spell>text</spell> | Spell out characters one by one | Yes |
| <prosody rate="slow\|medium\|fast\|0.8–1.2">text</prosody> | Adjust speed of a text span | No |
These are the only tags processed. All other tags are stripped before synthesis, although their inner text is kept; see Unsupported Tags below.
<spell> — Character-by-Character Pronunciation
Wrapping text in <spell> tags causes each character to be read out individually. Useful for email addresses, codes, acronyms, and serial numbers.
"Contact us at <spell>hello@kugelaudio.com</spell>"
→ "Contact us at H, E, L, L, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"
normalize: true must be enabled for spell tags to work. Special characters (@, ., -, _) are translated to language-specific spoken words.
Character translations by language
| Character | English | German | French | Spanish |
|---|---|---|---|---|
| @ | at | ät | arobase | arroba |
| . | dot | Punkt | point | punto |
| - | dash | Strich | tiret | guión |
| _ | underscore | Unterstrich | underscore | guión bajo |
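To preview what a <spell> span should sound like before sending a request, the table above can be turned into a small local helper. This is only a sketch of the documented behavior, not the service's implementation: the function name `spell_out` is hypothetical, and the language codes "fr" and "es" are assumptions (only "en" and "de" appear in the examples on this page).

```python
# Spoken forms for special characters, copied from the table above.
SPOKEN_SYMBOLS = {
    "en": {"@": "at", ".": "dot", "-": "dash", "_": "underscore"},
    "de": {"@": "ät", ".": "Punkt", "-": "Strich", "_": "Unterstrich"},
    "fr": {"@": "arobase", ".": "point", "-": "tiret", "_": "underscore"},  # code assumed
    "es": {"@": "arroba", ".": "punto", "-": "guión", "_": "guión bajo"},  # code assumed
}


def spell_out(text: str, language: str = "en") -> str:
    """Approximate the spoken form of the contents of a <spell> tag."""
    symbols = SPOKEN_SYMBOLS[language]
    # Letters are uppercased; digits and unlisted characters pass through as-is.
    parts = [symbols.get(ch, ch.upper()) for ch in text if not ch.isspace()]
    return ", ".join(parts)


print(spell_out("hello@kugelaudio.com"))
# H, E, L, L, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M
```

A helper like this can be useful in tests that compare expected transcripts, but the real output may differ for characters the table does not list.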
Examples
```python
# Email address
audio = client.tts.generate(
    text="Email us at <spell>hello@kugelaudio.com</spell>",
    normalize=True,
    language="en",
)

# Verification code
audio = client.tts.generate(
    text="Your code is <spell>A4-B9-XZ</spell>",
    normalize=True,
    language="en",
)

# Acronym with context
audio = client.tts.generate(
    text="We use <spell>TTS</spell>, text-to-speech, for audio output.",
    normalize=True,
    language="en",
)
```
```javascript
// Email address
const audio = await client.tts.generate({
  text: 'Email us at <spell>hello@kugelaudio.com</spell>',
  normalize: true,
  language: 'en',
});

// Verification code
const audio2 = await client.tts.generate({
  text: 'Your code is <spell>A4-B9-XZ</spell>',
  normalize: true,
  language: 'en',
});
```
```bash
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your code is <spell>A4-B9-XZ</spell>",
    "normalize": true,
    "language": "en"
  }' --output output.pcm
```
For clearer letter-by-letter pronunciation, prefer kugel-1 over kugel-1-turbo.
<prosody rate> — Inline Speed Control
Slow down or speed up a specific span of text without affecting the rest of the sentence. The tag itself is stripped before synthesis, and the audio for the inner text is time-stretched after generation.
"Call us at <prosody rate="slow">0 30 12 34 56 78</prosody> during business hours."
Rate values
| Value | Speed | Alias |
|---|---|---|
| "slow" | 0.8× (20% slower) | — |
| "medium" | 1.0× (normal) | — |
| "fast" | 1.2× (20% faster) | — |
| "0.8" – "1.2" | exact multiplier | numeric |
Values outside 0.8–1.2 are clamped. The speed request parameter sets a global default; <prosody rate> overrides it per-span.
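The mapping and clamping rule above can be written down as a few lines of Python. This is a minimal sketch of the documented rule, not the service's code: `resolve_rate` is a hypothetical name, and raising `ValueError` for non-numeric input is an assumption, since the page does not say how malformed values are handled.

```python
# Named rates and the supported numeric range, as listed in the table above.
NAMED_RATES = {"slow": 0.8, "medium": 1.0, "fast": 1.2}
MIN_RATE, MAX_RATE = 0.8, 1.2


def resolve_rate(value: str) -> float:
    """Map a rate attribute value to a speed multiplier in [0.8, 1.2]."""
    key = value.strip().lower()
    if key in NAMED_RATES:
        return NAMED_RATES[key]
    # Numeric values outside the supported range are clamped, not rejected.
    # float() raises ValueError for anything that is not a number (assumed behavior).
    return min(MAX_RATE, max(MIN_RATE, float(key)))
```

For example, `resolve_rate("0.85")` returns 0.85 unchanged, while `resolve_rate("2.0")` is clamped to 1.2.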
Examples
```python
# Slow down a phone number
audio = client.tts.generate(
    text='Call us at <prosody rate="slow">0 800 123 456</prosody> any time.',
    language="de",
)

# Mix speeds in one sentence
audio = client.tts.generate(
    text='<prosody rate="fast">Limited time offer!</prosody> '
         'Your confirmation code is <prosody rate="slow">X7-K2-9P</prosody>.',
    normalize=True,
    language="en",
)

# Numeric rate
audio = client.tts.generate(
    text='Details: <prosody rate="0.85">Article number 4 dash 0 0 7.</prosody>',
    language="en",
)
```
```javascript
// Slow down a phone number
const audio = await client.tts.generate({
  text: 'Call us at <prosody rate="slow">0 800 123 456</prosody> any time.',
  language: 'de',
});

// Mix speeds in one sentence
const audio2 = await client.tts.generate({
  text: '<prosody rate="fast">Limited time offer!</prosody> '
    + 'Your confirmation code is <prosody rate="slow">X7-K2-9P</prosody>.',
  normalize: true,
  language: 'en',
});
```
```bash
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Call us at <prosody rate=\"slow\">0 800 123 456</prosody> any time.",
    "language": "de"
  }' --output output.pcm
```
Global Speed Parameter
The speed request parameter applies a uniform speed to the entire synthesis. Use it when you want all output at a consistent rate.
```python
audio = client.tts.generate(
    text="This entire sentence is read 20% faster.",
    speed=1.2,
)
```
| speed | Effect | Typical use |
|---|---|---|
| 0.8 | 20% slower | Dictation, phone numbers, legal disclaimers |
| 1.0 | Normal (default) | General purpose |
| 1.2 | 20% faster | Notifications, fast-paced UI feedback |
<prosody rate> tags take precedence over the global speed for their spans.
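To see how the precedence rule plays out for a given input, the text can be split into spans locally, each paired with the rate that should apply to it. The sketch below is an illustration only: `effective_rates` is a hypothetical helper, the regular expression handles just the simple non-nested form shown on this page, and it assumes a tagged span replaces the global speed rather than multiplying it.

```python
import re

NAMED_RATES = {"slow": 0.8, "medium": 1.0, "fast": 1.2}
# Matches the simple, non-nested tag form used in the examples on this page.
PROSODY_RE = re.compile(r'<prosody rate="([^"]+)">(.*?)</prosody>', re.DOTALL)


def effective_rates(text: str, speed: float = 1.0) -> list:
    """Split text into (span, rate) pairs; tagged spans override `speed`."""
    spans = []
    pos = 0
    for match in PROSODY_RE.finditer(text):
        if match.start() > pos:
            # Untagged text before this tag uses the global speed.
            spans.append((text[pos:match.start()], speed))
        raw = match.group(1)
        rate = NAMED_RATES.get(raw)
        if rate is None:
            # Numeric rates are clamped to the documented 0.8-1.2 range.
            rate = min(1.2, max(0.8, float(raw)))
        spans.append((match.group(2), rate))
        pos = match.end()
    if pos < len(text):
        spans.append((text[pos:], speed))
    return spans


print(effective_rates('Call <prosody rate="slow">0 800</prosody> now.', speed=1.2))
# [('Call ', 1.2), ('0 800', 0.8), (' now.', 1.2)]
```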
Text Normalization
When normalize: true is set, numbers, dates, times, currencies, and units are converted to spoken words before synthesis.
| Input | Spoken output (English) |
|---|---|
| 3 items | three items |
| €50.99 | fifty euros and ninety-nine cents |
| 01/15/2024 | January fifteenth twenty twenty-four |
| 2:30 PM | two thirty PM |
| 100km/h | one hundred kilometres per hour |
Always set language explicitly when using normalization. Auto-detection can produce incorrect results for short texts or languages with shared vocabulary.
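One way to make this guidance hard to forget is to build request bodies through a small wrapper that enforces it. The sketch below is a suggestion, not part of any SDK: `build_payload` is a hypothetical helper, and it only uses the `text`, `normalize`, and `language` fields that appear in the curl examples on this page.

```python
from typing import Optional


def build_payload(text: str, language: Optional[str] = None,
                  normalize: bool = False) -> dict:
    """Build a request body that follows the normalization guidance."""
    # <spell> tags only work with normalization, so enable it when one is present.
    if "<spell>" in text:
        normalize = True
    # Refuse to rely on language auto-detection when normalizing.
    if normalize and language is None:
        raise ValueError("set language explicitly when normalize is enabled")
    payload = {"text": text}
    if normalize:
        payload["normalize"] = True
    if language is not None:
        payload["language"] = language
    return payload
```

Calling `build_payload("You have 3 messages.", normalize=True)` without a language raises `ValueError`, which surfaces the problem during development rather than as a mispronounced number in production.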
Punctuation as Pacing
The model respects natural punctuation cues — no special tags needed:
| Technique | Effect |
|---|---|
| , comma | Brief pause between clauses |
| . period | Sentence-end pause, falling intonation |
| … ellipsis | Longer trailing pause |
| — em dash | Abrupt pause / interruption feel |
| ? question mark | Rising intonation |
| ! exclamation | Energetic delivery |
| \n newline | Paragraph-level pause (similar to period) |
Punctuation is the recommended way to add natural rhythm. The <break> tag is not supported.
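If existing content already contains <break> tags, one option is to convert them to punctuation before sending the text. The following is a rough sketch under stated assumptions: `breaks_to_punctuation` is a hypothetical helper, and the 500 ms threshold that separates a comma from an ellipsis is an arbitrary choice for illustration, not a documented value.

```python
import re

# Matches tags such as <break time="300ms"/> or <break time="1.5s">,
# together with any whitespace around them.
BREAK_RE = re.compile(r'\s*<break\s+time="(\d+(?:\.\d+)?)(ms|s)"\s*/?>\s*')


def breaks_to_punctuation(text: str, long_pause_ms: float = 500.0) -> str:
    """Replace <break time="..."> tags with a comma or an ellipsis."""
    def replace(match):
        amount = float(match.group(1))
        millis = amount * 1000 if match.group(2) == "s" else amount
        # Short breaks become a comma, longer ones an ellipsis (assumed threshold).
        return "… " if millis >= long_pause_ms else ", "
    return BREAK_RE.sub(replace, text)


print(breaks_to_punctuation('Hold on<break time="1s"/>here it is.'))
# Hold on… here it is.
```

The helper does not check for punctuation that already precedes the tag, so review the result for doubled marks such as ".," when the source text is heavily punctuated.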
LLM System Prompt Patterns
When an LLM generates text that feeds directly into TTS, add instructions so it uses supported tags correctly:
```text
You are a voice assistant. Format your responses for text-to-speech output:
- For email addresses and codes, use <spell> tags:
  "Your code is <spell>ABC-123</spell>"
- For phone numbers or content that should be read slowly, use prosody tags:
  "Call <prosody rate="slow">0 800 555 1234</prosody>"
- Do NOT use markdown formatting (**, *, #, -, bullet points). It will be read aloud literally.
- Do NOT use emoji.
- Keep sentences short. End with punctuation.
- Write numbers as digits when they should be normalized: "You have 3 messages."
```
Strip markdown from LLM output before passing it to TTS. Asterisks, hashes, and bullet characters are read literally by the model.
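A simple post-processing step can catch the most common markdown markers even when the LLM ignores the prompt. The sketch below is deliberately rough and is not a full markdown parser: `strip_markdown` is a hypothetical helper that handles headings, bullet markers, bold, italic, and inline code, and leaves <spell> and <prosody> tags alone.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove common markdown markers while leaving <spell>/<prosody> tags intact."""
    cleaned_lines = []
    for line in text.splitlines():
        # Drop heading hashes and bullet markers at the start of a line.
        line = re.sub(r"^\s*(#{1,6}\s+|[-*+]\s+)", "", line)
        # Unwrap bold (** or __), italic (*), and inline code (`).
        line = re.sub(r"(\*\*|__|\*|`)(.+?)\1", r"\2", line)
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


print(strip_markdown("- **Your** code is <spell>ABC-123</spell>"))
# Your code is <spell>ABC-123</spell>
```

Single underscores are left untouched on purpose so that email addresses and identifiers inside <spell> tags survive. Text that uses asterisks for multiplication would be altered, so check the helper against real model output before relying on it.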
Unsupported Tags
KugelAudio processes only <spell> and <prosody rate>. All other tags are silently stripped: the inner text is kept, but the tag itself has no effect.
| Tag / Attribute | Status | Alternative |
|---|---|---|
| <speak> wrapper | Stripped | Omit — plain text is assumed |
| <prosody pitch="..."> | Stripped | No pitch control available |
| <prosody volume="..."> | Stripped | No volume control available |
| <prosody duration="..."> | Stripped | Use speed parameter instead |
| <emphasis> | Stripped | Rephrase text for natural emphasis |
| <break time="..."> | Stripped | Use punctuation (., ,, …) |
| <say-as interpret-as="..."> | Stripped | Use <spell> for characters, normalization for numbers |
| <sub alias="..."> | Stripped | Write the spoken form directly in the text |
| <audio>, <p>, <s>, <w>, <lang> | Stripped | — |
Unknown tags are not validated at request time. Passing unsupported tags will not return an error — the tags are removed and the remaining text is synthesized. Test your output when migrating from a full-SSML provider like Google Cloud TTS, Amazon Polly, or Microsoft Azure.
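Because the API does not report stripped tags, a local check can flag them before a request is sent. The sketch below is one possible approach, assuming simple well-formed tags: `find_unsupported` is a hypothetical helper that lists every opening tag other than <spell>, plus any <prosody> attribute other than rate.

```python
import re

# Opening or self-closing tags only; closing tags such as </spell> do not match.
TAG_RE = re.compile(r"<\s*([A-Za-z][\w-]*)([^<>]*)>")
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"[^"]*"')


def find_unsupported(text: str) -> list:
    """List tags and prosody attributes that would be stripped silently."""
    problems = []
    for match in TAG_RE.finditer(text):
        name, attrs = match.group(1).lower(), match.group(2)
        if name == "spell":
            continue
        if name == "prosody":
            # Only the rate attribute is processed; report everything else.
            for attr in ATTR_RE.findall(attrs):
                if attr.lower() != "rate":
                    problems.append(f"prosody {attr}")
            continue
        problems.append(name)
    return problems


ssml = '<speak>Hi <emphasis>there</emphasis><break time="1s"/></speak>'
print(find_unsupported(ssml))
# ['speak', 'emphasis', 'break']
```

Running a check like this over migrated content, for example in a test suite, gives a list of places where the spoken result will differ from the previous provider.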