KugelAudio provides text processing features to ensure your text is spoken naturally. This includes automatic normalization of numbers, dates, and currencies, as well as the ability to spell out text letter by letter.
Text Normalization
Text normalization converts numbers, dates, times, and other non-verbal text into spoken words:
- “I have 3 apples” → “I have three apples”
- “The meeting is at 2:30 PM” → “The meeting is at two thirty PM”
- “€50.99” → “fifty euros and ninety-nine cents”
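To make the idea concrete, here is a minimal sketch of the kind of rewrite normalization performs, limited to English integers from 0 to 99. It is a toy illustration and not KugelAudio's implementation; the names `number_to_words` and `normalize_integers` are hypothetical.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer between 0 and 99 in English (toy scope)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0-99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize_integers(text: str) -> str:
    """Replace standalone one- or two-digit numbers with words.

    Longer digit runs are left untouched because they are outside
    the scope of this sketch.
    """
    pattern = r"(?<!\d)\d{1,2}(?!\d)"
    return re.sub(pattern, lambda m: number_to_words(int(m.group())), text)


print(normalize_integers("I have 3 apples"))  # I have three apples
```

A production normalizer also has to handle dates, times, currencies, and language-specific rules, which is why the service does this for you.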
Enable normalization by setting normalize=True (Python) or normalize: true (JavaScript):
# With explicit language (recommended - fastest)
audio = client.tts.generate(
    text="I bought 3 items for €50.99 on 01/15/2024.",
    normalize=True,
    language="en",
)

# With auto-detection (adds ~150ms latency)
audio = client.tts.generate(
    text="Ich habe 3 Artikel für 50,99€ gekauft.",
    normalize=True,
    # language not specified - will auto-detect
)

// With explicit language (recommended - fastest)
const audio = await client.tts.generate({
  text: 'I bought 3 items for €50.99 on 01/15/2024.',
  normalize: true,
  language: 'en',
});

// With auto-detection (adds ~150ms latency)
const audio2 = await client.tts.generate({
  text: 'Ich habe 3 Artikel für 50,99€ gekauft.',
  normalize: true,
  // language not specified - will auto-detect
});

Using normalize without specifying language adds approximately 150ms latency for language auto-detection. For best performance in latency-sensitive applications, always specify the language parameter.
Supported Languages
| Code | Language | Code | Language |
|---|---|---|---|
| de | German | nl | Dutch |
| en | English | pl | Polish |
| fr | French | sv | Swedish |
| es | Spanish | da | Danish |
| it | Italian | no | Norwegian |
| pt | Portuguese | fi | Finnish |
| cs | Czech | hu | Hungarian |
| ro | Romanian | el | Greek |
| uk | Ukrainian | bg | Bulgarian |
| tr | Turkish | vi | Vietnamese |
| ar | Arabic | hi | Hindi |
| zh | Chinese | ja | Japanese |
| ko | Korean | | |
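Because auto-detection costs extra latency, it can help to resolve the language code once and pass it explicitly. The sketch below builds the keyword arguments for a normalized request and checks the code against the table above. The helper name `normalization_kwargs` is hypothetical, and whether the API itself rejects unlisted codes is an assumption not covered here.

```python
from typing import Optional

# Language codes copied from the "Supported Languages" table.
SUPPORTED_LANGUAGES = {
    "de", "en", "fr", "es", "it", "pt", "cs", "ro", "uk", "tr", "ar", "zh", "ko",
    "nl", "pl", "sv", "da", "no", "fi", "hu", "el", "bg", "vi", "hi", "ja",
}


def normalization_kwargs(language: Optional[str] = None) -> dict:
    """Build keyword arguments for a request with normalization enabled.

    With no language given, the 'language' key is omitted so that the
    service falls back to auto-detection (roughly 150 ms slower).
    """
    kwargs = {"normalize": True}
    if language is None:
        return kwargs
    code = language.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported normalization language: {language!r}")
    kwargs["language"] = code
    return kwargs


print(normalization_kwargs("EN"))
```

The result can be unpacked into the call shown earlier, for example `client.tts.generate(text=..., **normalization_kwargs("en"))`.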
Use <spell> tags to spell out text letter by letter. This is useful for email addresses, codes, acronyms, or any text that should be pronounced character by character.
Spell tags require normalize to be enabled.
# Spell out an email address
audio = client.tts.generate(
    text="Contact me at <spell>kajo@kugelaudio.com</spell>",
    normalize=True,
    language="en",
)
# Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"

# Spell out an acronym
audio = client.tts.generate(
    text="The <spell>API</spell> is easy to use.",
    normalize=True,
    language="en",
)
# Output: "The A, P, I is easy to use."

# German example with language-specific translations
audio = client.tts.generate(
    text="Meine E-Mail ist <spell>test@beispiel.de</spell>",
    normalize=True,
    language="de",
)
# Output: "Meine E-Mail ist T, E, S, T, ät, B, E, I, S, P, I, E, L, Punkt, D, E"

// Spell out an email address
const audio = await client.tts.generate({
  text: 'Contact me at <spell>kajo@kugelaudio.com</spell>',
  normalize: true,
  language: 'en',
});
// Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"

// Spell out an acronym
const audio2 = await client.tts.generate({
  text: 'The <spell>API</spell> is easy to use.',
  normalize: true,
  language: 'en',
});
// Output: "The A, P, I is easy to use."

// German example with language-specific translations
const audio3 = await client.tts.generate({
  text: 'Meine E-Mail ist <spell>test@beispiel.de</spell>',
  normalize: true,
  language: 'de',
});
// Output: "Meine E-Mail ist T, E, S, T, ät, B, E, I, S, P, I, E, L, Punkt, D, E"
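Since spell tags require normalize to be enabled, a small client-side guard can catch the mismatch before a request is sent. This is a sketch only: `check_spell_request` is a hypothetical helper, and how the service responds to spell tags when normalization is off is not described here.

```python
import re

# Matches either an opening or a closing spell tag.
SPELL_TAG = re.compile(r"</?spell>")


def check_spell_request(text: str, normalize: bool) -> None:
    """Raise if the text uses <spell> tags while normalization is disabled."""
    if SPELL_TAG.search(text) and not normalize:
        raise ValueError(
            "text contains <spell> tags, so normalize must be enabled"
        )


# Passes silently: tags present and normalization enabled.
check_spell_request("The <spell>API</spell> is easy to use.", normalize=True)
```

Calling the guard right before `client.tts.generate(...)` keeps the rule in one place instead of relying on every call site to remember it.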
Language-Specific Character Translations
Special characters within <spell> tags are translated based on the language:
| Character | English | German | French | Spanish |
|---|---|---|---|---|
| @ | at | ät | arobase | arroba |
| . | dot | Punkt | point | punto |
| - | dash | Strich | tiret | guión |
| _ | underscore | Unterstrich | underscore | guión bajo |
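For a rough local preview of what a spell tag will produce, the table can be encoded as a lookup. The sketch below is an approximation built only from the table and the output comments in the examples above; `preview_spelling` is a hypothetical helper, skipping whitespace is an assumption, and how digits are spoken is not covered.

```python
# Special-character names copied from the translation table.
SPELL_TRANSLATIONS = {
    "en": {"@": "at", ".": "dot", "-": "dash", "_": "underscore"},
    "de": {"@": "ät", ".": "Punkt", "-": "Strich", "_": "Unterstrich"},
    "fr": {"@": "arobase", ".": "point", "-": "tiret", "_": "underscore"},
    "es": {"@": "arroba", ".": "punto", "-": "guión", "_": "guión bajo"},
}


def preview_spelling(text: str, language: str = "en") -> str:
    """Approximate the spoken form of the content of a <spell> tag.

    Characters listed in the table are replaced by their name in the given
    language; every other character is upper-cased and kept as-is.
    """
    table = SPELL_TRANSLATIONS[language]
    return ", ".join(
        table.get(ch, ch.upper()) for ch in text if not ch.isspace()
    )


print(preview_spelling("test@beispiel.de", "de"))
```

Only the four languages shown in the table are included; other languages would need their own entries.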
Spell tags work seamlessly with streaming. When streaming text token-by-token (e.g., from an LLM), tags that span multiple chunks are automatically handled:
async with client.tts.streaming_session(
    voice_id=123,
    normalize=True,
    language="en",
) as session:
    # Even if the tag is split across tokens, it works correctly
    async for chunk in session.send("My code is <spell>"):
        play_audio(chunk.audio)
    async for chunk in session.send("ABC123</spell>"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

await client.tts.stream(
  {
    text: 'My verification code is <spell>ABC-123-XYZ</spell>.',
    normalize: true,
    language: 'en',
  },
  {
    onChunk: (chunk) => playAudio(chunk.audio),
  }
);
Streaming Safety: The system buffers text until the closing </spell> tag arrives before generating audio. If the stream ends unexpectedly, incomplete tags are auto-closed so the content still gets spelled out.
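The buffering behavior described above can be modeled in a few lines. The following is a simplified sketch of that logic, not the service's actual code: `SpellTagBuffer` is a hypothetical class, and it ignores edge cases such as a stream that ends in the middle of a closing tag.

```python
OPEN, CLOSE = "<spell>", "</spell>"


def _partial_open_suffix(text: str) -> int:
    """Length of the longest tail of text that could start an opening tag."""
    for k in range(min(len(OPEN) - 1, len(text)), 0, -1):
        if text.endswith(OPEN[:k]):
            return k
    return 0


class SpellTagBuffer:
    """Hold back text inside an unfinished <spell> tag until it closes."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> str:
        """Add a chunk and return the text that is safe to synthesize now."""
        self._pending += chunk
        ready = []
        while True:
            start = self._pending.find(OPEN)
            if start == -1:
                # No opening tag: keep only a tail that may become one.
                cut = len(self._pending) - _partial_open_suffix(self._pending)
                ready.append(self._pending[:cut])
                self._pending = self._pending[cut:]
                break
            end = self._pending.find(CLOSE, start)
            if end == -1:
                # Tag is still open: release the text before it, hold the rest.
                ready.append(self._pending[:start])
                self._pending = self._pending[start:]
                break
            end += len(CLOSE)
            ready.append(self._pending[:end])
            self._pending = self._pending[end:]
        return "".join(ready)

    def flush(self) -> str:
        """Return whatever is left, auto-closing an unfinished tag."""
        out, self._pending = self._pending, ""
        if out.startswith(OPEN):
            out += CLOSE
        return out


buf = SpellTagBuffer()
print(repr(buf.feed("My code is <spe")))   # text before the partial tag
print(repr(buf.feed("ll>ABC")))            # nothing released yet
print(repr(buf.feed("123</spell> done")))  # complete tag plus trailing text
```

The key design point is that nothing inside an open tag is released early, so the spelled-out content is always synthesized as one unit.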
Model recommendation: For clearer letter-by-letter pronunciation, use kugel-1 instead of kugel-1-turbo.
When integrating with language models, add instructions to your system prompt so the LLM wraps appropriate text in spell tags:
SYSTEM_PROMPT = """You are a helpful assistant. When you need to spell out text
(like email addresses, codes, or acronyms), wrap it in <spell> tags.
Examples:
- "My email is <spell>[email protected]</spell>"
- "The code is <spell>ABC123</spell>"
- "That stands for <spell>API</spell>, Application Programming Interface"
"""
For more details, see the LLM Integration guide.
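An LLM will not always follow the prompt, so a post-processing pass can act as a safety net. The sketch below wraps email addresses that the model left untagged and leaves already-tagged segments alone. It is one possible approach, not part of the KugelAudio SDK: `wrap_untagged_emails` is a hypothetical helper, the email pattern is deliberately simple, and the addresses are placeholders.

```python
import re

# Segments that are already wrapped in spell tags (captured so split keeps them).
TAGGED = re.compile(r"(<spell>.*?</spell>)", re.DOTALL)
# Deliberately simple email pattern; it does not cover every valid address.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def wrap_untagged_emails(text: str) -> str:
    """Wrap bare email addresses in <spell> tags, skipping tagged segments."""
    parts = TAGGED.split(text)
    # With one capture group, even indexes are untagged text and
    # odd indexes are the existing <spell>...</spell> segments.
    for i in range(0, len(parts), 2):
        parts[i] = EMAIL.sub(
            lambda m: f"<spell>{m.group()}</spell>", parts[i]
        )
    return "".join(parts)


print(wrap_untagged_emails("Reach me at jane@example.com."))
```

Because tagged segments are skipped, running the function twice gives the same result as running it once.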