KugelAudio generates speech directly from text. There is no voice direction layer — you shape the output by how you write the input. This section covers every mechanism available to control pronunciation, pacing, and emphasis.

Supported controls at a glance

| Control | Syntax | Page |
| --- | --- | --- |
| Pauses | <break time="300ms"/>, <break strength="medium"/>, <break/> | Breaks |
| Speed | speed request parameter (0.8 to 1.2, whole request) | Speed |
| Spell out characters | <spell>text</spell>, <spell group="2">…</spell> | Spell tags |
| Custom pronunciation | Inline IPA, e.g. /ˈkuːɡl̩/, or pronunciation dictionaries | Pronunciation & IPA |
| Pacing & intonation | Plain punctuation (see below) | this page |

<break> and <spell> are the main tags processed in request text; <prosody rate> is also accepted for per-span speed. Other markup, including most SSML, is stripped before synthesis; see Unsupported tags.

Punctuation as pacing

The model respects natural punctuation cues — no special tags needed:
| Technique | Effect |
| --- | --- |
| , comma | Brief pause between clauses |
| . period | Sentence-end pause, falling intonation |
| ellipsis | Longer trailing pause |
| em dash | Abrupt pause / interruption feel |
| ? question mark | Rising intonation |
| ! exclamation | Energetic delivery |
| \n newline | Paragraph-level pause (similar to period) |

Punctuation is the recommended way to add natural rhythm; reach for <break> tags when you need an explicit silence of a specific length (e.g. before a verification code).
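
As a minimal sketch of the explicit-silence case, the helpers below build the <break> and <spell> syntax shown in the table at the top of this page. The function names (`break_tag`, `spell`, `verification_message`) and the 500 ms duration are illustrative assumptions, not part of any KugelAudio SDK; only the tag syntax comes from this page.

```python
from typing import Optional


def break_tag(ms: Optional[int] = None) -> str:
    """Return a <break/> tag, optionally with an explicit duration in milliseconds."""
    if ms is None:
        return "<break/>"
    if ms <= 0:
        raise ValueError("break duration must be positive")
    return f'<break time="{ms}ms"/>'


def spell(text: str, group: Optional[int] = None) -> str:
    """Wrap text in a <spell> tag so it is read character by character."""
    attr = f' group="{group}"' if group is not None else ""
    return f"<spell{attr}>{text}</spell>"


def verification_message(code: str) -> str:
    """Hypothetical example: pause briefly, then spell out a verification code."""
    return f"Your verification code is {break_tag(500)} {spell(code)}."


if __name__ == "__main__":
    print(verification_message("ABC-123"))
```

Ordinary sentence rhythm is still best left to punctuation; helpers like these are only worth reaching for where a silence of a specific length matters.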

Writing tips

  • Strip markdown before TTS. Asterisks, hashes, and bullet characters are read literally by the model.
  • No emoji. They are read out or garbled.
  • Write numbers as digits when they should be normalized (“You have 3 messages”) and always set language — see Text processing.
  • Keep sentences short and end them with punctuation. This also helps the streaming chunker start generation earlier.
  • !, ALL-CAPS, and ?! are prosody cues — the model will deliver them energetically. Use deliberately.
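
The first two tips can be applied with a small pre-processing step before text is sent for synthesis. The following is a rough sketch, not an official utility: `clean_for_tts` is a hypothetical name, the markdown patterns cover only common cases (headings, bullets, block quotes, emphasis markers, inline links), and the emoji ranges are an approximation rather than a complete list.

```python
import re

# Approximate emoji coverage (an assumption, not exhaustive): pictograph blocks,
# regional-indicator flags, misc symbols and dingbats, variation selector, ZWJ.
_EMOJI = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF\uFE0F\u200D]"
)
# [label](url) -> label
_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")
# Heading hashes, bullet characters, and block-quote markers at the start of a line.
_LINE_PREFIX = re.compile(r"^[ \t]*(?:#{1,6}[ \t]+|[-*+•][ \t]+|>[ \t]?)", re.MULTILINE)
# Inline emphasis, code, and strikethrough markers.
_INLINE = re.compile(r"\*\*|__|\*|`|~~")


def clean_for_tts(text: str) -> str:
    """Strip common markdown markers and emoji; leave <break>/<spell> tags untouched."""
    text = _LINK.sub(r"\1", text)
    text = _LINE_PREFIX.sub("", text)
    text = _INLINE.sub("", text)
    text = _EMOJI.sub("", text)
    text = re.sub(r"[ \t]{2,}", " ", text)  # collapse gaps left by removals
    return "\n".join(line.strip() for line in text.splitlines()).strip()


if __name__ == "__main__":
    print(clean_for_tts("# Welcome\n- **Hello** there \U0001F389"))
```

Removing every single asterisk is a blunt choice: it would also alter text such as "2 * 3", so the patterns should be adjusted to the content you actually send.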

LLM system prompt pattern

When an LLM generates text that feeds directly into TTS, add instructions so it uses the supported controls correctly:
You are a voice assistant. Format your responses for text-to-speech output:

- For email addresses and codes, use <spell> tags:
    "Your code is <spell>ABC-123</spell>"
- For a deliberate pause, use a break tag:
    "Your total is <break time="400ms"/> forty-two euros."
- Do NOT use markdown formatting (**, *, #, -, bullet points) — it will be read aloud literally.
- Do NOT use emoji.
- Do NOT use SSML tags other than <spell> and <break> — they are ignored.
- Keep sentences short. End with punctuation.
- Write numbers as digits when they should be normalized: "You have 3 messages."
For full voice-agent prompt design (turn-taking, error recovery, tool-call acknowledgements), see Voice Agent Prompting.

Unsupported tags

KugelAudio processes <spell>, <break>, and <prosody rate>. Most other tags are silently stripped: the inner text is kept but the tag itself has no effect. The exceptions are <prosody pitch> and <prosody volume>, which are rejected with a 400 error.
| Tag / Attribute | Status | Alternative |
| --- | --- | --- |
| <speak> wrapper | Stripped | Omit; plain text is assumed |
| <prosody rate="..."> | Supported (per-span speed) | See Speed |
| <prosody pitch="..."> | Rejected (400) | No pitch control available |
| <prosody volume="..."> | Rejected (400) | No volume control available |
| <emphasis> | Stripped | Rephrase text for natural emphasis |
| <say-as interpret-as="..."> | Stripped | Use <spell> for characters, normalization for numbers |
| <sub alias="..."> | Stripped | Write the spoken form directly, or use a dictionary |
| <phoneme> | Stripped | Write inline IPA between slashes (/ˈkuːɡl̩/) |
| <audio>, <p>, <s>, <w>, <lang> | Stripped | |

Unknown tags are not validated at request time. Apart from the <prosody pitch> and <prosody volume> attributes listed above, passing unsupported tags does not return an error: the tags are removed and the remaining text is synthesized. Test your output when migrating from a full-SSML provider such as Google Cloud TTS, Amazon Polly, or Microsoft Azure.
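
When migrating existing SSML, a client-side audit can show in advance which tags would have an effect. The sketch below classifies opening tags according to the table above; `audit_tags` is a hypothetical helper, the regex is a loose approximation of tag syntax rather than a real parser, and treating a bare <prosody> without rate, pitch, or volume as stripped is an assumption this page does not confirm.

```python
import re
from typing import Dict, List

# Loose match for opening/closing tags; not a full XML parser.
_TAG = re.compile(r"<\s*(/?)\s*([A-Za-z][\w-]*)([^<>]*)>")

PROCESSED = {"spell", "break"}
REJECTED_PROSODY_ATTRS = ("pitch", "volume")


def audit_tags(text: str) -> Dict[str, List[str]]:
    """Group the opening tags found in text by how this page says they are handled."""
    report: Dict[str, List[str]] = {"processed": [], "stripped": [], "rejected": []}
    for match in _TAG.finditer(text):
        closing, name, attrs = match.group(1), match.group(2).lower(), match.group(3).lower()
        if closing:
            continue  # each tag is reported once, via its opening form
        if name in PROCESSED:
            bucket = "processed"
        elif name == "prosody":
            if any(re.search(rf"\b{attr}\s*=", attrs) for attr in REJECTED_PROSODY_ATTRS):
                bucket = "rejected"   # pitch/volume: listed as Rejected (400)
            elif re.search(r"\brate\s*=", attrs):
                bucket = "processed"  # rate: listed as supported per-span speed
            else:
                bucket = "stripped"   # assumption: bare <prosody> behaves like other tags
        else:
            bucket = "stripped"
        report[bucket].append(match.group(0))
    return report


if __name__ == "__main__":
    sample = '<speak>Hi <emphasis>there</emphasis> <prosody pitch="high">you</prosody></speak>'
    print(audit_tags(sample))
```

Anything reported under "stripped" is worth rewriting with punctuation, <spell>, or inline IPA, and anything under "rejected" should be removed before the request is sent.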

Next steps

  • Breaks: explicit pauses with break tags
  • Speed: the global speed parameter
  • Spell tags: character-by-character pronunciation
  • Pronunciation & IPA: fix how specific words are spoken