Basic Generation

Generate complete audio and receive it all at once:
audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-3",          # Canonical production model (see /models)
    voice_id=1071,               # Optional: specific voice ID
    cfg_scale=2.0,               # Guidance scale (1.0-5.0)
    temperature=None,            # Sampling variance 0.0-1.0; None = server default (~0.5)
    max_new_tokens=2048,         # Maximum tokens to generate
    sample_rate=24000,           # Output sample rate
    normalize=True,              # Enable text normalization (default)
    language="en",               # Language for normalization (see /sdks/python/normalization)
    word_timestamps=False,       # Request word-level timestamps (default: False)
    speed=1.0,                   # Playback speed 0.8-1.2 (pitch-preserving WSOLA)
)

# Audio properties
print(f"Duration: {audio.duration_seconds:.2f}s")
print(f"Samples: {audio.samples}")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Generation time: {audio.generation_ms:.0f}ms")
print(f"RTF: {audio.rtf:.2f}")  # Real-time factor

# Save to WAV file
audio.save("output.wav")

# Get raw PCM bytes
pcm_data = audio.audio

# Get WAV bytes (with header)
wav_bytes = audio.to_wav_bytes()

# Get float32 samples in [-1.0, 1.0]
samples = audio.to_float32()

# Save raw PCM instead of WAV
audio.save("output.pcm", format="raw")
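
For working with the raw PCM bytes outside the SDK, the sketch below shows roughly what the conversions above amount to, using only the standard library. It assumes the raw PCM is 16-bit signed little-endian mono and that RTF means generation time divided by audio duration; neither is confirmed on this page. The helper names are hypothetical and not part of the SDK.

```python
import io
import struct
import wave


def pcm16_to_float32(pcm: bytes) -> list[float]:
    """Convert 16-bit signed little-endian PCM bytes to floats (assumed format)."""
    count = len(pcm) // 2
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    # Dividing by 32768 maps the int16 range onto [-1.0, 1.0).
    return [s / 32768.0 for s in ints]


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw PCM in a minimal WAV container held in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Audio duration implied by a sample count and a sample rate."""
    return num_samples / sample_rate


def real_time_factor(generation_ms: float, audio_seconds: float) -> float:
    """Assumed definition: generation time / audio duration (below 1.0 is faster than real time)."""
    return (generation_ms / 1000.0) / audio_seconds
```

Under these assumptions, 48,000 samples at 24,000 Hz correspond to 2 seconds of audio, and generating them in 500 ms gives an RTF of 0.25.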

Generation parameters

These parameters are accepted by generate(), generate_async(), stream(), and stream_async().
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | `str` | required | Text to synthesize. Supports `<break time="..."/>` and `<spell>` tags. |
| `model_id` | `str` | `"kugel-3"` | TTS model. See Models. |
| `voice_id` | `int \| None` | `None` | Voice to use. Omit for the model default. |
| `cfg_scale` | `float` | `2.0` | Classifier-free guidance scale (1.0–5.0). Higher values track the reference voice more tightly. |
| `temperature` | `float \| None` | `None` | Sampling variance in [0.0, 1.0]. `None` uses the server default (~0.5). 0.0 is most stable (near-greedy); lower values give more consistent reads across regenerations. |
| `max_new_tokens` | `int` | `2048` | Maximum tokens to generate. |
| `sample_rate` | `int` | `24000` | Output sample rate in Hz. |
| `output_format` | `str \| None` | `None` | Combined codec + rate token. Supported native tokens: `pcm_8000`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `ulaw_8000`, `alaw_8000`. When set, it must not contradict `sample_rate`. |
| `normalize` | `bool` | `True` | Enable text normalization (numbers, dates, etc. → spoken words). |
| `language` | `str \| None` | `None` | ISO 639-1 code for normalization. Always set it when known, to skip language auto-detection; see Latency. |
| `word_timestamps` | `bool` | `False` | Request per-word time alignments. |
| `speed` | `float` | `1.0` | Playback speed multiplier (0.8 = slower, 1.2 = faster). Uses pitch-preserving WSOLA time-stretching; `<prosody rate="...">` spans in the text override it per span. See Speed. |
| `dictionary_ids` | `list[int] \| None` | `None` | Per-request dictionary selection. `None` = all active project dictionaries (language-filtered); `[]` = none; a list = exactly those dictionaries (including inactive ones), bypassing the language filter. |
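
The documented ranges can be checked on the client before a request is sent. The following is a minimal sketch of such a pre-check, not part of the SDK: `check_tts_params` is a hypothetical helper, and it assumes that `output_format` "contradicts" `sample_rate` only when a rate is passed explicitly and differs from the rate in the token. How the server treats an omitted `sample_rate` is not stated here.

```python
from typing import Optional

# Native tokens listed in the parameter table above.
NATIVE_OUTPUT_FORMATS = {
    "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000", "ulaw_8000", "alaw_8000",
}


def check_tts_params(
    cfg_scale: float = 2.0,
    temperature: Optional[float] = None,
    speed: float = 1.0,
    sample_rate: Optional[int] = None,  # None here means "not passed explicitly"
    output_format: Optional[str] = None,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not 1.0 <= cfg_scale <= 5.0:
        problems.append(f"cfg_scale {cfg_scale} is outside 1.0-5.0")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append(f"temperature {temperature} is outside 0.0-1.0")
    if not 0.8 <= speed <= 1.2:
        problems.append(f"speed {speed} is outside 0.8-1.2")
    if output_format is not None:
        if output_format not in NATIVE_OUTPUT_FORMATS:
            problems.append(f"output_format {output_format!r} is not a listed native token")
        elif sample_rate is not None:
            # The token ends in its rate, e.g. "pcm_16000" -> 16000.
            token_rate = int(output_format.rsplit("_", 1)[1])
            if token_rate != sample_rate:
                problems.append(
                    f"output_format {output_format!r} contradicts sample_rate {sample_rate}"
                )
    return problems
```

Calling `check_tts_params(output_format="pcm_16000", sample_rate=24000)` returns one problem describing the mismatch, while the defaults return an empty list.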

Async Generation

import asyncio

async def main():
    audio = await client.tts.generate_async(
        text="Async generation example.",
        model_id="kugel-3",
    )
    audio.save("async_output.wav")

asyncio.run(main())
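
One reason to use the async variant is to run several generations concurrently. The sketch below illustrates the pattern with `asyncio.gather` and a semaphore that caps the number of requests in flight. To keep it runnable without network access, `fake_generate_async` is a hypothetical stand-in for `client.tts.generate_async`, and the concurrency limit of 2 is an arbitrary choice rather than a documented quota.

```python
import asyncio


async def fake_generate_async(text: str) -> str:
    """Hypothetical stand-in for client.tts.generate_async; returns a label, not audio."""
    await asyncio.sleep(0.01)  # simulate network and generation latency
    return f"audio for: {text}"


async def generate_many(texts: list[str], max_concurrent: int = 2) -> list[str]:
    """Run one generation per text, with at most max_concurrent running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(text: str) -> str:
        async with semaphore:
            return await fake_generate_async(text)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(t) for t in texts))


results = asyncio.run(generate_many(["One.", "Two.", "Three."]))
```

With the real client, the body of `worker` would await `client.tts.generate_async(text=text, model_id="kugel-3")` instead of the stand-in.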

Word Timestamps with Generate

Request word-level time alignments alongside audio when using generate():
audio = client.tts.generate(
    text="Hello, how are you today?",
    model_id="kugel-3",
    word_timestamps=True,
)

# Access word timestamps from the response
for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

# Example output:
# Hello: 0ms - 320ms (score: 0.98)
# how: 350ms - 480ms (score: 0.95)
# are: 500ms - 580ms (score: 0.97)
# you: 600ms - 720ms (score: 0.96)
# today: 750ms - 1100ms (score: 0.94)
Word timestamps are also available with async generation:
import asyncio

async def main():
    audio = await client.tts.generate_async(
        text="Hello, world!",
        model_id="kugel-3",
        word_timestamps=True,
    )

    for ts in audio.word_timestamps:
        print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

asyncio.run(main())
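
A common use of word-level alignments is subtitle generation. The following sketch groups words into SRT cues. `WordTimestamp` is a hypothetical stand-in that mirrors only the fields used above (`word`, `start_ms`, `end_ms`, `score`), and the three-words-per-cue grouping is an arbitrary illustration rather than anything the SDK prescribes.

```python
from dataclasses import dataclass


@dataclass
class WordTimestamp:
    """Hypothetical stand-in mirroring the fields printed in the examples above."""
    word: str
    start_ms: int
    end_ms: int
    score: float = 1.0


def format_srt_time(ms: int) -> str:
    """Format milliseconds as the HH:MM:SS,mmm timestamp used by SRT."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(timestamps: list[WordTimestamp], words_per_cue: int = 3) -> str:
    """Group consecutive words into numbered SRT cues."""
    cues = []
    for i in range(0, len(timestamps), words_per_cue):
        group = timestamps[i : i + words_per_cue]
        start = format_srt_time(group[0].start_ms)
        end = format_srt_time(group[-1].end_ms)
        text = " ".join(t.word for t in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Sample data taken from the example output earlier in this section.
words = [
    WordTimestamp("Hello", 0, 320, 0.98),
    WordTimestamp("how", 350, 480, 0.95),
    WordTimestamp("are", 500, 580, 0.97),
    WordTimestamp("you", 600, 720, 0.96),
    WordTimestamp("today", 750, 1100, 0.94),
]
srt_text = to_srt(words)
```

With a real response, `audio.word_timestamps` could be passed to `to_srt` directly, provided its items expose the same attributes.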

Models

List Available Models

models = client.models.list()

for model in models:
    print(f"{model.id}: {model.name}")
    print(f"  Description: {model.description}")
    print(f"  Max Input: {model.max_input_length} characters")
    print(f"  Sample Rate: {model.sample_rate} Hz")
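
Because each model reports a maximum input length in characters, longer texts need to be split before synthesis. The sketch below shows one naive way to do that, assuming the limit applies per request and that splitting at sentence-ending punctuation is acceptable. `split_for_model` is a hypothetical helper, not an SDK function.

```python
import re


def split_for_model(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # the sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `generate()` separately, using `model.max_input_length` from the listing above as `max_chars`.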

Next steps