Generate Speech - KugelAudio

Basic Generation

Generate the complete audio and receive it in a single response. The examples on this page assume an already-initialized `client`:

audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-3",          # Canonical production model (see /models)
    voice_id=1071,               # Optional: specific voice ID
    cfg_scale=2.0,               # Guidance scale (1.0-5.0)
    temperature=None,            # Sampling variance 0.0-1.0; None = server default (~0.5)
    max_new_tokens=2048,         # Maximum tokens to generate
    sample_rate=24000,           # Output sample rate
    normalize=True,              # Enable text normalization (default)
    language="en",               # Language for normalization (see /sdks/python/normalization)
    word_timestamps=False,       # Request word-level timestamps (default: False)
    speed=1.0,                   # Playback speed 0.8-1.2 (pitch-preserving WSOLA)
)

# Audio properties
print(f"Duration: {audio.duration_seconds:.2f}s")
print(f"Samples: {audio.samples}")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Generation time: {audio.generation_ms:.0f}ms")
print(f"RTF: {audio.rtf:.2f}")  # Real-time factor

# Save to WAV file
audio.save("output.wav")

# Get raw PCM bytes
pcm_data = audio.audio

# Get WAV bytes (with header)
wav_bytes = audio.to_wav_bytes()

# Get float32 samples in [-1.0, 1.0]
samples = audio.to_float32()

# Save raw PCM instead of WAV
audio.save("output.pcm", format="raw")
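The raw bytes in `audio.audio` carry no header, so whoever consumes them has to know the sample format. As a rough illustration, the sketch below wraps raw PCM in a WAV container using only the standard library. It assumes 16-bit signed little-endian mono samples, which this page does not confirm; `pcm16_to_wav_bytes` and `pcm16_duration_seconds` are hypothetical helpers, not part of the SDK.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw PCM in a WAV container.

    Assumption: samples are 16-bit signed little-endian integers.
    """
    if len(pcm) % (2 * channels) != 0:
        raise ValueError("PCM length is not a whole number of 16-bit frames")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm16_duration_seconds(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> float:
    """Duration implied by the byte count under the same 16-bit assumption."""
    return len(pcm) / (2 * channels * sample_rate)
```

In practice `audio.to_wav_bytes()` and `audio.duration_seconds` already cover this; the sketch is only meant to show what the raw bytes would need if they are handled outside the SDK.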

Generation parameters

These parameters are accepted by generate(), generate_async(), stream(), and stream_async().

Parameter	Type	Default	Description
`text`	`str`	required	Text to synthesize. Supports `<break time="..."/>` and `<spell>` tags.
`model_id`	`str`	`"kugel-3"`	TTS model. See Models.
`voice_id`	`int \| None`	`None`	Voice to use. Omit for the model default.
`cfg_scale`	`float`	`2.0`	Classifier-free guidance scale (1.0–5.0). Higher values track the reference voice more tightly.
`temperature`	`float \| None`	`None`	Sampling variance in [0.0, 1.0]. `None` uses the server default (~0.5). `0.0` is most stable (near-greedy); lower values give more consistent reads across regenerations.
`max_new_tokens`	`int`	`2048`	Maximum tokens to generate.
`sample_rate`	`int`	`24000`	Output sample rate in Hz.
`output_format`	`str \| None`	`None`	Combined codec + rate token. Supported native tokens: `pcm_8000`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `ulaw_8000`, `alaw_8000`. When set, it must not contradict `sample_rate`.
`normalize`	`bool`	`True`	Enable text normalization (numbers, dates, etc. → spoken words).
`language`	`str \| None`	`None`	ISO 639-1 code for normalization. Always set when known to skip language auto-detection — see Latency.
`word_timestamps`	`bool`	`False`	Request per-word time alignments.
`speed`	`float`	`1.0`	Playback speed multiplier (0.8 = slower, 1.2 = faster). Uses pitch-preserving WSOLA time-stretching; `<prosody rate="...">` spans in the text override it per span — see Speed.
`dictionary_ids`	`list[int] \| None`	`None`	Per-request dictionary selection. `None` = all active project dictionaries (language-filtered); `[]` = none; a list = exactly those dictionaries (including inactive ones), bypassing the language filter.
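The ranges in the table can be checked on the client side before a request is sent. The following is a minimal sketch, not SDK behavior: `check_generation_params` is a hypothetical helper, and it only compares `output_format` against `sample_rate` when both are passed explicitly, since the page does not say how the server treats the `sample_rate` default.

```python
from typing import List, Optional

# Native tokens listed in the parameter table above.
NATIVE_OUTPUT_FORMATS = {
    "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000", "ulaw_8000", "alaw_8000",
}


def check_generation_params(
    *,
    cfg_scale: float = 2.0,
    temperature: Optional[float] = None,
    speed: float = 1.0,
    sample_rate: Optional[int] = None,
    output_format: Optional[str] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not 1.0 <= cfg_scale <= 5.0:
        problems.append(f"cfg_scale {cfg_scale} is outside 1.0-5.0")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append(f"temperature {temperature} is outside 0.0-1.0")
    if not 0.8 <= speed <= 1.2:
        problems.append(f"speed {speed} is outside 0.8-1.2")
    if output_format is not None:
        if output_format not in NATIVE_OUTPUT_FORMATS:
            problems.append(f"output_format {output_format!r} is not a listed token")
        elif sample_rate is not None:
            # The rate is the numeric suffix of the token, e.g. "pcm_16000".
            token_rate = int(output_format.rsplit("_", 1)[1])
            if token_rate != sample_rate:
                problems.append(
                    f"output_format {output_format!r} contradicts sample_rate {sample_rate}"
                )
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every out-of-range value at once.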

Async Generation

import asyncio

async def main():
    audio = await client.tts.generate_async(
        text="Async generation example.",
        model_id="kugel-3",
    )
    audio.save("async_output.wav")

asyncio.run(main())
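Async generation is mostly useful when several texts are synthesized at once. The sketch below caps the number of in-flight requests with a semaphore. `generate_many` is a hypothetical helper, and whether a single client supports concurrent `generate_async()` calls is an assumption this page does not confirm.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def generate_many(
    generate: Callable[[str], Awaitable[T]],
    texts: Sequence[str],
    max_concurrency: int = 4,
) -> List[T]:
    """Run `generate` for every text, at most `max_concurrency` at a time.

    Results come back in the same order as `texts`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(text: str) -> T:
        async with semaphore:
            return await generate(text)

    return await asyncio.gather(*(run_one(text) for text in texts))
```

With the SDK, `generate` could be something like `lambda t: client.tts.generate_async(text=t, model_id="kugel-3")`, called from inside an `async` function.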

Word Timestamps with Generate

Request word-level time alignments alongside audio when using generate():

audio = client.tts.generate(
    text="Hello, how are you today?",
    model_id="kugel-3",
    word_timestamps=True,
)

# Access word timestamps from the response
for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

# Example output:
# Hello: 0ms - 320ms (score: 0.98)
# how: 350ms - 480ms (score: 0.95)
# are: 500ms - 580ms (score: 0.97)
# you: 600ms - 720ms (score: 0.96)
# today: 750ms - 1100ms (score: 0.94)

Word timestamps are also available with async generation:

import asyncio

async def main():
    # `await` is only valid inside a coroutine
    audio = await client.tts.generate_async(
        text="Hello, world!",
        model_id="kugel-3",
        word_timestamps=True,
    )

    for ts in audio.word_timestamps:
        print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

asyncio.run(main())

Models

List Available Models

models = client.models.list()

for model in models:
    print(f"{model.id}: {model.name}")
    print(f"  Description: {model.description}")
    print(f"  Max Input: {model.max_input_length} characters")
    print(f"  Sample Rate: {model.sample_rate} Hz")
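Since each model reports a `max_input_length` in characters, longer texts have to be split before they are sent. The sketch below splits at sentence boundaries and falls back to a hard cut for a single sentence that exceeds the limit. `split_for_model` is a hypothetical helper, and it assumes the server counts characters the same way Python's `len()` does.

```python
import re
from typing import List


def split_for_model(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most `max_chars`, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single oversized sentence is cut at the limit as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `generate()` separately, for example with `max_chars=model.max_input_length`.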

Next steps

Streaming — receive audio chunks as they are generated
Text Normalization — languages and spell tags
Types & Errors — AudioResponse, WordTimestamp, and exceptions
