Generate complete audio from text. This is the simplest way to get started: provide text and receive audio back.

Basic Generation

from kugelaudio import KugelAudio

client = KugelAudio(api_key="your_api_key")

audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-1-turbo",
)

# Save to file
audio.save("output.wav")

# Or get WAV bytes
wav_bytes = audio.to_wav_bytes()

Generation Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | required | The text to synthesize |
| model_id / modelId | string | kugel-1-turbo | kugel-1-turbo (fast) or kugel-1 (quality) |
| voice_id / voiceId | int | - | Specific voice to use |
| cfg_scale / cfgScale | float | 2.0 | Guidance scale (1.0-5.0) |
| max_new_tokens / maxNewTokens | int | 2048 | Maximum tokens to generate |
| sample_rate / sampleRate | int | 24000 | Output sample rate (8000, 16000, 22050, 24000) |
| normalize | bool | true | Enable text normalization |
| language | string | - | Language for normalization (ISO 639-1 code) |
| word_timestamps / wordTimestamps | bool | false | Request word-level timestamps |
| speed | float | 1.0 | Playback speed multiplier, from 0.8 (slower) to 1.2 (faster). Uses pitch-preserving WSOLA. |
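
The ranges in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: `validate_generation_params` is a hypothetical helper, and whether the API rejects, clamps, or ignores out-of-range values is not stated here.

```python
# Allowed values copied from the parameter table above.
ALLOWED_MODELS = {"kugel-1-turbo", "kugel-1"}
ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 24000}


def validate_generation_params(text, model_id="kugel-1-turbo", cfg_scale=2.0,
                               sample_rate=24000, speed=1.0):
    """Return a list of problems found; an empty list means all values are in range."""
    problems = []
    if not text or not text.strip():
        problems.append("text is required and must not be empty")
    if model_id not in ALLOWED_MODELS:
        problems.append(f"model_id {model_id!r} is not one of {sorted(ALLOWED_MODELS)}")
    if not 1.0 <= cfg_scale <= 5.0:
        problems.append(f"cfg_scale {cfg_scale} is outside 1.0-5.0")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sample_rate {sample_rate} is not one of {sorted(ALLOWED_SAMPLE_RATES)}")
    if not 0.8 <= speed <= 1.2:
        problems.append(f"speed {speed} is outside 0.8-1.2")
    return problems


# Hypothetical usage: check the values before calling client.tts.generate(**params).
params = {"text": "Hello", "cfg_scale": 6.0, "sample_rate": 44100}
for problem in validate_generation_params(**params):
    print(problem)
```

Returning a list instead of raising on the first problem lets a caller report every out-of-range value at once.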

CFG Scale Guide

The cfg_scale parameter controls how closely the model follows the voice characteristics:

| Range | Style | Best For |
| --- | --- | --- |
| 1.0-1.5 | Relaxed, natural | Conversational AI, long-form narration |
| 2.0 | Balanced (default) | General purpose |
| 2.5-3.0 | Expressive | Storytelling, emphasis-heavy content |
| 3.5-5.0 | Maximum expression | Character voices, dramatic readings |

Speed Control

The speed parameter adjusts playback rate using pitch-preserving time-stretching (WSOLA), so the voice pitch stays natural even at different speeds. Range: 0.8 (20% slower) to 1.2 (20% faster).
Dashboard: The playground in the KugelAudio dashboard includes a Slow / Normal / Fast speed toggle next to the model selector. Changes are reflected live in the SDK code snippet shown below the generator.
For fine-grained control, use inline <prosody rate="..."> tags to slow down only specific parts of the text — useful for phone numbers, addresses, or other content that benefits from slower delivery:
# Global speed — whole sentence at 80% speed
audio = client.tts.generate(
    text="Bitte rufen Sie uns an unter: 0 30 12 34 56 78.",
    language="de",
    speed=0.8,
)

# Inline prosody — only the phone number slowed down
audio = client.tts.generate(
    text='Bitte rufen Sie uns an unter: <prosody rate="slow">0 30 12 34 56 78.</prosody>',
    language="de",
)

| speed value | Rate | Typical use |
| --- | --- | --- |
| 0.8 | 20% slower | Phone numbers, addresses, medical terms |
| 1.0 | Normal (default) | General purpose |
| 1.2 | 20% faster | Notifications, fast-paced content |

Use <prosody rate="slow"> or <prosody rate="fast"> inline tags to vary speed within a single sentence without needing multiple API calls.
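
Because speed is a rate multiplier, its effect on clip length is the inverse of the value. The sketch below works through the arithmetic, assuming the output duration is simply the unstretched duration divided by speed; `stretched_duration` is a hypothetical helper, and any interaction with inline prosody tags is ignored.

```python
def stretched_duration(base_seconds, speed):
    """Estimate clip length after time-stretching, assuming duration = base / speed."""
    if not 0.8 <= speed <= 1.2:
        raise ValueError("speed must be between 0.8 and 1.2")
    return base_seconds / speed


# A clip that lasts 10 s at speed=1.0:
slow = stretched_duration(10.0, 0.8)  # about 12.5 s, i.e. 25% longer
fast = stretched_duration(10.0, 1.2)  # about 8.33 s, i.e. roughly 17% shorter
print(f"slow: {slow:.2f}s, fast: {fast:.2f}s")
```

Note that "20% slower" describes the playback rate, so the resulting audio is 25% longer rather than 20% longer. This matters when budgeting time for prompts with a fixed maximum length.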

Unsupported SSML Tags

KugelAudio supports a subset of SSML focused on <prosody rate> and <spell>. Full SSML is not supported: the following tags are silently stripped or will produce unexpected output.

| Tag / Attribute | Status | Alternative |
| --- | --- | --- |
| <speak> wrapper | Not supported | Omit; plain text is assumed |
| <prosody pitch="..."> | Not supported | No pitch control available |
| <prosody volume="..."> | Not supported | No volume control available |
| <prosody duration="..."> | Not supported | Use speed parameter instead |
| <emphasis> | Not supported | No emphasis tag processing |
| <break> | Not supported | Add punctuation (., ,) for natural pauses |
| <say-as> | Not supported | Use <spell> for character-by-character output |
| <audio>, <p>, <s>, <w> | Not supported | - |

Unsupported tags are not validated; they are stripped from the text before synthesis. If you pass <prosody pitch="high"> the pitch attribute is ignored and the inner text is synthesized at the default pitch. Always test output when migrating from a full-SSML TTS provider.
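
Since unsupported tags are dropped without an error, it can help to scan text on the client side when migrating existing SSML. The following is a rough sketch using regular expressions, not an SSML parser and not part of the SDK; `find_unsupported_ssml` is a hypothetical helper, and the tag lists are copied from the table above.

```python
import re

# Tags and <prosody> attributes listed as unsupported in the table above.
UNSUPPORTED_TAGS = {"speak", "emphasis", "break", "say-as", "audio", "p", "s", "w"}
UNSUPPORTED_PROSODY_ATTRS = {"pitch", "volume", "duration"}

# Matches opening or self-closing tags; closing tags such as </speak> are skipped.
TAG_RE = re.compile(r"<\s*([a-zA-Z][\w-]*)([^<>]*)>")
ATTR_RE = re.compile(r"([a-zA-Z][\w-]*)\s*=")


def find_unsupported_ssml(text):
    """Return the unsupported tags and prosody attributes found in text, in order."""
    findings = []
    for match in TAG_RE.finditer(text):
        tag = match.group(1).lower()
        if tag in UNSUPPORTED_TAGS:
            findings.append(f"<{tag}>")
        elif tag == "prosody":
            for attr in ATTR_RE.findall(match.group(2)):
                if attr.lower() in UNSUPPORTED_PROSODY_ATTRS:
                    findings.append(f"<prosody {attr.lower()}>")
    return findings


legacy = '<speak>Hi <break time="500ms"/> <prosody pitch="high" rate="slow">there</prosody></speak>'
print(find_unsupported_ssml(legacy))
```

A check like this only flags input for review; it does not rewrite the text, so the alternatives in the table still need to be applied by hand.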

Full Example with All Options

audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-1-turbo",
    voice_id=123,
    cfg_scale=2.0,
    max_new_tokens=2048,
    sample_rate=24000,
    normalize=True,
    language="en",
    word_timestamps=False,
    speed=1.0,
)

# Inspect the response
print(f"Duration: {audio.duration_seconds:.2f}s")
print(f"Samples: {audio.samples}")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Generation time: {audio.generation_ms:.0f}ms")
print(f"RTF: {audio.rtf:.2f}")

# Save to WAV file
audio.save("output.wav")

# Get raw PCM bytes
pcm_data = audio.audio

# Get WAV bytes (with header)
wav_bytes = audio.to_wav_bytes()
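
The fields printed above are related by simple arithmetic. The sketch below assumes that duration equals samples divided by sample rate and that RTF is the conventional real-time factor (generation time divided by audio duration). The SDK's exact definitions are not stated here, so treat both helper functions as illustrative.

```python
def duration_from_samples(samples, sample_rate):
    """Audio length in seconds, assuming `samples` counts samples per channel."""
    return samples / sample_rate


def real_time_factor(generation_ms, duration_s):
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    return (generation_ms / 1000.0) / duration_s


# Example numbers chosen for illustration, not measured values.
duration = duration_from_samples(72_000, 24_000)  # 3.0 seconds of audio
rtf = real_time_factor(600, duration)             # generated in 600 ms
print(f"Duration: {duration:.2f}s, RTF: {rtf:.2f}")  # prints "Duration: 3.00s, RTF: 0.20"
```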

Async Generation

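import asyncio

from kugelaudio import KugelAudio

client = KugelAudio(api_key="your_api_key")

async def main():
    # Await the async variant instead of blocking on generate()
    audio = await client.tts.generate_async(
        text="Async generation example.",
        model_id="kugel-1-turbo",
    )
    audio.save("async_output.wav")

asyncio.run(main())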
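
Async generation is most useful when several texts are synthesized at once. The following is a sketch of bounded concurrency with `asyncio.gather` and a semaphore. `generate_many` and `max_concurrency` are hypothetical names, `fake_generate` stands in for `client.tts.generate_async` so the block runs without the SDK, and whether a single client supports concurrent requests or enforces rate limits is not stated here.

```python
import asyncio


async def generate_many(generate, texts, max_concurrency=4):
    """Call generate(text=...) for each text, at most max_concurrency at a time.

    Results are returned in the same order as `texts`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(text):
        async with semaphore:
            return await generate(text=text)

    return await asyncio.gather(*(run_one(t) for t in texts))


# Stand-in for client.tts.generate_async (assumption: same keyword argument).
async def fake_generate(text):
    await asyncio.sleep(0.01)
    return f"audio for: {text}"


results = asyncio.run(generate_many(fake_generate, ["one", "two", "three"], max_concurrency=2))
print(results)
```

With the real client, `functools.partial(client.tts.generate_async, model_id="kugel-1-turbo")` could be passed as `generate`, assuming the method accepts `text` as a keyword argument as in the example above.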

Playing Audio in the Browser

The JavaScript SDK provides utility functions for audio playback:
import { KugelAudio, createWavBlob } from 'kugelaudio';

const client = new KugelAudio({ apiKey: 'your_api_key' });

const audio = await client.tts.generate({
  text: 'Hello, world!',
  modelId: 'kugel-1-turbo',
});

// Create WAV blob for playback
const wavBlob = createWavBlob(audio.audio, audio.sampleRate);
const url = URL.createObjectURL(wavBlob);

// Play with Audio element
const audioElement = new Audio(url);
audioElement.play();

// Or with Web Audio API
const audioContext = new AudioContext();
const arrayBuffer = await wavBlob.arrayBuffer();
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start();

Pre-connecting for Low Latency

For latency-sensitive applications, pre-establish the WebSocket connection at startup to eliminate cold start latency (~500ms) from your first request.
import asyncio
from kugelaudio import KugelAudio

async def main():
    # Create a pre-connected client (~500ms happens here)
    client = await KugelAudio.create(api_key="your_api_key")
    try:
        # First request is now fast (~100-150ms TTFA instead of ~600ms)
        audio = await client.tts.generate_async(
            text="Hello, world!",
            model_id="kugel-1-turbo",
        )
        audio.save("output.wav")
    finally:
        # Close the connection even if generation fails
        await client.aclose()

asyncio.run(main())
Without pre-connecting, the first TTS request includes WebSocket connection setup (~500ms). Subsequent requests reuse the connection and are fast (~100-150ms time to first audio, TTFA). Pre-connecting moves this overhead to application startup.

Word Timestamps

Request per-word time alignments alongside the generated audio. Useful for subtitles, karaoke, lip-sync, and barge-in handling.
audio = client.tts.generate(
    text="Hello, how are you today?",
    model_id="kugel-1-turbo",
    word_timestamps=True,
)

for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

# Output:
# Hello: 0ms - 320ms (score: 0.98)
# how: 350ms - 480ms (score: 0.95)
# are: 500ms - 580ms (score: 0.97)
# you: 600ms - 720ms (score: 0.96)
# today: 750ms - 1100ms (score: 0.94)
Word timestamps add no extra audio latency. For streaming use cases, see the Streaming Guide.
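
For the subtitle use case, the timestamps can be grouped into cues and written out in SRT format. The following is a minimal sketch: `WordTimestamp` is a stand-in dataclass that mirrors only the `word`, `start_ms`, and `end_ms` fields shown above, and `words_to_srt` with its `words_per_cue` grouping is a hypothetical helper, not an SDK function.

```python
from dataclasses import dataclass


@dataclass
class WordTimestamp:
    # Stand-in for the SDK's timestamp objects; only the fields used here.
    word: str
    start_ms: int
    end_ms: int


def format_srt_time(ms):
    """Format milliseconds as the HH:MM:SS,mmm form used by SRT."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_srt(timestamps, words_per_cue=3):
    """Group consecutive words into numbered SRT cues."""
    cues = []
    for i in range(0, len(timestamps), words_per_cue):
        group = timestamps[i:i + words_per_cue]
        start = format_srt_time(group[0].start_ms)
        end = format_srt_time(group[-1].end_ms)
        text = " ".join(ts.word for ts in group)
        cues.append(f"{i // words_per_cue + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Values taken from the sample output above.
words = [
    WordTimestamp("Hello", 0, 320),
    WordTimestamp("how", 350, 480),
    WordTimestamp("are", 500, 580),
    WordTimestamp("you", 600, 720),
    WordTimestamp("today", 750, 1100),
]
print(words_to_srt(words))
```

With the real response, `audio.word_timestamps` could be passed in place of `words`, since the helper only reads the three fields named above.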

Next Steps

Streaming: Lower latency with real-time audio streaming
Text Processing: Text normalization and spell tags
Voices: Browse and use different voices
Models: Learn about available models