The official Python SDK for KugelAudio provides a simple, Pythonic interface for text-to-speech generation with both synchronous and asynchronous support.

Installation

pip install kugelaudio
Or with uv (recommended):
uv add kugelaudio

Quick Start

from kugelaudio import KugelAudio

# Initialize the client
client = KugelAudio(api_key="your_api_key")

# Generate speech
audio = client.tts.generate(
    text="Hello, world!",
    model_id="kugel-1-turbo",
)

# Save to file
audio.save("output.wav")

Pre-connecting for Low Latency

For latency-sensitive applications, pre-establish the WebSocket connection at startup to eliminate cold start latency (~500ms) from your first TTS request.
import asyncio
from kugelaudio import KugelAudio

async def main():
    # Create a pre-connected client (~500ms happens here)
    client = await KugelAudio.create(api_key="your_api_key")
    
    # First request is now fast (~100-150ms TTFA instead of ~600ms)
    async for chunk in client.tts.stream_async("Hello, world!"):
        if hasattr(chunk, 'audio'):
            play_audio(chunk.audio)
    
    await client.aclose()

asyncio.run(main())

Sync Applications

For synchronous code, manually call connect() at startup:
from kugelaudio import KugelAudio

# Initialize client
client = KugelAudio(api_key="your_api_key")

# Pre-connect at startup (~500ms happens here)
client.connect()

# Check connection status
print(f"Connected: {client.is_connected()}")

# First request is now fast
for chunk in client.tts.stream("Hello, world!"):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
Without pre-connecting, the first TTS request includes WebSocket connection setup (~500ms). Subsequent requests reuse the connection and are fast (~100-150ms TTFA). Pre-connecting moves this overhead to application startup.
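To see the effect of pre-connecting, you can time the first audio chunk yourself. Below is a minimal sketch; the first_chunk_latency helper is not part of the SDK, and it works with any iterator that yields chunks exposing an audio attribute, like the streams above:

```python
import time

def first_chunk_latency(stream) -> float:
    """Measure time-to-first-audio (TTFA) in milliseconds for a chunk iterator."""
    start = time.perf_counter()
    for item in stream:
        if hasattr(item, "audio"):  # first AudioChunk ends the measurement
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream produced no audio chunks")
```

Call it once before and once after connect() to compare cold and warm TTFA, e.g. first_chunk_latency(client.tts.stream("Hello, world!")).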

Client Configuration

from kugelaudio import KugelAudio

# Simple setup
client = KugelAudio(api_key="your_api_key")

# With custom options
client = KugelAudio(
    api_key="your_api_key",           # Required: Your API key
    api_url="https://api.kugelaudio.com",  # Optional: API base URL
    timeout=60.0,                      # Optional: Request timeout in seconds
)

Local Development

For local development, point directly to your TTS server:
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8000",
)
Or with separate backend and TTS servers:
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8001",   # Backend for REST API
    tts_url="http://localhost:8000",   # TTS server for WebSocket streaming
)

Text-to-Speech

Basic Generation

Generate complete audio and receive it all at once:
audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-1-turbo",  # 'kugel-1-turbo' (fast) or 'kugel-1' (quality)
    voice_id=123,               # Optional: specific voice ID
    cfg_scale=2.0,              # Guidance scale (1.0-5.0)
    max_new_tokens=2048,        # Maximum tokens to generate
    sample_rate=24000,          # Output sample rate
    normalize=True,             # Enable text normalization (default)
    language="en",              # Language for normalization (see below)
    word_timestamps=False,      # Request word-level timestamps (default: False)
)

# Audio properties
print(f"Duration: {audio.duration_seconds:.2f}s")
print(f"Samples: {audio.samples}")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Generation time: {audio.generation_ms:.0f}ms")
print(f"RTF: {audio.rtf:.2f}")  # Real-time factor

# Save to WAV file
audio.save("output.wav")

# Get raw PCM bytes
pcm_data = audio.audio

# Get WAV bytes (with header)
wav_bytes = audio.to_wav_bytes()

Streaming Audio

Receive audio chunks as they are generated for lower latency:
# Synchronous streaming
for item in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-1-turbo",
):
    if hasattr(item, 'audio'):  # AudioChunk
        # Process audio chunk immediately
        print(f"Chunk {item.index}: {len(item.audio)} bytes, {item.samples} samples")
        # play_audio(item.audio)
    elif isinstance(item, dict) and item.get('final'):
        # Final stats
        print(f"Total duration: {item.get('dur_ms', 0):.0f}ms")
        print(f"Generation time: {item.get('gen_ms', 0):.0f}ms")
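The streaming examples above leave playback to an assumed play_audio helper. If you instead want to collect the chunks and save them, the raw pcm_s16le data can be wrapped in a WAV container with nothing but the standard library. A sketch assuming mono 16-bit audio at the documented 24000 Hz default:

```python
import io
import wave

def chunks_to_wav(chunks: list[bytes], sample_rate: int = 24000) -> bytes:
    """Concatenate raw PCM16 mono chunks and wrap them in a WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(chunks))
    return buf.getvalue()
```

Append each item.audio to a list during streaming, then write chunks_to_wav(collected) to a .wav file once the stream ends.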

Async Streaming

For async applications:
import asyncio

async def generate_speech():
    async for item in client.tts.stream_async(
        text="Async streaming example.",
        model_id="kugel-1-turbo",
    ):
        if hasattr(item, 'audio'):
            # Process chunk
            pass

asyncio.run(generate_speech())

Async Generation

import asyncio

async def main():
    audio = await client.tts.generate_async(
        text="Async generation example.",
        model_id="kugel-1-turbo",
    )
    audio.save("async_output.wav")

asyncio.run(main())

Text Normalization

Text normalization converts numbers, dates, times, and other non-verbal text into spoken words:
  • “I have 3 apples” → “I have three apples”
  • “The meeting is at 2:30 PM” → “The meeting is at two thirty PM”
  • “€50.99” → “fifty euros and ninety-nine cents”
# With explicit language (recommended - fastest)
audio = client.tts.generate(
    text="I bought 3 items for €50.99 on 01/15/2024.",
    normalize=True,
    language="en",  # Specify language for best performance
)

# With auto-detection (may cause incorrect normalizations)
audio = client.tts.generate(
    text="Ich habe 3 Artikel für 50,99€ gekauft.",
    normalize=True,
    # language not specified - will auto-detect
)

Supported Languages

Code  Language      Code  Language
de    German        nl    Dutch
en    English       pl    Polish
fr    French        sv    Swedish
es    Spanish       da    Danish
it    Italian       no    Norwegian
pt    Portuguese    fi    Finnish
cs    Czech         hu    Hungarian
ro    Romanian      el    Greek
uk    Ukrainian     bg    Bulgarian
tr    Turkish       vi    Vietnamese
ar    Arabic        hi    Hindi
zh    Chinese       ja    Japanese
ko    Korean
Using normalize=True without specifying language may cause incorrect normalizations, especially for short texts or languages that share similar vocabulary. Always specify language when you know it.

Spell Tags

Use <spell> tags to spell out text letter by letter. This is useful for email addresses, codes, acronyms, or any text that should be pronounced character by character:
# Spell out an email address
audio = client.tts.generate(
    text="Contact me at <spell>kajo@kugelaudio.com</spell>",
    normalize=True,
    language="en",
)
# Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"

# Spell out an acronym
audio = client.tts.generate(
    text="The <spell>API</spell> is easy to use.",
    normalize=True,
    language="en",
)
# Output: "The A, P, I is easy to use."

# German example with language-specific translations
audio = client.tts.generate(
    text="Meine E-Mail ist <spell>test@beispiel.de</spell>",
    normalize=True,
    language="de",
)
# Output: "Meine E-Mail ist T, E, S, T, ät, B, E, I, S, P, I, E, L, Punkt, D, E"
Spell tags also work with streaming:
# Streaming with spell tags - tags spanning chunks are handled automatically
async with client.tts.streaming_session(
    voice_id=123,
    normalize=True,
    language="en",
) as session:
    # Even if the tag is split across tokens, it works correctly
    async for chunk in session.send("My code is <spell>"):
        play_audio(chunk.audio)
    async for chunk in session.send("ABC123</spell>"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
Special Characters: Characters like @, ., - are translated to language-specific words. For example, @ becomes “at” in English, “ät” in German, and “arobase” in French.
Model recommendation: For clearer letter-by-letter pronunciation, use model_id="kugel-1" instead of kugel-1-turbo.

LLM Integration: Streaming Sessions

For real-time TTS when streaming text from an LLM (like GPT-4, Claude, etc.):

Async Streaming Session

import asyncio

async def stream_from_llm():
    # Simulate LLM token stream
    llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]
    
    async with client.tts.streaming_session(
        voice_id=123,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of no input
    ) as session:
        # Send tokens as they arrive from LLM
        for token in llm_tokens:
            async for chunk in session.send(token):
                # Play audio chunk immediately
                play_audio(chunk.audio)
        
        # Flush any remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)

asyncio.run(stream_from_llm())

Synchronous Streaming Session

with client.tts.streaming_session_sync(voice_id=123) as session:
    for token in llm_tokens:
        for chunk in session.send(token):
            play_audio(chunk.audio)
    
    for chunk in session.flush():
        play_audio(chunk.audio)

Word Timestamps with Generate

Request word-level time alignments alongside audio when using generate():
audio = client.tts.generate(
    text="Hello, how are you today?",
    model_id="kugel-1-turbo",
    word_timestamps=True,
)

# Access word timestamps from the response
for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

# Example output:
# Hello: 0ms - 320ms (score: 0.98)
# how: 350ms - 480ms (score: 0.95)
# are: 500ms - 580ms (score: 0.97)
# you: 600ms - 720ms (score: 0.96)
# today: 750ms - 1100ms (score: 0.94)
Word timestamps are also available with async generation:
audio = await client.tts.generate_async(
    text="Hello, world!",
    model_id="kugel-1-turbo",
    word_timestamps=True,
)

for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

Word Timestamps in Streaming

Word timestamps work with all streaming methods. During streaming, they are yielded as list[WordTimestamp] objects between audio chunks:
from kugelaudio.models import WordTimestamp

for item in client.tts.stream(
    text="Hello, how are you today?",
    model_id="kugel-1-turbo",
    word_timestamps=True,
):
    if hasattr(item, 'audio'):  # AudioChunk
        play_audio(item.audio)
    elif isinstance(item, list) and item and isinstance(item[0], WordTimestamp):
        for ts in item:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

Word Timestamps in Streaming Sessions

Request word-level time alignments alongside audio. Timestamps are delivered per chunk after the corresponding audio data:
async with client.tts.streaming_session(
    voice_id=123,
    word_timestamps=True,
) as session:
    async for chunk in session.send("Hello, how are you today?"):
        play_audio(chunk.audio)
    
    async for chunk in session.flush():
        play_audio(chunk.audio)
    
    # Access the latest word timestamps
    timestamps = session.last_word_timestamps
    for ts in timestamps:
        print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")
You can also register a callback to process timestamps as they arrive:
def on_timestamps(timestamps):
    for ts in timestamps:
        print(f"  {ts.word} [{ts.start_ms}-{ts.end_ms}ms]")

async with client.tts.streaming_session(
    voice_id=123,
    on_word_timestamps=on_timestamps,
) as session:
    async for chunk in session.send("Hello world!"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
Word timestamps add no extra audio latency. They arrive ~50-200ms after the corresponding audio chunk and are useful for barge-in handling, subtitle synchronization, and lip-sync.
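As an illustration of the subtitle use case, word timestamps can be rendered as SRT cues with plain Python. A sketch with one cue per word, assuming only the documented word, start_ms, and end_ms fields:

```python
def timestamps_to_srt(timestamps) -> str:
    """Render word timestamps as SRT subtitle cues, one word per cue."""
    def fmt(ms: int) -> str:
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i, ts in enumerate(timestamps, start=1):
        cues.append(f"{i}\n{fmt(ts.start_ms)} --> {fmt(ts.end_ms)}\n{ts.word}\n")
    return "\n".join(cues)
```

In practice you would group several words per cue; the timing format (HH:MM:SS,mmm) is what SRT players expect.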

Voices

List Available Voices

# List all available voices
voices = client.voices.list()

for voice in voices:
    print(f"{voice.id}: {voice.name}")
    print(f"  Category: {voice.category}")
    print(f"  Languages: {', '.join(voice.supported_languages)}")

# Filter by language
german_voices = client.voices.list(language="de")

# Get only public voices
public_voices = client.voices.list(include_public=True)

# Limit results
first_10 = client.voices.list(limit=10)

Get a Specific Voice

voice = client.voices.get(voice_id=123)
print(f"Voice: {voice.name}")
print(f"Category: {voice.category}")

Create a Voice

Create a new voice with optional reference audio files:
voice = client.voices.create(
    name="My Custom Voice",
    sex="female",
    description="A warm, conversational voice",
    category="cloned",
    reference_files=["reference1.wav", "reference2.wav"],
)
print(f"Created voice: {voice.id}")
The reference_files parameter accepts file paths (str or Path) to audio files (WAV, MP3, FLAC).

Update a Voice

voice = client.voices.update(
    voice_id=123,
    name="Updated Name",
    description="New description",
)

Delete a Voice

client.voices.delete(voice_id=123)

Manage Reference Audio

# List references for a voice
refs = client.voices.list_references(voice_id=123)
for ref in refs:
    print(f"{ref.id}: {ref.name}")

# Add a new reference
ref = client.voices.add_reference(
    voice_id=123,
    file_path="new_reference.wav",
    reference_text="Optional transcript of the audio.",
)

# Delete a reference
client.voices.delete_reference(voice_id=123, reference_id=456)

Publish a Voice

Request that your voice be made publicly available. An admin will verify it before it becomes visible to others.
voice = client.voices.publish(voice_id=123)
print(f"Pending verification: {voice.pending_verification}")

Generate Voice Sample

Trigger sample audio generation for a voice:
voice = client.voices.generate_sample(voice_id=123)
print(f"Sample URL: {voice.sample_url}")

Models

List Available Models

models = client.models.list()

for model in models:
    print(f"{model.id}: {model.name}")
    print(f"  Description: {model.description}")
    print(f"  Parameters: {model.parameters}")
    print(f"  Max Input: {model.max_input_length} characters")
    print(f"  Sample Rate: {model.sample_rate} Hz")

Error Handling

from kugelaudio import KugelAudio
from kugelaudio.exceptions import (
    KugelAudioError,
    AuthenticationError,
    RateLimitError,
    InsufficientCreditsError,
    ValidationError,
)

try:
    audio = client.tts.generate(text="Hello!")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded, please wait")
except InsufficientCreditsError:
    print("Not enough credits, please top up")
except ValidationError as e:
    print(f"Invalid request: {e}")
except KugelAudioError as e:
    print(f"API error: {e}")

Data Models

AudioChunk

Represents a single audio chunk from streaming:
class AudioChunk:
    audio: bytes          # Raw PCM16 audio data
    encoding: str         # 'pcm_s16le'
    index: int           # Chunk index (0-based)
    sample_rate: int     # Sample rate (24000)
    samples: int         # Number of samples in chunk
    
    @property
    def duration_seconds(self) -> float:
        """Duration of this chunk in seconds."""

AudioResponse

Complete audio response from generation:
class AudioResponse:
    audio: bytes                          # Complete PCM16 audio
    sample_rate: int                      # Sample rate (24000)
    samples: int                          # Total samples
    duration_ms: float                    # Duration in milliseconds
    generation_ms: float                  # Generation time in milliseconds
    rtf: float                           # Real-time factor
    word_timestamps: list[WordTimestamp]  # Per-word timing (when word_timestamps=True)
    
    @property
    def duration_seconds(self) -> float:
        """Duration in seconds."""
    
    def save(self, path: str) -> None:
        """Save as WAV file."""
    
    def to_wav_bytes(self) -> bytes:
        """Get WAV file as bytes."""

WordTimestamp

Word-level time alignment for a generated audio chunk:
class WordTimestamp:
    word: str          # The aligned word
    start_ms: int      # Start time in milliseconds (relative to chunk)
    end_ms: int        # End time in milliseconds (relative to chunk)
    char_start: int    # Start character offset in original text
    char_end: int      # End character offset in original text
    score: float       # Alignment confidence (0.0 - 1.0)
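The char_start/char_end offsets map each word back into the original input text, which is handy for highlighting the word currently being spoken. A minimal illustrative sketch (the helper is not part of the SDK):

```python
def highlight_word(text: str, ts) -> str:
    """Wrap the aligned word in brackets using its character offsets."""
    return (
        text[:ts.char_start]
        + "[" + text[ts.char_start:ts.char_end] + "]"
        + text[ts.char_end:]
    )
```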

Model

TTS model information:
class Model:
    id: str                   # 'kugel-1-turbo' or 'kugel-1'
    name: str                 # Human-readable name
    description: str          # Model description
    parameters: str           # Parameter count (as shown in the listing examples)
    max_input_length: int     # Maximum input characters
    sample_rate: int          # Output sample rate

Voice

Voice information (returned by list):
class Voice:
    id: int                          # Voice ID
    voice_id: int                    # Same as id (backward compat)
    name: str                        # Voice name
    description: Optional[str]       # Description
    category: str                    # 'premade', 'cloned', 'designed', 'conversational', 'narrative', 'narrative_story', 'characters'
    sex: Optional[str]               # 'male', 'female', 'neutral'
    age: Optional[str]               # 'young', 'middle_aged', 'old'
    supported_languages: List[str]   # ['en', 'de', ...]
    avatar_url: Optional[str]        # Avatar image URL
    sample_url: Optional[str]        # Sample audio URL

VoiceDetail

Extended voice information (returned by create, update, get, publish, generate_sample):
class VoiceDetail:
    id: int
    name: str
    description: str
    generative_voice_description: str
    supported_languages: List[str]
    category: str
    age: Optional[str]
    sex: Optional[str]
    quality: str                  # 'low', 'mid', 'high'
    is_public: bool
    verified: bool
    pending_verification: bool
    sample_url: Optional[str]
    avatar_url: Optional[str]
    sample_text: str

VoiceReference

Voice reference audio metadata:
class VoiceReference:
    id: int
    voice_id: int
    name: str
    reference_text: str
    s3_path: str
    audio_url: Optional[str]
    is_generated: bool

Complete Example

from kugelaudio import KugelAudio

# Initialize client
client = KugelAudio(api_key="your_api_key")

# List available models
print("Available Models:")
for model in client.models.list():
    print(f"  - {model.id}: {model.name} ({model.parameters})")

# List available voices
print("\nAvailable Voices:")
for voice in client.voices.list(limit=5):
    print(f"  - {voice.id}: {voice.name}")

# Generate audio
print("\nGenerating audio...")
audio = client.tts.generate(
    text="Welcome to KugelAudio. This is an example of high-quality text-to-speech synthesis.",
    model_id="kugel-1-turbo",
)

print(f"Generated {audio.duration_seconds:.2f}s of audio in {audio.generation_ms:.0f}ms")
print(f"Real-time factor: {audio.rtf:.2f}x")

# Save to file
audio.save("example.wav")
print("Saved to example.wav")

# Close client
client.close()