The official Python SDK for KugelAudio provides a simple, Pythonic interface for text-to-speech generation with both synchronous and asynchronous support.

Installation

pip install kugelaudio
Or with uv (recommended):
uv add kugelaudio

Quick Start

from kugelaudio import KugelAudio

# Initialize the client
client = KugelAudio(api_key="your_api_key")

# Generate speech
audio = client.tts.generate(
    text="Hello, world!",
    model_id="kugel-1-turbo",
)

# Save to file
audio.save("output.wav")

Pre-connecting for Low Latency

For latency-sensitive applications, pre-establish the WebSocket connection at startup to eliminate cold start latency (~500ms) from your first TTS request.
import asyncio
from kugelaudio import KugelAudio

async def main():
    # Create a pre-connected client (~500ms happens here)
    client = await KugelAudio.create(api_key="your_api_key")
    
    # First request is now fast (~100-150ms TTFA instead of ~600ms)
    async for chunk in client.tts.stream_async("Hello, world!"):
        if hasattr(chunk, 'audio'):
            play_audio(chunk.audio)
    
    await client.aclose()

asyncio.run(main())

Sync Applications

For synchronous code, manually call connect() at startup:
from kugelaudio import KugelAudio

# Initialize client
client = KugelAudio(api_key="your_api_key")

# Pre-connect at startup (~500ms happens here)
client.connect()

# Check connection status
print(f"Connected: {client.is_connected()}")

# First request is now fast
for chunk in client.tts.stream("Hello, world!"):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
Without pre-connecting, the first TTS request includes WebSocket connection setup (~500ms). Subsequent requests reuse the connection and are fast (~100-150ms TTFA). Pre-connecting moves this overhead to application startup.
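To see the effect of pre-connecting, you can time the first audio chunk yourself. Below is a minimal sketch; the first_chunk_latency helper is not part of the SDK, and it works with any iterator that yields chunks exposing an audio attribute, like the streams above:

```python
import time

def first_chunk_latency(stream) -> float:
    """Measure time-to-first-audio (TTFA) in milliseconds for a chunk iterator."""
    start = time.perf_counter()
    for item in stream:
        if hasattr(item, "audio"):  # first AudioChunk ends the measurement
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream produced no audio chunks")
```

Call it once before and once after connect() to compare cold and warm TTFA, e.g. first_chunk_latency(client.tts.stream("Hello, world!")).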

Client Configuration

from kugelaudio import KugelAudio

# Simple setup
client = KugelAudio(api_key="your_api_key")

# With custom options
client = KugelAudio(
    api_key="your_api_key",           # Required: Your API key
    api_url="https://api.kugelaudio.com",  # Optional: API base URL
    timeout=60.0,                      # Optional: Request timeout in seconds
)

Local Development

For local development, point directly to your TTS server:
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8000",
)
Or with separate backend and TTS servers:
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8001",   # Backend for REST API
    tts_url="http://localhost:8000",   # TTS server for WebSocket streaming
)

Text-to-Speech

Basic Generation

Generate complete audio and receive it all at once:
audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-1-turbo",  # 'kugel-1-turbo' (fast) or 'kugel-1' (quality)
    voice_id=123,               # Optional: specific voice ID
    cfg_scale=2.0,              # Guidance scale (1.0-5.0)
    max_new_tokens=2048,        # Maximum tokens to generate
    sample_rate=24000,          # Output sample rate
    normalize=True,             # Enable text normalization (default)
    language="en",              # Language for normalization (see below)
    word_timestamps=False,      # Request word-level timestamps (default: False)
)

# Audio properties
print(f"Duration: {audio.duration_seconds:.2f}s")
print(f"Samples: {audio.samples}")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Generation time: {audio.generation_ms:.0f}ms")
print(f"RTF: {audio.rtf:.2f}")  # Real-time factor

# Save to WAV file
audio.save("output.wav")

# Get raw PCM bytes
pcm_data = audio.audio

# Get WAV bytes (with header)
wav_bytes = audio.to_wav_bytes()

Streaming Audio

Receive audio chunks as they are generated for lower latency:
# Synchronous streaming
for item in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-1-turbo",
):
    if hasattr(item, 'audio'):  # AudioChunk
        # Process audio chunk immediately
        print(f"Chunk {item.index}: {len(item.audio)} bytes, {item.samples} samples")
        # play_audio(item.audio)
    elif isinstance(item, dict) and item.get('final'):
        # Final stats
        print(f"Total duration: {item.get('dur_ms', 0):.0f}ms")
        print(f"Generation time: {item.get('gen_ms', 0):.0f}ms")
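The streaming examples above leave playback to an assumed play_audio helper. If you instead want to collect the chunks and save them, the raw pcm_s16le data can be wrapped in a WAV container with nothing but the standard library. A sketch assuming mono 16-bit audio at the documented 24000 Hz default:

```python
import io
import wave

def chunks_to_wav(chunks: list[bytes], sample_rate: int = 24000) -> bytes:
    """Concatenate raw PCM16 mono chunks and wrap them in a WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(chunks))
    return buf.getvalue()
```

Append each item.audio to a list during streaming, then write chunks_to_wav(collected) to a .wav file once the stream ends.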

Async Streaming

For async applications:
import asyncio

async def generate_speech():
    async for item in client.tts.stream_async(
        text="Async streaming example.",
        model_id="kugel-1-turbo",
    ):
        if hasattr(item, 'audio'):
            # Process chunk
            pass

asyncio.run(generate_speech())

Async Generation

import asyncio

async def main():
    audio = await client.tts.generate_async(
        text="Async generation example.",
        model_id="kugel-1-turbo",
    )
    audio.save("async_output.wav")

asyncio.run(main())

Text Normalization

Text normalization converts numbers, dates, times, and other non-verbal text into spoken words:
  • “I have 3 apples” → “I have three apples”
  • “The meeting is at 2:30 PM” → “The meeting is at two thirty PM”
  • “€50.99” → “fifty euros and ninety-nine cents”
# With explicit language (recommended - fastest)
audio = client.tts.generate(
    text="I bought 3 items for €50.99 on 01/15/2024.",
    normalize=True,
    language="en",  # Specify language for best performance
)

# With auto-detection (may cause incorrect normalizations)
audio = client.tts.generate(
    text="Ich habe 3 Artikel für 50,99€ gekauft.",
    normalize=True,
    # language not specified - will auto-detect
)

Supported Languages

Code  Language      Code  Language
de    German        nl    Dutch
en    English       pl    Polish
fr    French        sv    Swedish
es    Spanish       da    Danish
it    Italian       no    Norwegian
pt    Portuguese    fi    Finnish
cs    Czech         hu    Hungarian
ro    Romanian      el    Greek
uk    Ukrainian     bg    Bulgarian
tr    Turkish       vi    Vietnamese
ar    Arabic        hi    Hindi
zh    Chinese       ja    Japanese
ko    Korean
Using normalize=True without specifying language may cause incorrect normalizations, especially for short texts or languages that share similar vocabulary. Always specify language when you know it.

Spell Tags

Use <spell> tags to spell out text letter by letter. This is useful for email addresses, codes, acronyms, or any text that should be pronounced character by character:
# Spell out an email address
audio = client.tts.generate(
    text="Contact me at <spell>kajo@kugelaudio.com</spell>",
    normalize=True,
    language="en",
)
# Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"

# Spell out an acronym
audio = client.tts.generate(
    text="The <spell>API</spell> is easy to use.",
    normalize=True,
    language="en",
)
# Output: "The A, P, I is easy to use."

# German example with language-specific translations
audio = client.tts.generate(
    text="Meine E-Mail ist <spell>test@beispiel.de</spell>",
    normalize=True,
    language="de",
)
# Output: "Meine E-Mail ist T, E, S, T, ät, B, E, I, S, P, I, E, L, Punkt, D, E"
Spell tags also work with streaming:
# Streaming with spell tags - tags spanning chunks are handled automatically
async with client.tts.streaming_session(
    voice_id=123,
    normalize=True,
    language="en",
) as session:
    # Even if the tag is split across tokens, it works correctly
    async for chunk in session.send("My code is <spell>"):
        play_audio(chunk.audio)
    async for chunk in session.send("ABC123</spell>"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
Special Characters: Characters like @, ., - are translated to language-specific words. For example, @ becomes “at” in English, “ät” in German, and “arobase” in French.
Model recommendation: For clearer letter-by-letter pronunciation, use model_id="kugel-1" instead of kugel-1-turbo.

LLM Integration: Streaming Sessions

For real-time TTS when streaming text from an LLM (like GPT-4, Claude, etc.):

Async Streaming Session

import asyncio

async def stream_from_llm():
    # Simulate LLM token stream
    llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]
    
    async with client.tts.streaming_session(
        voice_id=123,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of no input
    ) as session:
        # Send tokens as they arrive from LLM
        for token in llm_tokens:
            async for chunk in session.send(token):
                # Play audio chunk immediately
                play_audio(chunk.audio)
        
        # Flush any remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)

asyncio.run(stream_from_llm())

Synchronous Streaming Session

with client.tts.streaming_session_sync(voice_id=123) as session:
    for token in llm_tokens:
        for chunk in session.send(token):
            play_audio(chunk.audio)
    
    for chunk in session.flush():
        play_audio(chunk.audio)

Word Timestamps with Generate

Request word-level time alignments alongside audio when using generate():
audio = client.tts.generate(
    text="Hello, how are you today?",
    model_id="kugel-1-turbo",
    word_timestamps=True,
)

# Access word timestamps from the response
for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

# Example output:
# Hello: 0ms - 320ms (score: 0.98)
# how: 350ms - 480ms (score: 0.95)
# are: 500ms - 580ms (score: 0.97)
# you: 600ms - 720ms (score: 0.96)
# today: 750ms - 1100ms (score: 0.94)
Word timestamps are also available with async generation:
audio = await client.tts.generate_async(
    text="Hello, world!",
    model_id="kugel-1-turbo",
    word_timestamps=True,
)

for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

Word Timestamps in Streaming

Word timestamps work with all streaming methods. During streaming, they are yielded as list[WordTimestamp] objects between audio chunks:
from kugelaudio.models import WordTimestamp

for item in client.tts.stream(
    text="Hello, how are you today?",
    model_id="kugel-1-turbo",
    word_timestamps=True,
):
    if hasattr(item, 'audio'):  # AudioChunk
        play_audio(item.audio)
    elif isinstance(item, list) and item and isinstance(item[0], WordTimestamp):
        for ts in item:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

Word Timestamps in Streaming Sessions

Request word-level time alignments alongside audio. Timestamps are delivered per chunk after the corresponding audio data:
async with client.tts.streaming_session(
    voice_id=123,
    word_timestamps=True,
) as session:
    async for chunk in session.send("Hello, how are you today?"):
        play_audio(chunk.audio)
    
    async for chunk in session.flush():
        play_audio(chunk.audio)
    
    # Access the latest word timestamps
    timestamps = session.last_word_timestamps
    for ts in timestamps:
        print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")
You can also register a callback to process timestamps as they arrive:
def on_timestamps(timestamps):
    for ts in timestamps:
        print(f"  {ts.word} [{ts.start_ms}-{ts.end_ms}ms]")

async with client.tts.streaming_session(
    voice_id=123,
    on_word_timestamps=on_timestamps,
) as session:
    async for chunk in session.send("Hello world!"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
Word timestamps add no extra audio latency. They arrive ~50-200ms after the corresponding audio chunk and are useful for barge-in handling, subtitle synchronization, and lip-sync.
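As an illustration of the subtitle use case, word timestamps can be rendered as SRT cues with plain Python. A sketch with one cue per word, assuming only the documented word, start_ms, and end_ms fields:

```python
def timestamps_to_srt(timestamps) -> str:
    """Render word timestamps as SRT subtitle cues, one word per cue."""
    def fmt(ms: int) -> str:
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i, ts in enumerate(timestamps, start=1):
        cues.append(f"{i}\n{fmt(ts.start_ms)} --> {fmt(ts.end_ms)}\n{ts.word}\n")
    return "\n".join(cues)
```

In practice you would group several words per cue; the timing format (HH:MM:SS,mmm) is what SRT players expect.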

Voices

List Available Voices

# List all available voices
voices = client.voices.list()

for voice in voices:
    print(f"{voice.id}: {voice.name}")
    print(f"  Category: {voice.category}")
    print(f"  Languages: {', '.join(voice.supported_languages)}")

# Filter by language
german_voices = client.voices.list(language="de")

# Get only public voices
public_voices = client.voices.list(include_public=True)

# Limit results
first_10 = client.voices.list(limit=10)

Get a Specific Voice

voice = client.voices.get(voice_id=123)
print(f"Voice: {voice.name}")
print(f"Category: {voice.category}")

Create a Voice

Create a new voice with optional reference audio files:
voice = client.voices.create(
    name="My Custom Voice",
    sex="female",
    description="A warm, conversational voice",
    category="cloned",
    reference_files=["reference1.wav", "reference2.wav"],
)
print(f"Created voice: {voice.id}")
The reference_files parameter accepts file paths (str or Path) to audio files (WAV, MP3, FLAC).

Update a Voice

voice = client.voices.update(
    voice_id=123,
    name="Updated Name",
    description="New description",
)

Delete a Voice

client.voices.delete(voice_id=123)

Manage Reference Audio

# List references for a voice
refs = client.voices.list_references(voice_id=123)
for ref in refs:
    print(f"{ref.id}: {ref.name}")

# Add a new reference
ref = client.voices.add_reference(
    voice_id=123,
    file_path="new_reference.wav",
    reference_text="Optional transcript of the audio.",
)

# Delete a reference
client.voices.delete_reference(voice_id=123, reference_id=456)

Publish a Voice

Request that your voice be made publicly available. An admin will verify it before it becomes visible to others.
voice = client.voices.publish(voice_id=123)
print(f"Pending verification: {voice.pending_verification}")

Generate Voice Sample

Trigger sample audio generation for a voice:
voice = client.voices.generate_sample(voice_id=123)
print(f"Sample URL: {voice.sample_url}")

Models

List Available Models

models = client.models.list()

for model in models:
    print(f"{model.id}: {model.name}")
    print(f"  Description: {model.description}")
    print(f"  Parameters: {model.parameters}")
    print(f"  Max Input: {model.max_input_length} characters")
    print(f"  Sample Rate: {model.sample_rate} Hz")

Error Handling

from kugelaudio import KugelAudio
from kugelaudio.exceptions import (
    KugelAudioError,
    AuthenticationError,
    RateLimitError,
    InsufficientCreditsError,
    ValidationError,
)

try:
    audio = client.tts.generate(text="Hello!")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded, please wait")
except InsufficientCreditsError:
    print("Not enough credits, please top up")
except ValidationError as e:
    print(f"Invalid request: {e}")
except KugelAudioError as e:
    print(f"API error: {e}")

Data Models

AudioChunk

Represents a single audio chunk from streaming:
class AudioChunk:
    audio: bytes          # Raw PCM16 audio data
    encoding: str         # 'pcm_s16le'
    index: int           # Chunk index (0-based)
    sample_rate: int     # Sample rate (24000)
    samples: int         # Number of samples in chunk
    
    @property
    def duration_seconds(self) -> float:
        """Duration of this chunk in seconds."""

AudioResponse

Complete audio response from generation:
class AudioResponse:
    audio: bytes                          # Complete PCM16 audio
    sample_rate: int                      # Sample rate (24000)
    samples: int                          # Total samples
    duration_ms: float                    # Duration in milliseconds
    generation_ms: float                  # Generation time in milliseconds
    rtf: float                           # Real-time factor
    word_timestamps: list[WordTimestamp]  # Per-word timing (when word_timestamps=True)
    
    @property
    def duration_seconds(self) -> float:
        """Duration in seconds."""
    
    def save(self, path: str) -> None:
        """Save as WAV file."""
    
    def to_wav_bytes(self) -> bytes:
        """Get WAV file as bytes."""

WordTimestamp

Word-level time alignment for a generated audio chunk:
class WordTimestamp:
    word: str          # The aligned word
    start_ms: int      # Start time in milliseconds (relative to chunk)
    end_ms: int        # End time in milliseconds (relative to chunk)
    char_start: int    # Start character offset in original text
    char_end: int      # End character offset in original text
    score: float       # Alignment confidence (0.0 - 1.0)
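The char_start/char_end offsets map each word back into the original input text, which is handy for highlighting the word currently being spoken. A minimal illustrative sketch (the helper is not part of the SDK):

```python
def highlight_word(text: str, ts) -> str:
    """Wrap the aligned word in brackets using its character offsets."""
    return (
        text[:ts.char_start]
        + "[" + text[ts.char_start:ts.char_end] + "]"
        + text[ts.char_end:]
    )
```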

Model

TTS model information:
class Model:
    id: str                   # 'kugel-1-turbo' or 'kugel-1'
    name: str                 # Human-readable name
    description: str          # Model description
    parameters: str           # Parameter count (as shown in the listing examples)
    max_input_length: int     # Maximum input characters
    sample_rate: int          # Output sample rate

Voice

Voice information (returned by list):
class Voice:
    id: int                          # Voice ID
    voice_id: int                    # Same as id (backward compat)
    name: str                        # Voice name
    description: Optional[str]       # Description
    category: str                    # 'premade', 'cloned', 'designed', 'conversational', 'narrative', 'narrative_story', 'characters'
    sex: Optional[str]               # 'male', 'female', 'neutral'
    age: Optional[str]               # 'young', 'middle_aged', 'old'
    supported_languages: List[str]   # ['en', 'de', ...]
    avatar_url: Optional[str]        # Avatar image URL
    sample_url: Optional[str]        # Sample audio URL

VoiceDetail

Extended voice information (returned by create, update, get, publish, generate_sample):
class VoiceDetail:
    id: int
    name: str
    description: str
    generative_voice_description: str
    supported_languages: List[str]
    category: str
    age: Optional[str]
    sex: Optional[str]
    quality: str                  # 'low', 'mid', 'high'
    is_public: bool
    verified: bool
    pending_verification: bool
    sample_url: Optional[str]
    avatar_url: Optional[str]
    sample_text: str

VoiceReference

Voice reference audio metadata:
class VoiceReference:
    id: int
    voice_id: int
    name: str
    reference_text: str
    s3_path: str
    audio_url: Optional[str]
    is_generated: bool

Complete Example

from kugelaudio import KugelAudio

# Initialize client
client = KugelAudio(api_key="your_api_key")

# List available models
print("Available Models:")
for model in client.models.list():
    print(f"  - {model.id}: {model.name} ({model.parameters})")

# List available voices
print("\nAvailable Voices:")
for voice in client.voices.list(limit=5):
    print(f"  - {voice.id}: {voice.name}")

# Generate audio
print("\nGenerating audio...")
audio = client.tts.generate(
    text="Welcome to KugelAudio. This is an example of high-quality text-to-speech synthesis.",
    model_id="kugel-1-turbo",
)

print(f"Generated {audio.duration_seconds:.2f}s of audio in {audio.generation_ms:.0f}ms")
print(f"Real-time factor: {audio.rtf:.2f}x")

# Save to file
audio.save("example.wav")
print("Saved to example.wav")

# Close client
client.close()