The official Python SDK for KugelAudio provides a simple, Pythonic interface for text-to-speech generation with both synchronous and asynchronous support.
Installation
The package can be installed with pip, or with uv (recommended):
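Assuming the package is published under the same name as its import, `kugelaudio` (an assumption, not verified here), installation looks like:

```shell
# Package name assumed to match the import ("kugelaudio")
pip install kugelaudio

# Or with uv (recommended)
uv add kugelaudio
```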
Quick Start
from kugelaudio import KugelAudio

# Initialize the client
client = KugelAudio(api_key="your_api_key")

# Generate speech
audio = client.tts.generate(
    text="Hello, world!",
    model_id="kugel-1-turbo",
)

# Save to file
audio.save("output.wav")
Pre-connecting for Low Latency
For latency-sensitive applications, pre-establish the WebSocket connection at startup to eliminate cold start latency (~500ms) from your first TTS request.
Async Applications (Recommended)
import asyncio
from kugelaudio import KugelAudio

async def main():
    # Create a pre-connected client (~500ms happens here)
    client = await KugelAudio.create(api_key="your_api_key")

    # First request is now fast (~100-150ms TTFA instead of ~600ms)
    async for chunk in client.tts.stream_async("Hello, world!"):
        if hasattr(chunk, 'audio'):
            play_audio(chunk.audio)

    await client.aclose()

asyncio.run(main())
Sync Applications
For synchronous code, manually call connect() at startup:
from kugelaudio import KugelAudio

# Initialize client
client = KugelAudio(api_key="your_api_key")

# Pre-connect at startup (~500ms happens here)
client.connect()

# Check connection status
print(f"Connected: {client.is_connected()}")

# First request is now fast
for chunk in client.tts.stream("Hello, world!"):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
Without pre-connecting, the first TTS request includes WebSocket connection setup (~500ms).
Subsequent requests reuse the connection and are fast (~100-150ms TTFA).
Pre-connecting moves this overhead to application startup.
Client Configuration
from kugelaudio import KugelAudio

# Simple setup
client = KugelAudio(api_key="your_api_key")

# With custom options
client = KugelAudio(
    api_key="your_api_key",                # Required: Your API key
    api_url="https://api.kugelaudio.com",  # Optional: API base URL
    timeout=60.0,                          # Optional: Request timeout in seconds
)
Local Development
For local development, point directly to your TTS server:
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8000",
)
Or with separate backend and TTS servers:
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8001",  # Backend for REST API
    tts_url="http://localhost:8000",  # TTS server for WebSocket streaming
)
Text-to-Speech
Basic Generation
Generate complete audio and receive it all at once:
audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-1-turbo",  # 'kugel-1-turbo' (fast) or 'kugel-1' (quality)
    voice_id=123,              # Optional: specific voice ID
    cfg_scale=2.0,             # Guidance scale (1.0-5.0)
    max_new_tokens=2048,       # Maximum tokens to generate
    sample_rate=24000,         # Output sample rate
    normalize=True,            # Enable text normalization (default)
    language="en",             # Language for normalization (see below)
)

# Audio properties
print(f"Duration: {audio.duration_seconds:.2f}s")
print(f"Samples: {audio.samples}")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Generation time: {audio.generation_ms:.0f}ms")
print(f"RTF: {audio.rtf:.2f}")  # Real-time factor

# Save to WAV file
audio.save("output.wav")

# Get raw PCM bytes
pcm_data = audio.audio

# Get WAV bytes (with header)
wav_bytes = audio.to_wav_bytes()
Streaming Audio
Receive audio chunks as they are generated for lower latency:
# Synchronous streaming
for item in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-1-turbo",
):
    if hasattr(item, 'audio'):  # AudioChunk
        # Process audio chunk immediately
        print(f"Chunk {item.index}: {len(item.audio)} bytes, {item.samples} samples")
        # play_audio(item.audio)
    elif isinstance(item, dict) and item.get('final'):
        # Final stats
        print(f"Total duration: {item.get('dur_ms', 0):.0f}ms")
        print(f"Time to first audio: {item.get('ttfa_ms', 0):.0f}ms")
Async Streaming
For async applications:
import asyncio

async def generate_speech():
    async for item in client.tts.stream_async(
        text="Async streaming example.",
        model_id="kugel-1-turbo",
    ):
        if hasattr(item, 'audio'):
            # Process chunk
            pass

asyncio.run(generate_speech())
Async Generation
import asyncio

async def main():
    audio = await client.tts.generate_async(
        text="Async generation example.",
        model_id="kugel-1-turbo",
    )
    audio.save("async_output.wav")

asyncio.run(main())
Text Normalization
Text normalization converts numbers, dates, times, and other non-verbal text into spoken words:
- “I have 3 apples” → “I have three apples”
- “The meeting is at 2:30 PM” → “The meeting is at two thirty PM”
- “€50.99” → “fifty euros and ninety-nine cents”
# With explicit language (recommended - fastest)
audio = client.tts.generate(
    text="I bought 3 items for €50.99 on 01/15/2024.",
    normalize=True,
    language="en",  # Specify language for best performance
)

# With auto-detection (adds ~150ms latency)
audio = client.tts.generate(
    text="Ich habe 3 Artikel für 50,99€ gekauft.",
    normalize=True,
    # language not specified - will auto-detect
)
Supported Languages
| Code | Language | Code | Language |
|---|---|---|---|
| de | German | nl | Dutch |
| en | English | pl | Polish |
| fr | French | sv | Swedish |
| es | Spanish | da | Danish |
| it | Italian | no | Norwegian |
| pt | Portuguese | fi | Finnish |
| cs | Czech | hu | Hungarian |
| ro | Romanian | el | Greek |
| uk | Ukrainian | bg | Bulgarian |
| tr | Turkish | vi | Vietnamese |
| ar | Arabic | hi | Hindi |
| zh | Chinese | ja | Japanese |
| ko | Korean | | |
Using normalize=True without specifying language adds approximately 150ms latency for language auto-detection. For best performance in latency-sensitive applications, always specify the language parameter.
Spell Tags
Use <spell> tags to spell out text letter by letter. This is useful for email addresses, codes, acronyms, or any text that should be pronounced character by character:
# Spell out an email address
audio = client.tts.generate(
    text="Contact me at <spell>kajo@kugelaudio.com</spell>",
    normalize=True,
    language="en",
)
# Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"

# Spell out an acronym
audio = client.tts.generate(
    text="The <spell>API</spell> is easy to use.",
    normalize=True,
    language="en",
)
# Output: "The A, P, I is easy to use."

# German example with language-specific translations
audio = client.tts.generate(
    text="Meine E-Mail ist <spell>test@beispiel.de</spell>",
    normalize=True,
    language="de",
)
# Output: "Meine E-Mail ist T, E, S, T, ät, B, E, I, S, P, I, E, L, Punkt, D, E"
Spell tags also work with streaming:
# Streaming with spell tags - tags spanning chunks are handled automatically
async with client.tts.streaming_session(
    voice_id=123,
    normalize=True,
    language="en",
) as session:
    # Even if the tag is split across tokens, it works correctly
    async for chunk in session.send("My code is <spell>"):
        play_audio(chunk.audio)
    async for chunk in session.send("ABC123</spell>"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
Special Characters: Characters like @, ., - are translated to language-specific words.
For example, @ becomes “at” in English, “ät” in German, and “arobase” in French.
Model recommendation: For clearer letter-by-letter pronunciation, use model_id="kugel-1" instead of kugel-1-turbo.
LLM Integration: Streaming Sessions
For real-time TTS when streaming text from an LLM (like GPT-4, Claude, etc.):
Async Streaming Session
import asyncio

async def stream_from_llm():
    # Simulate LLM token stream
    llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]

    async with client.tts.streaming_session(
        voice_id=123,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of no input
    ) as session:
        # Send tokens as they arrive from LLM
        for token in llm_tokens:
            async for chunk in session.send(token):
                # Play audio chunk immediately
                play_audio(chunk.audio)

        # Flush any remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)

asyncio.run(stream_from_llm())
Synchronous Streaming Session
llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]

with client.tts.streaming_session_sync(voice_id=123) as session:
    for token in llm_tokens:
        for chunk in session.send(token):
            play_audio(chunk.audio)
    for chunk in session.flush():
        play_audio(chunk.audio)
Word Timestamps in Streaming Sessions
Request word-level time alignments alongside audio. Timestamps are delivered per chunk after the corresponding audio data:
async with client.tts.streaming_session(
    voice_id=123,
    word_timestamps=True,  # enabled by default
) as session:
    async for chunk in session.send("Hello, how are you today?"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

    # Access the latest word timestamps
    timestamps = session.last_word_timestamps
    for ts in timestamps:
        print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")
You can also register a callback to process timestamps as they arrive:
def on_timestamps(timestamps):
    for ts in timestamps:
        print(f"  {ts.word} [{ts.start_ms}-{ts.end_ms}ms]")

async with client.tts.streaming_session(
    voice_id=123,
    on_word_timestamps=on_timestamps,
) as session:
    async for chunk in session.send("Hello world!"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
Word timestamps add no extra audio latency. They arrive ~50-200ms after the corresponding audio chunk and are useful for barge-in handling, subtitle synchronization, and lip-sync.
Voices
List Available Voices
# List all available voices
voices = client.voices.list()
for voice in voices:
    print(f"{voice.id}: {voice.name}")
    print(f"  Category: {voice.category}")
    print(f"  Languages: {', '.join(voice.supported_languages)}")

# Filter by language
german_voices = client.voices.list(language="de")

# Get only public voices
public_voices = client.voices.list(include_public=True)

# Limit results
first_10 = client.voices.list(limit=10)
Get a Specific Voice
voice = client.voices.get(voice_id=123)
print(f"Voice: {voice.name}")
print(f"Category: {voice.category}")
Models
List Available Models
models = client.models.list()
for model in models:
    print(f"{model.id}: {model.name}")
    print(f"  Description: {model.description}")
    print(f"  Parameters: {model.parameters}")
    print(f"  Max Input: {model.max_input_length} characters")
    print(f"  Sample Rate: {model.sample_rate} Hz")
Error Handling
from kugelaudio import KugelAudio
from kugelaudio.exceptions import (
    KugelAudioError,
    AuthenticationError,
    RateLimitError,
    InsufficientCreditsError,
    ValidationError,
    ConnectionError,
)

client = KugelAudio(api_key="your_api_key")

try:
    audio = client.tts.generate(text="Hello!")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded, please wait")
except InsufficientCreditsError:
    print("Not enough credits, please top up")
except ValidationError as e:
    print(f"Invalid request: {e}")
except ConnectionError:
    print("Failed to connect to server")
except KugelAudioError as e:
    print(f"API error: {e}")
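Of these, RateLimitError is usually worth retrying with backoff rather than surfacing to the user. A generic sketch; the `with_retries` helper below is not part of the SDK, and the delay values are illustrative:

```python
import time

def with_retries(fn, retryable, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying on the given exception type(s) with
    exponential backoff: base_delay, then 2x, 4x, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage with the SDK:
# audio = with_retries(lambda: client.tts.generate(text="Hello!"), RateLimitError)
```

AuthenticationError, ValidationError, and InsufficientCreditsError are not transient, so retrying them only wastes credits and time.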
Data Models
AudioChunk
Represents a single audio chunk from streaming:
class AudioChunk:
    audio: bytes      # Raw PCM16 audio data
    encoding: str     # 'pcm_s16le'
    index: int        # Chunk index (0-based)
    sample_rate: int  # Sample rate (24000)
    samples: int      # Number of samples in chunk

    @property
    def duration_seconds(self) -> float:
        """Duration of this chunk in seconds."""
AudioResponse
Complete audio response from generation:
class AudioResponse:
    audio: bytes          # Complete PCM16 audio
    sample_rate: int      # Sample rate (24000)
    samples: int          # Total samples
    duration_ms: float    # Duration in milliseconds
    generation_ms: float  # Generation time in milliseconds
    rtf: float            # Real-time factor

    @property
    def duration_seconds(self) -> float:
        """Duration in seconds."""

    def save(self, path: str) -> None:
        """Save as WAV file."""

    def to_wav_bytes(self) -> bytes:
        """Get WAV file as bytes."""
WordTimestamp
Word-level time alignment for a generated audio chunk:
class WordTimestamp:
    word: str        # The aligned word
    start_ms: int    # Start time in milliseconds (relative to chunk)
    end_ms: int      # End time in milliseconds (relative to chunk)
    char_start: int  # Start character offset in original text
    char_end: int    # End character offset in original text
    score: float     # Alignment confidence (0.0 - 1.0)
Model
TTS model information:
class Model:
    id: str                # 'kugel-1-turbo' or 'kugel-1'
    name: str              # Human-readable name
    description: str       # Model description
    parameters: str        # Parameter count ('1.5B', '7B')
    max_input_length: int  # Maximum input characters
    sample_rate: int       # Output sample rate
Voice
Voice information:
class Voice:
    id: int                         # Voice ID
    voice_id: int                   # Same as id (backward compat)
    name: str                       # Voice name
    description: Optional[str]      # Description
    category: str                   # 'premade', 'cloned', 'designed', 'conversational', 'narrative', 'narrative_story', 'characters'
    sex: Optional[str]              # 'male', 'female', 'neutral'
    age: Optional[str]              # 'young', 'middle_aged', 'old'
    supported_languages: List[str]  # ['en', 'de', ...]
    avatar_url: Optional[str]       # Avatar image URL
    sample_url: Optional[str]       # Sample audio URL
Complete Example
from kugelaudio import KugelAudio

# Initialize client
client = KugelAudio(api_key="your_api_key")

# List available models
print("Available Models:")
for model in client.models.list():
    print(f"  - {model.id}: {model.name} ({model.parameters})")

# List available voices
print("\nAvailable Voices:")
for voice in client.voices.list(limit=5):
    print(f"  - {voice.id}: {voice.name}")

# Generate audio
print("\nGenerating audio...")
audio = client.tts.generate(
    text="Welcome to KugelAudio. This is an example of high-quality text-to-speech synthesis.",
    model_id="kugel-1-turbo",
)
print(f"Generated {audio.duration_seconds:.2f}s of audio in {audio.generation_ms:.0f}ms")
print(f"Real-time factor: {audio.rtf:.2f}x")

# Save to file
audio.save("example.wav")
print("Saved to example.wav")

# Close client
client.close()