The official Python SDK for KugelAudio provides a simple, Pythonic interface for text-to-speech generation with both synchronous and asynchronous support.
## Installation

```bash
pip install kugelaudio
```

Or with uv (recommended):

```bash
uv add kugelaudio
```
## Quick Start

```python
from kugelaudio import KugelAudio

# Initialize the client
client = KugelAudio(api_key="your_api_key")

# Generate speech
audio = client.tts.generate(
    text="Hello, world!",
    model_id="kugel-1-turbo",
)

# Save to file
audio.save("output.wav")
```
## Pre-connecting for Low Latency

For latency-sensitive applications, pre-establish the WebSocket connection at startup to eliminate cold-start latency (~500ms) from your first TTS request.
### Async Applications (Recommended)

```python
import asyncio

from kugelaudio import KugelAudio

async def main():
    # Create a pre-connected client (the ~500ms connection happens here)
    client = await KugelAudio.create(api_key="your_api_key")

    # The first request is now fast (~100-150ms TTFA instead of ~600ms)
    async for chunk in client.tts.stream_async("Hello, world!"):
        if hasattr(chunk, 'audio'):
            play_audio(chunk.audio)

    await client.aclose()

asyncio.run(main())
```
### Sync Applications

For synchronous code, manually call `connect()` at startup:

```python
from kugelaudio import KugelAudio

# Initialize the client
client = KugelAudio(api_key="your_api_key")

# Pre-connect at startup (the ~500ms connection happens here)
client.connect()

# Check the connection status
print(f"Connected: {client.is_connected()}")

# The first request is now fast
for chunk in client.tts.stream("Hello, world!"):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
```
Without pre-connecting, the first TTS request includes WebSocket connection setup (~500ms).
Subsequent requests reuse the connection and are fast (~100-150ms TTFA).
Pre-connecting moves this overhead to application startup.
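The TTFA difference is easy to observe. The helper below is a hypothetical utility (not part of the SDK) that times how long any chunk iterator takes to yield its first item; it is exercised here with a stand-in generator, but you could pass `client.tts.stream("Hello, world!")` instead:

```python
import time
from typing import Any, Iterable, Tuple

def measure_ttfa(chunks: Iterable[Any]) -> Tuple[float, Any]:
    """Return (seconds until the first item arrives, the first item)."""
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first

# Stand-in generator simulating connection + generation delay
def fake_stream():
    time.sleep(0.05)
    yield b"\x00\x00" * 240  # one small PCM16 chunk

ttfa, first_chunk = measure_ttfa(fake_stream())
print(f"TTFA: {ttfa * 1000:.0f}ms, first chunk: {len(first_chunk)} bytes")
```

Running this once without `connect()` and once after it makes the moved connection cost visible directly.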
## Client Configuration

```python
from kugelaudio import KugelAudio

# Simple setup
client = KugelAudio(api_key="your_api_key")

# With custom options
client = KugelAudio(
    api_key="your_api_key",                # Required: your API key
    api_url="https://api.kugelaudio.com",  # Optional: API base URL
    timeout=60.0,                          # Optional: request timeout in seconds
)
```
### Local Development

For local development, point directly to your TTS server:

```python
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8000",
)
```

Or with separate backend and TTS servers:

```python
client = KugelAudio(
    api_key="your_api_key",
    api_url="http://localhost:8001",  # Backend for the REST API
    tts_url="http://localhost:8000",  # TTS server for WebSocket streaming
)
```
## Text-to-Speech

### Basic Generation

Generate complete audio and receive it all at once:

```python
audio = client.tts.generate(
    text="Hello, this is a test of the KugelAudio text-to-speech system.",
    model_id="kugel-1-turbo",  # 'kugel-1-turbo' (fast) or 'kugel-1' (quality)
    voice_id=123,              # Optional: specific voice ID
    cfg_scale=2.0,             # Guidance scale (1.0-5.0)
    max_new_tokens=2048,       # Maximum tokens to generate
    sample_rate=24000,         # Output sample rate
    normalize=True,            # Enable text normalization (default)
    language="en",             # Language for normalization (see below)
    word_timestamps=False,     # Request word-level timestamps (default: False)
)

# Audio properties
print(f"Duration: {audio.duration_seconds:.2f}s")
print(f"Samples: {audio.samples}")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Generation time: {audio.generation_ms:.0f}ms")
print(f"RTF: {audio.rtf:.2f}")  # Real-time factor

# Save to WAV file
audio.save("output.wav")

# Get raw PCM bytes
pcm_data = audio.audio

# Get WAV bytes (with header)
wav_bytes = audio.to_wav_bytes()
```
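The relationships between these properties are simple arithmetic over mono PCM16 at 24 kHz. The sketch below uses only the standard library; `pcm16_to_wav_bytes` is a stand-in illustrating what a WAV wrapper does, not the SDK's implementation, and the RTF line assumes RTF is generation time divided by audio duration:

```python
import io
import wave

SAMPLE_RATE = 24000  # documented output sample rate

def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw mono PCM16 data in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# One second of silence: 24000 samples, 2 bytes each
pcm = b"\x00\x00" * SAMPLE_RATE
wav_bytes = pcm16_to_wav_bytes(pcm)

samples = len(pcm) // 2
duration_seconds = samples / SAMPLE_RATE   # 1.0s of audio
# e.g. 250ms of generation time for 1s of audio:
rtf = 250 / (duration_seconds * 1000)
print(duration_seconds, rtf)
```

This also shows why `audio.audio` (headerless PCM) is smaller than `audio.to_wav_bytes()`: the latter adds the RIFF/WAV header.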
### Streaming Audio

Receive audio chunks as they are generated, for lower latency:

```python
# Synchronous streaming
for item in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-1-turbo",
):
    if hasattr(item, 'audio'):  # AudioChunk
        # Process the audio chunk immediately
        print(f"Chunk {item.index}: {len(item.audio)} bytes, {item.samples} samples")
        # play_audio(item.audio)
    elif isinstance(item, dict) and item.get('final'):
        # Final stats
        print(f"Total duration: {item.get('dur_ms', 0):.0f}ms")
        print(f"Generation time: {item.get('gen_ms', 0):.0f}ms")
```
### Async Streaming

For async applications:

```python
import asyncio

async def generate_speech():
    async for item in client.tts.stream_async(
        text="Async streaming example.",
        model_id="kugel-1-turbo",
    ):
        if hasattr(item, 'audio'):
            # Process the chunk
            pass

asyncio.run(generate_speech())
```
### Async Generation

```python
import asyncio

async def main():
    audio = await client.tts.generate_async(
        text="Async generation example.",
        model_id="kugel-1-turbo",
    )
    audio.save("async_output.wav")

asyncio.run(main())
```
## Text Normalization

Text normalization converts numbers, dates, times, and other non-verbal text into spoken words:

- “I have 3 apples” → “I have three apples”
- “The meeting is at 2:30 PM” → “The meeting is at two thirty PM”
- “€50.99” → “fifty euros and ninety-nine cents”

```python
# With an explicit language (recommended - fastest)
audio = client.tts.generate(
    text="I bought 3 items for €50.99 on 01/15/2024.",
    normalize=True,
    language="en",  # Specify the language for best performance
)

# With auto-detection (may cause incorrect normalizations)
audio = client.tts.generate(
    text="Ich habe 3 Artikel für 50,99€ gekauft.",
    normalize=True,
    # language not specified - it will be auto-detected
)
```
### Supported Languages

| Code | Language | Code | Language |
|------|----------|------|----------|
| de | German | nl | Dutch |
| en | English | pl | Polish |
| fr | French | sv | Swedish |
| es | Spanish | da | Danish |
| it | Italian | no | Norwegian |
| pt | Portuguese | fi | Finnish |
| cs | Czech | hu | Hungarian |
| ro | Romanian | el | Greek |
| uk | Ukrainian | bg | Bulgarian |
| tr | Turkish | vi | Vietnamese |
| ar | Arabic | hi | Hindi |
| zh | Chinese | ja | Japanese |
| ko | Korean | | |
Using `normalize=True` without specifying `language` may cause incorrect normalizations, especially for short texts or languages that share similar vocabulary. Always specify `language` when you know it.
### Spell Tags

Use `<spell>` tags to spell out text letter by letter. This is useful for email addresses, codes, acronyms, or any text that should be pronounced character by character:

```python
# Spell out an email address
audio = client.tts.generate(
    text="Contact me at <spell>kajo@kugelaudio.com</spell>",
    normalize=True,
    language="en",
)
# Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"

# Spell out an acronym
audio = client.tts.generate(
    text="The <spell>API</spell> is easy to use.",
    normalize=True,
    language="en",
)
# Output: "The A, P, I is easy to use."

# German example with language-specific translations
audio = client.tts.generate(
    text="Meine E-Mail ist <spell>test@beispiel.de</spell>",
    normalize=True,
    language="de",
)
# Output: "Meine E-Mail ist T, E, S, T, ät, B, E, I, S, P, I, E, L, Punkt, D, E"
```
Spell tags also work with streaming:

```python
# Streaming with spell tags - tags spanning chunks are handled automatically
async with client.tts.streaming_session(
    voice_id=123,
    normalize=True,
    language="en",
) as session:
    # Even if the tag is split across tokens, it works correctly
    async for chunk in session.send("My code is <spell>"):
        play_audio(chunk.audio)
    async for chunk in session.send("ABC123</spell>"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
```
**Special characters:** characters like `@`, `.`, and `-` are translated to language-specific words. For example, `@` becomes “at” in English, “ät” in German, and “arobase” in French.

**Model recommendation:** for clearer letter-by-letter pronunciation, use `model_id="kugel-1"` instead of `kugel-1-turbo`.
## LLM Integration: Streaming Sessions

For real-time TTS when streaming text from an LLM (such as GPT-4 or Claude):

### Async Streaming Session

```python
import asyncio

async def stream_from_llm():
    # Simulate an LLM token stream
    llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]

    async with client.tts.streaming_session(
        voice_id=123,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of no input
    ) as session:
        # Send tokens as they arrive from the LLM
        for token in llm_tokens:
            async for chunk in session.send(token):
                # Play each audio chunk immediately
                play_audio(chunk.audio)

        # Flush any remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)

asyncio.run(stream_from_llm())
```
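If you prefer explicit control over where generation flushes, rather than relying on `flush_timeout_ms`, a small client-side buffer that releases text at sentence boundaries is one option. The sketch below is SDK-independent; `SentenceBuffer` is a hypothetical helper, not part of the SDK:

```python
class SentenceBuffer:
    """Accumulate LLM tokens and release complete sentences."""

    TERMINATORS = (".", "!", "?")

    def __init__(self):
        self._buf = ""

    def feed(self, token: str) -> list:
        """Add a token; return any complete sentences now available."""
        self._buf += token
        sentences = []
        while True:
            idx = next(
                (i for i, ch in enumerate(self._buf) if ch in self.TERMINATORS),
                None,
            )
            if idx is None:
                break
            sentences.append(self._buf[: idx + 1].strip())
            self._buf = self._buf[idx + 1 :]
        return sentences

    def flush(self) -> str:
        """Return whatever text remains unreleased."""
        rest, self._buf = self._buf.strip(), ""
        return rest

buf = SentenceBuffer()
out = []
for token in ["Hello, ", "world. ", "How are ", "you?", " Fine"]:
    out.extend(buf.feed(token))   # released sentences would go to session.send(...)
out_tail = buf.flush()            # remainder would precede session.flush()
print(out, out_tail)
```

Each released sentence would be passed to `session.send(...)`, with the `flush()` remainder sent just before the session's own `flush()`.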
### Synchronous Streaming Session

```python
llm_tokens = ["Hello, ", "this ", "is ", "a ", "streamed ", "response."]

with client.tts.streaming_session_sync(voice_id=123) as session:
    for token in llm_tokens:
        for chunk in session.send(token):
            play_audio(chunk.audio)
    for chunk in session.flush():
        play_audio(chunk.audio)
```
## Word Timestamps with Generate

Request word-level time alignments alongside audio when using `generate()`:

```python
audio = client.tts.generate(
    text="Hello, how are you today?",
    model_id="kugel-1-turbo",
    word_timestamps=True,
)

# Access word timestamps from the response
for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")

# Example output:
# Hello: 0ms - 320ms (score: 0.98)
# how: 350ms - 480ms (score: 0.95)
# are: 500ms - 580ms (score: 0.97)
# you: 600ms - 720ms (score: 0.96)
# today: 750ms - 1100ms (score: 0.94)
```
Word timestamps are also available with async generation:

```python
audio = await client.tts.generate_async(
    text="Hello, world!",
    model_id="kugel-1-turbo",
    word_timestamps=True,
)

for ts in audio.word_timestamps:
    print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")
```
## Word Timestamps in Streaming

Word timestamps work with all streaming methods. During streaming, they are yielded as `list[WordTimestamp]` objects between audio chunks:

```python
from kugelaudio.models import WordTimestamp

for item in client.tts.stream(
    text="Hello, how are you today?",
    model_id="kugel-1-turbo",
    word_timestamps=True,
):
    if hasattr(item, 'audio'):  # AudioChunk
        play_audio(item.audio)
    elif isinstance(item, list) and item and isinstance(item[0], WordTimestamp):
        for ts in item:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")
```
## Word Timestamps in Streaming Sessions

Request word-level time alignments alongside audio. Timestamps are delivered per chunk, after the corresponding audio data:

```python
async with client.tts.streaming_session(
    voice_id=123,
    word_timestamps=True,
) as session:
    async for chunk in session.send("Hello, how are you today?"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

    # Access the latest word timestamps
    timestamps = session.last_word_timestamps
    for ts in timestamps:
        print(f"{ts.word}: {ts.start_ms}ms - {ts.end_ms}ms (score: {ts.score:.2f})")
```
You can also register a callback to process timestamps as they arrive:

```python
def on_timestamps(timestamps):
    for ts in timestamps:
        print(f"  {ts.word} [{ts.start_ms}-{ts.end_ms}ms]")

async with client.tts.streaming_session(
    voice_id=123,
    on_word_timestamps=on_timestamps,
) as session:
    async for chunk in session.send("Hello world!"):
        play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)
```
Word timestamps add no extra audio latency. They arrive ~50-200ms after the corresponding audio chunk and are useful for barge-in handling, subtitle synchronization, and lip-sync.
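As an example of the subtitle use case, word timestamps can be folded into SRT cues. The sketch below works on plain `(word, start_ms, end_ms)` tuples so it runs without the SDK; with real `WordTimestamp` objects you would read `ts.word`, `ts.start_ms`, and `ts.end_ms` instead:

```python
def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words) -> str:
    """Render one SRT cue per (word, start_ms, end_ms) tuple."""
    cues = []
    for i, (word, start_ms, end_ms) in enumerate(words, start=1):
        cues.append(f"{i}\n{ms_to_srt(start_ms)} --> {ms_to_srt(end_ms)}\n{word}\n")
    return "\n".join(cues)

srt = words_to_srt([("Hello", 0, 320), ("how", 350, 480)])
print(srt)
```

In practice you would group several words per cue rather than one; the timestamp formatting stays the same.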
## Voices

### List Available Voices

```python
# List all available voices
voices = client.voices.list()
for voice in voices:
    print(f"{voice.id}: {voice.name}")
    print(f"  Category: {voice.category}")
    print(f"  Languages: {', '.join(voice.supported_languages)}")

# Filter by language
german_voices = client.voices.list(language="de")

# Get only public voices
public_voices = client.voices.list(include_public=True)

# Limit results
first_10 = client.voices.list(limit=10)
```
### Get a Specific Voice

```python
voice = client.voices.get(voice_id=123)
print(f"Voice: {voice.name}")
print(f"Category: {voice.category}")
```
### Create a Voice

Create a new voice with optional reference audio files:

```python
voice = client.voices.create(
    name="My Custom Voice",
    sex="female",
    description="A warm, conversational voice",
    category="cloned",
    reference_files=["reference1.wav", "reference2.wav"],
)
print(f"Created voice: {voice.id}")
```

The `reference_files` parameter accepts file paths (`str` or `Path`) to audio files (WAV, MP3, or FLAC).
### Update a Voice

```python
voice = client.voices.update(
    voice_id=123,
    name="Updated Name",
    description="New description",
)
```

### Delete a Voice

```python
client.voices.delete(voice_id=123)
```
### Manage Reference Audio

```python
# List references for a voice
refs = client.voices.list_references(voice_id=123)
for ref in refs:
    print(f"{ref.id}: {ref.name}")

# Add a new reference
ref = client.voices.add_reference(
    voice_id=123,
    file_path="new_reference.wav",
    reference_text="Optional transcript of the audio.",
)

# Delete a reference
client.voices.delete_reference(voice_id=123, reference_id=456)
```
### Publish a Voice

Request that your voice be made publicly available. An admin will verify it before it becomes visible to others.

```python
voice = client.voices.publish(voice_id=123)
print(f"Pending verification: {voice.pending_verification}")
```

### Generate Voice Sample

Trigger sample audio generation for a voice:

```python
voice = client.voices.generate_sample(voice_id=123)
print(f"Sample URL: {voice.sample_url}")
```
## Models

### List Available Models

```python
models = client.models.list()
for model in models:
    print(f"{model.id}: {model.name}")
    print(f"  Description: {model.description}")
    print(f"  Parameters: {model.parameters}")
    print(f"  Max Input: {model.max_input_length} characters")
    print(f"  Sample Rate: {model.sample_rate} Hz")
```
## Error Handling

```python
from kugelaudio import KugelAudio
from kugelaudio.exceptions import (
    KugelAudioError,
    AuthenticationError,
    RateLimitError,
    InsufficientCreditsError,
    ValidationError,
)

client = KugelAudio(api_key="your_api_key")

try:
    audio = client.tts.generate(text="Hello!")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded, please wait")
except InsufficientCreditsError:
    print("Not enough credits, please top up")
except ValidationError as e:
    print(f"Invalid request: {e}")
except KugelAudioError as e:
    print(f"API error: {e}")
```
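For transient failures such as rate limits, retrying with exponential backoff is a common pattern. The sketch below is self-contained so it runs without the SDK: it defines a stand-in `RateLimitError` and a fake call; in practice you would catch `kugelaudio.exceptions.RateLimitError` around `client.tts.generate(...)` and use a longer base delay:

```python
import time

class RateLimitError(Exception):
    """Stand-in for kugelaudio.exceptions.RateLimitError."""

def with_retries(fn, max_attempts: int = 4, base_delay: float = 0.01):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the error
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms, ...

# Fake call that fails twice, then succeeds:
calls = {"n": 0}
def flaky_generate():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("slow down")
    return "audio"

result = with_retries(flaky_generate)
print(result, calls["n"])
```

Only retry errors that are actually transient; `AuthenticationError` or `ValidationError` will not succeed on a second attempt.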
## Data Models

### AudioChunk

Represents a single audio chunk from streaming:

```python
class AudioChunk:
    audio: bytes      # Raw PCM16 audio data
    encoding: str     # 'pcm_s16le'
    index: int        # Chunk index (0-based)
    sample_rate: int  # Sample rate (24000)
    samples: int      # Number of samples in the chunk

    @property
    def duration_seconds(self) -> float:
        """Duration of this chunk in seconds."""
```
### AudioResponse

Complete audio response from generation:

```python
class AudioResponse:
    audio: bytes          # Complete PCM16 audio
    sample_rate: int      # Sample rate (24000)
    samples: int          # Total samples
    duration_ms: float    # Duration in milliseconds
    generation_ms: float  # Generation time in milliseconds
    rtf: float            # Real-time factor
    word_timestamps: list[WordTimestamp]  # Per-word timing (when word_timestamps=True)

    @property
    def duration_seconds(self) -> float:
        """Duration in seconds."""

    def save(self, path: str) -> None:
        """Save as a WAV file."""

    def to_wav_bytes(self) -> bytes:
        """Get the WAV file as bytes."""
```
### WordTimestamp

Word-level time alignment for a generated audio chunk:

```python
class WordTimestamp:
    word: str        # The aligned word
    start_ms: int    # Start time in milliseconds (relative to the chunk)
    end_ms: int      # End time in milliseconds (relative to the chunk)
    char_start: int  # Start character offset in the original text
    char_end: int    # End character offset in the original text
    score: float     # Alignment confidence (0.0 - 1.0)
```
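Since `start_ms`/`end_ms` are chunk-relative during streaming, assembling an absolute timeline means adding each chunk's running offset. A minimal sketch using plain tuples instead of SDK objects (in practice each chunk's duration would come from `samples / sample_rate`):

```python
def absolutize(chunks):
    """chunks: list of (chunk_duration_ms, [(word, start_ms, end_ms), ...]).
    Returns [(word, abs_start_ms, abs_end_ms), ...] on one timeline."""
    offset = 0
    out = []
    for duration_ms, words in chunks:
        for word, start_ms, end_ms in words:
            out.append((word, offset + start_ms, offset + end_ms))
        offset += duration_ms  # the next chunk starts after this one
    return out

timeline = absolutize([
    (500, [("Hello", 0, 320)]),
    (600, [("world", 50, 400)]),  # this chunk starts at 500ms absolute
])
print(timeline)
```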
### Model

TTS model information:

```python
class Model:
    id: str                # 'kugel-1-turbo' or 'kugel-1'
    name: str              # Human-readable name
    description: str       # Model description
    parameters: str        # Model parameter count
    max_input_length: int  # Maximum input characters
    sample_rate: int       # Output sample rate
```
### Voice

Voice information (returned by `list`):

```python
class Voice:
    id: int                         # Voice ID
    voice_id: int                   # Same as id (backward compatibility)
    name: str                       # Voice name
    description: Optional[str]      # Description
    category: str                   # 'premade', 'cloned', 'designed', 'conversational',
                                    # 'narrative', 'narrative_story', 'characters'
    sex: Optional[str]              # 'male', 'female', 'neutral'
    age: Optional[str]              # 'young', 'middle_aged', 'old'
    supported_languages: List[str]  # ['en', 'de', ...]
    avatar_url: Optional[str]       # Avatar image URL
    sample_url: Optional[str]       # Sample audio URL
```
### VoiceDetail

Extended voice information (returned by `create`, `update`, `get`, `publish`, and `generate_sample`):

```python
class VoiceDetail:
    id: int
    name: str
    description: str
    generative_voice_description: str
    supported_languages: List[str]
    category: str
    age: Optional[str]
    sex: Optional[str]
    quality: str  # 'low', 'mid', 'high'
    is_public: bool
    verified: bool
    pending_verification: bool
    sample_url: Optional[str]
    avatar_url: Optional[str]
    sample_text: str
```
### VoiceReference

Voice reference audio metadata:

```python
class VoiceReference:
    id: int
    voice_id: int
    name: str
    reference_text: str
    s3_path: str
    audio_url: Optional[str]
    is_generated: bool
```
## Complete Example

```python
from kugelaudio import KugelAudio

# Initialize the client
client = KugelAudio(api_key="your_api_key")

# List available models
print("Available Models:")
for model in client.models.list():
    print(f"  - {model.id}: {model.name} ({model.parameters})")

# List available voices
print("\nAvailable Voices:")
for voice in client.voices.list(limit=5):
    print(f"  - {voice.id}: {voice.name}")

# Generate audio
print("\nGenerating audio...")
audio = client.tts.generate(
    text="Welcome to KugelAudio. This is an example of high-quality text-to-speech synthesis.",
    model_id="kugel-1-turbo",
)
print(f"Generated {audio.duration_seconds:.2f}s of audio in {audio.generation_ms:.0f}ms")
print(f"Real-time factor: {audio.rtf:.2f}x")

# Save to file
audio.save("example.wav")
print("Saved to example.wav")

# Close the client
client.close()
```