Why Use KugelAudio with LiveKit?
- Native plugin: Drop-in TTS provider for LiveKit’s AgentSession
- Streaming support: Real-time WebSocket-based audio streaming
- Ultra-low latency: ~39ms time-to-first-audio with kugel-1-turbo
- Simple setup: Works with VoicePipelineAgent and the new AgentSession API
Installation
Requires livekit-agents>=1.0.0.
Quick Start
Minimal Voice Agent
Set the KUGELAUDIO_API_KEY environment variable or pass api_key directly to the TTS constructor.
Configuration
TTS Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | KUGELAUDIO_API_KEY env | Your KugelAudio API key |
| model | str | kugel-1-turbo | TTS model (kugel-1-turbo or kugel-1) |
| voice_id | int \| None | None | Voice ID to use (server default if None) |
| sample_rate | int | 24000 | Output sample rate in Hz |
| cfg_scale | float | 2.0 | CFG scale for generation quality |
| max_new_tokens | int | 2048 | Maximum tokens to generate |
| base_url | str | https://api.kugelaudio.com | API base URL |
| word_timestamps | bool | True | Enable word-level time alignments |
| http_session | ClientSession \| None | None | Optional aiohttp session to reuse |
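The defaults in the table can be mirrored in a small settings object that is checked before the values are handed to the plugin. This is only a sketch: KugelTTSSettings is a hypothetical helper and not part of the plugin. Its field names and defaults are copied from the table above, and the http_session field is left out to keep the example dependency-free.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Model names as listed in this document.
KNOWN_MODELS = ("kugel-1-turbo", "kugel-1")


@dataclass
class KugelTTSSettings:
    """Hypothetical container mirroring the documented TTS parameters."""

    api_key: Optional[str] = None
    model: str = "kugel-1-turbo"
    voice_id: Optional[int] = None  # None means "server default"
    sample_rate: int = 24000
    cfg_scale: float = 2.0
    max_new_tokens: int = 2048
    base_url: str = "https://api.kugelaudio.com"
    word_timestamps: bool = True

    def __post_init__(self) -> None:
        # Fall back to the environment variable when no key is passed.
        if self.api_key is None:
            self.api_key = os.environ.get("KUGELAUDIO_API_KEY")
        if not self.api_key:
            raise ValueError("no API key: set KUGELAUDIO_API_KEY or pass api_key")
        if self.model not in KNOWN_MODELS:
            raise ValueError(f"unknown model: {self.model!r}")

    def as_kwargs(self) -> dict:
        """Return the settings as keyword arguments named as in the table."""
        return dict(vars(self))
```

Assuming the TTS constructor accepts the keyword names listed in the table, the result of as_kwargs() could be unpacked into it. Verify this against the plugin version you have installed.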
Supported Sample Rates
| Rate | Notes |
|---|---|
| 24000 | Native rate (recommended) |
| 22050 | Half the CD sampling rate (44.1 kHz) |
| 16000 | Wideband telephony |
| 8000 | Narrowband telephony |
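When a downstream component asks for a rate that is not in the table, one option is to snap to the nearest supported value. The helper below is a hypothetical sketch. Only the list of rates comes from this document.

```python
# Rates taken from the table above.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000)


def nearest_supported_rate(requested_hz: int) -> int:
    """Return the supported rate closest to requested_hz.

    On a tie, the higher rate wins (the -rate term in the sort key).
    """
    return min(
        SUPPORTED_SAMPLE_RATES,
        key=lambda rate: (abs(rate - requested_hz), -rate),
    )


print(nearest_supported_rate(48000))  # 24000
print(nearest_supported_rate(8000))   # 8000
```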
Models
| Model | Parameters | Latency | Quality | Use Case |
|---|---|---|---|---|
| kugel-1-turbo | 1.5B | ~39ms TTFA | High | Real-time conversations |
| kugel-1 | 7B | ~77ms TTFA | Exceptional | Premium quality applications |
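The trade-off in the table can be expressed as a simple rule: pick the highest-quality model whose time-to-first-audio fits your latency budget. The function below is a hypothetical sketch. It uses the approximate TTFA figures from the table, which will differ in practice with network and load.

```python
# (model name, approximate TTFA in ms), ordered from highest quality to fastest.
MODELS_BY_QUALITY = (
    ("kugel-1", 77),
    ("kugel-1-turbo", 39),
)


def choose_model(ttfa_budget_ms: float) -> str:
    """Return the best-quality model whose listed TTFA fits the budget.

    Falls back to the fastest model when nothing fits.
    """
    for name, ttfa_ms in MODELS_BY_QUALITY:
        if ttfa_ms <= ttfa_budget_ms:
            return name
    return MODELS_BY_QUALITY[-1][0]


print(choose_model(100))  # kugel-1
print(choose_model(50))   # kugel-1-turbo
```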
Usage Patterns
Non-Streaming Synthesis
Use synthesize() for one-shot text-to-speech:
Streaming Synthesis
Use stream() for real-time text input (e.g., from an LLM):
Updating Options at Runtime
You can change TTS options dynamically without creating a new instance:
Word-Level Alignment
KugelAudio provides word-level time alignments out of the box. When word_timestamps=True (the default), the server performs forced alignment on each audio chunk and delivers per-word timing information alongside the audio.
LiveKit’s AgentSession uses these timings automatically for accurate barge-in handling and transcript synchronization via the aligned_transcript capability.
Word alignments add no extra audio latency. The alignment runs on the same GPU as the TTS model and timestamps are delivered shortly after each audio chunk (~50-200ms).
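To illustrate why per-word timings matter for barge-in, the sketch below trims a transcript to the words that had finished playing when the user interrupted. The WordTiming structure and its field names are assumptions made for illustration. This document does not show the plugin's actual alignment format, and AgentSession performs this kind of bookkeeping for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTiming:
    """Hypothetical per-word alignment record (times in seconds)."""

    word: str
    start_s: float
    end_s: float


def spoken_text(timings: List[WordTiming], interrupted_at_s: float) -> str:
    """Join the words whose audio had finished before the interruption."""
    return " ".join(t.word for t in timings if t.end_s <= interrupted_at_s)


timings = [
    WordTiming("Hello", 0.00, 0.32),
    WordTiming("there,", 0.32, 0.61),
    WordTiming("how", 0.70, 0.85),
    WordTiming("are", 0.85, 0.98),
    WordTiming("you?", 0.98, 1.30),
]

# The user interrupts 0.9 s into playback: "are" had not finished yet.
print(spoken_text(timings, interrupted_at_s=0.9))  # Hello there, how
```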
Plugin Registration
You can also register KugelAudio as a LiveKit plugin namespace:
Complete Voice Agent Example
Here’s a production-ready voice agent with metrics logging:
Running the Agent
Environment Variables
| Variable | Required | Description |
|---|---|---|
| KUGELAUDIO_API_KEY | Yes | Your KugelAudio API key |
| LIVEKIT_URL | Yes | Your LiveKit server URL |
| LIVEKIT_API_KEY | Yes | LiveKit API key |
| LIVEKIT_API_SECRET | Yes | LiveKit API secret |
| KUGELAUDIO_VOICE_ID | No | Default voice ID to use |
| DEEPGRAM_API_KEY | Yes* | Required if using Deepgram STT |
| OPENAI_API_KEY | Yes* | Required if using OpenAI LLM |
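A startup check along the following lines can report missing variables before the agent tries to connect. This is a minimal sketch: missing_env_vars and its use_deepgram / use_openai flags are hypothetical names, and only the variable names come from the table above.

```python
import os
from typing import List, Mapping

# Always required, per the table above.
REQUIRED_VARS = (
    "KUGELAUDIO_API_KEY",
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
)


def missing_env_vars(
    env: Mapping[str, str] = os.environ,
    use_deepgram: bool = False,
    use_openai: bool = False,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    required = list(REQUIRED_VARS)
    if use_deepgram:
        required.append("DEEPGRAM_API_KEY")
    if use_openai:
        required.append("OPENAI_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(use_deepgram=True, use_openai=True)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```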
Troubleshooting
API key not found
Make sure KUGELAUDIO_API_KEY is set in your environment or pass api_key directly:
WebSocket connection fails
Verify that your base_url is correct and that the KugelAudio API is reachable. The plugin connects via WebSocket (wss://) for audio streaming.
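A quick sanity check on base_url can catch common mistakes such as a missing scheme. The sketch below maps an http(s) base URL to its ws(s) counterpart using only the standard library. How the plugin derives its WebSocket endpoint, including any path, is not shown in this document, so treat to_websocket_url as a hypothetical diagnostic helper rather than a description of the plugin's behavior.

```python
from urllib.parse import urlsplit, urlunsplit


def to_websocket_url(base_url: str) -> str:
    """Map an http(s) URL to the matching ws(s) URL, or raise ValueError."""
    parts = urlsplit(base_url)
    scheme = {"https": "wss", "http": "ws"}.get(parts.scheme)
    if scheme is None or not parts.netloc:
        raise ValueError(f"base_url must be an http(s) URL, got {base_url!r}")
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


print(to_websocket_url("https://api.kugelaudio.com"))  # wss://api.kugelaudio.com
```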
Audio quality issues
- Use the native 24000 Hz sample rate for best results
- Try increasing cfg_scale (e.g., 2.5) for more expressive output
- Switch to the kugel-1 model for premium quality
High latency
- Use kugel-1-turbo for real-time conversations
- A lower cfg_scale (e.g., 1.5) trades slight quality for speed
- Reuse http_session across requests to avoid connection overhead