
Generate Speech

Generate audio from text. Returns complete audio after generation.
POST

Request Body

text
string
required
The text to convert to speech. Maximum length depends on the model.
model_id
string
default:"kugel-1-turbo"
The model to use. Options: kugel-1-turbo, kugel-1
voice_id
integer
The voice ID to use. If not specified, uses the default voice.
cfg_scale
number
default:"2.0"
Classifier-free guidance scale. Range: 1.0-5.0. Higher values = more expressive.
max_new_tokens
integer
default:"2048"
Maximum tokens to generate. Limits output length.
sample_rate
integer
default:"24000"
Output sample rate in Hz. Options: 8000, 16000, 22050, 24000. Audio is generated natively at 24kHz; lower rates use server-side resampling with minimal latency impact (~0.1ms per chunk).
normalize
boolean
default:"true"
Enable text normalization (converts numbers, dates, etc. to spoken words). For best performance, always specify the language parameter to skip auto-detection (~150ms latency).
language
string
ISO 639-1 language code for text normalization (e.g., ‘de’, ‘en’, ‘fr’). Supported: de, en, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, el, uk, bg, tr, vi, ar, hi, zh, ja, ko. If not provided and normalize is true, the language is auto-detected (adds ~150ms latency).

Spell Tags

Use <spell> tags to spell out text letter by letter. This is useful for:
  • Email addresses
  • Acronyms and abbreviations
  • Serial numbers or codes
  • Any text that should be pronounced character by character
{
  "text": "My email is <spell>kajo@kugelaudio.com</spell>",
  "normalize": true,
  "language": "en"
}
Output: “My email is K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M”
Spell tags require normalize: true. Special characters are translated to language-specific words:
  • English: @ → “at”, . → “dot”
  • German: @ → “ät”, . → “Punkt”
  • French: @ → “arobase”, . → “point”
Model recommendation: Spell tags work best with kugel-1 for clearer letter-by-letter pronunciation. Use kugel-1-turbo when latency is critical, but expect slightly less precise spelling.
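When building requests programmatically, the tag wrapping can live in a small helper. The sketch below is a hypothetical client-side utility (spell_out is not part of the API); the email address matches the example above:

```python
def spell_out(s: str) -> str:
    """Wrap a substring in <spell> tags so it is pronounced character by character."""
    return f"<spell>{s}</spell>"

text = f"My email is {spell_out('kajo@kugelaudio.com')}"
payload = {"text": text, "normalize": True, "language": "en"}
print(payload["text"])
# My email is <spell>kajo@kugelaudio.com</spell>
```

Remember that the payload must also set normalize to true, or the tags will not be expanded.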

Response

Returns raw PCM16 audio as a streaming binary response (audio/pcm).

Response headers:

Header | Value | Description
Content-Type | audio/pcm | Raw PCM audio stream
X-Sample-Rate | 24000 | Sample rate of the audio
X-Audio-Format | pcm_s16le | Audio encoding format
The response body is raw PCM 16-bit signed little-endian audio data streamed as binary chunks.

Example

curl -X POST "https://api.kugelaudio.com/v1/tts/generate" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test of the KugelAudio API.",
    "model_id": "kugel-1-turbo",
    "voice_id": 123,
    "cfg_scale": 2.0
  }'
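Because the body is headerless PCM, most players will not open it directly; wrapping it in a WAV container with Python's standard wave module is enough. A minimal sketch, assuming mono output (the docs do not state a channel count):

```python
import wave

def pcm_to_wav(pcm: bytes, wav_path: str, sample_rate: int = 24000) -> None:
    """Wrap raw PCM s16le bytes in a WAV container."""
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)        # assumption: mono output
        wav.setsampwidth(2)        # pcm_s16le: 2 bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)

# e.g. after saving the curl response body to output.pcm:
# pcm_to_wav(open("output.pcm", "rb").read(), "output.wav")
```

Use the X-Sample-Rate response header as the sample_rate argument if you requested a non-default rate.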

Stream Speech (WebSocket)

Stream audio chunks as they’re generated for lower latency.
WebSocket

Connection

Connect with your API key:
wss://api.kugelaudio.com/ws/tts?api_key=YOUR_API_KEY

Request Message

Send a JSON message to start generation:
{
  "text": "Hello, this is streaming audio.",
  "model_id": "kugel-1-turbo",
  "voice_id": 123,
  "cfg_scale": 2.0,
  "normalize": true,
  "language": "en"
}
Text Normalization: Set normalize: true to convert numbers, dates, and symbols to spoken words. Always specify language to avoid ~150ms auto-detection latency.
Spell Tags in Streaming: You can use <spell> tags even when streaming text token-by-token. The system automatically buffers text until spell tags are complete before generating audio. If a stream ends with an incomplete tag (e.g., connection drops), the tag is auto-closed.

Response Messages

Audio Chunk

{
  "audio": "base64_encoded_pcm16_data",
  "enc": "pcm_s16le",
  "idx": 0,
  "sr": 24000,
  "samples": 4800
}
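The audio field decodes to raw PCM16 bytes, so samples is always half the decoded byte length (2 bytes per 16-bit sample). A quick check using a synthetic chunk in the documented shape (not real API output):

```python
import base64

# Synthetic chunk mimicking the documented message shape
pcm = b"\x00\x00" * 4800  # 4800 16-bit samples of silence
chunk = {
    "audio": base64.b64encode(pcm).decode("ascii"),
    "enc": "pcm_s16le",
    "idx": 0,
    "sr": 24000,
    "samples": 4800,
}

decoded = base64.b64decode(chunk["audio"])
assert len(decoded) // 2 == chunk["samples"]  # 2 bytes per sample
print(len(decoded))  # 9600 bytes = 200 ms of audio at 24 kHz
```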

Final Message

{
  "final": true,
  "chunks": 10,
  "total_samples": 48000,
  "dur_ms": 2000,
  "gen_ms": 150,
  "rtf": 0.075
}
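The final-message fields are internally consistent: the duration follows from total_samples and the sample rate, and rtf is the ratio of generation time to audio duration. Checking with the values above:

```python
total_samples = 48000
sr = 24000
gen_ms = 150

dur_ms = total_samples / sr * 1000  # 2000.0 ms of audio
rtf = gen_ms / dur_ms               # 0.075 (~13x faster than realtime)

assert dur_ms == 2000.0
assert rtf == 0.075
print(dur_ms, rtf)
```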

Example

import asyncio
import websockets
import json
import base64

async def stream_tts():
    uri = "wss://api.kugelaudio.com/ws/tts?api_key=YOUR_API_KEY"
    audio_chunks = []
    
    async with websockets.connect(uri) as ws:
        # Send request
        await ws.send(json.dumps({
            "text": "Hello, this is streaming audio.",
            "model_id": "kugel-1-turbo",
            "voice_id": 268,
            "cfg_scale": 2.0,
        }))
        
        # Receive chunks
        async for message in ws:
            data = json.loads(message)
            
            if "audio" in data:
                audio_chunks.append(base64.b64decode(data["audio"]))
                print(f"Chunk {data['idx']}: {data['samples']} samples")
            
            if data.get("final"):
                print(f"Complete: {data['dur_ms']}ms audio in {data['gen_ms']}ms")
                break

asyncio.run(stream_tts())

Stream Input (WebSocket)

Stream text input token-by-token for LLM integration.
WebSocket

Connection

wss://api.kugelaudio.com/ws/tts/stream?api_key=YOUR_API_KEY

Protocol

  1. Send config: Initial configuration message
  2. Send text: Text chunks as they arrive
  3. Send flush: Force generation of buffered text
  4. Send close: End the session
  5. Receive audio: Audio chunks as they’re generated

Messages

Config Message

{
  "voice_id": 123,
  "model_id": "kugel-1-turbo",
  "cfg_scale": 2.0,
  "sample_rate": 24000
}

Text Message

{
  "text": "chunk of text"
}

Flush Message

{
  "flush": true
}

Close Message

{
  "close": true
}
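The four client messages above are simple enough to wrap in small helpers. These are hypothetical client-side utilities, not part of any SDK:

```python
import json

def config_msg(voice_id: int, model_id: str = "kugel-1-turbo",
               cfg_scale: float = 2.0, sample_rate: int = 24000) -> str:
    """Initial configuration message."""
    return json.dumps({"voice_id": voice_id, "model_id": model_id,
                       "cfg_scale": cfg_scale, "sample_rate": sample_rate})

def text_msg(text: str) -> str:
    """A text chunk as it arrives from the LLM."""
    return json.dumps({"text": text})

def flush_msg() -> str:
    """Force generation of buffered text."""
    return json.dumps({"flush": True})

def close_msg() -> str:
    """End the session."""
    return json.dumps({"close": True})

print(text_msg("chunk of text"))  # {"text": "chunk of text"}
```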

Example

import asyncio
import websockets
import json
import base64

def play_audio(chunk: bytes) -> None:
    # Placeholder: route PCM bytes to your audio playback pipeline
    pass

async def stream_from_llm(llm_tokens):
    uri = "wss://api.kugelaudio.com/ws/tts/stream?api_key=YOUR_API_KEY"
    
    async with websockets.connect(uri) as ws:
        # Send config
        await ws.send(json.dumps({
            "voice_id": 123,
            "model_id": "kugel-1-turbo",
            "cfg_scale": 2.0,
        }))
        
        # Stream tokens
        for token in llm_tokens:
            await ws.send(json.dumps({"text": token}))
            
            # Check for audio (non-blocking)
            try:
                message = await asyncio.wait_for(ws.recv(), timeout=0.01)
                data = json.loads(message)
                if "audio" in data:
                    audio_bytes = base64.b64decode(data["audio"])
                    play_audio(audio_bytes)
            except asyncio.TimeoutError:
                pass
        
        # Flush and close
        await ws.send(json.dumps({"flush": True}))
        await ws.send(json.dumps({"close": True}))
        
        # Receive remaining audio
        async for message in ws:
            data = json.loads(message)
            if "audio" in data:
                audio_bytes = base64.b64decode(data["audio"])
                play_audio(audio_bytes)
            if data.get("session_closed"):
                break

# Example usage
tokens = ["Hello, ", "this ", "is ", "streaming ", "from ", "an ", "LLM."]
asyncio.run(stream_from_llm(tokens))

Multi-Context Streaming (WebSocket)

Manage up to 5 independent audio streams over a single WebSocket connection. Useful for multi-speaker conversations, pre-buffering, and interleaved audio.
WebSocket

Connection

wss://api.kugelaudio.com/ws/tts/multi?api_key=YOUR_API_KEY

Client → Server Messages

Message | Description
{"text": " ", "context_id": "ctx1", "voice_settings": {"voice_id": 123}} | Initialize context with voice
{"text": "Hello", "context_id": "ctx1"} | Send text to context
{"text": "...", "context_id": "ctx1", "flush": true} | Send text and flush buffer
{"flush": true, "context_id": "ctx1"} | Flush context buffer
{"close_context": true, "context_id": "ctx1"} | Close specific context
{"close_socket": true} | Close all contexts and the connection

Server → Client Messages

Message | Description
{"context_created": true, "context_id": "ctx1"} | Context created
{"generation_started": true, "context_id": "ctx1", "chunk_id": 0, "text": "..."} | Generation started
{"audio": "base64...", "enc": "pcm_s16le", "context_id": "ctx1", "idx": 0, "sr": 24000} | Audio chunk
{"chunk_complete": true, "context_id": "ctx1", "chunk_id": 0, "audio_seconds": 1.2} | Chunk complete
{"is_final": true, "context_id": "ctx1"} | All generation complete for context
{"context_closed": true, "context_id": "ctx1"} | Context closed
{"session_closed": true, "total_audio_seconds": 5.4} | Session ended

Voice Settings

When creating a context, pass voice settings as a nested object:
{
  "voice_settings": {
    "voice_id": 123,
    "cfg_scale": 2.0,
    "max_new_tokens": 2048
  }
}
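When consuming a multi-context stream, it helps to demultiplex audio by context_id. A minimal sketch that groups decoded PCM per context from already-parsed server messages (the sample messages are synthetic, in the documented shape):

```python
import base64
from collections import defaultdict

def demux(messages):
    """Group decoded PCM bytes by context_id from parsed server messages."""
    buffers = defaultdict(bytearray)
    for msg in messages:
        if "audio" in msg:
            buffers[msg["context_id"]].extend(base64.b64decode(msg["audio"]))
    return buffers

# Synthetic messages mimicking the documented shape (not real API output)
msgs = [
    {"audio": base64.b64encode(b"\x01\x00" * 4).decode(), "context_id": "narrator", "idx": 0},
    {"audio": base64.b64encode(b"\x02\x00" * 4).decode(), "context_id": "character", "idx": 0},
    {"is_final": True, "context_id": "narrator"},
]
buffers = demux(msgs)
print({k: len(v) for k, v in buffers.items()})  # {'narrator': 8, 'character': 8}
```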

Example

import asyncio
import websockets
import json
import base64

async def multi_speaker():
    uri = "wss://api.kugelaudio.com/ws/tts/multi?api_key=YOUR_API_KEY"
    
    async with websockets.connect(uri) as ws:
        # Create narrator context
        await ws.send(json.dumps({
            "text": " ",
            "context_id": "narrator",
            "voice_settings": {"voice_id": 123},
        }))
        
        # Create character context
        await ws.send(json.dumps({
            "text": " ",
            "context_id": "character",
            "voice_settings": {"voice_id": 456},
        }))
        
        # Send text to different speakers
        await ws.send(json.dumps({
            "text": "The story begins.",
            "context_id": "narrator",
            "flush": True,
        }))
        
        await ws.send(json.dumps({
            "text": "Hello, I'm the main character!",
            "context_id": "character",
            "flush": True,
        }))
        
        # Receive audio until both contexts report is_final
        finished = set()
        async for message in ws:
            data = json.loads(message)
            
            if "audio" in data:
                ctx = data["context_id"]
                audio_bytes = base64.b64decode(data["audio"])
                print(f"[{ctx}] Chunk {data['idx']}: {len(audio_bytes)} bytes")
            
            if data.get("is_final"):
                finished.add(data["context_id"])
                if finished == {"narrator", "character"}:
                    break
        
        # Close all contexts and the connection
        await ws.send(json.dumps({"close_socket": True}))

asyncio.run(multi_speaker())

Limits

  • Maximum 5 concurrent contexts per connection
  • Contexts auto-close after 20 seconds of inactivity

Response Fields

Audio Chunk Fields (WebSocket)

Field | Type | Description
audio | string | Base64-encoded PCM16 audio data
enc | string | Audio encoding (always pcm_s16le)
idx | integer | Chunk index (0-based)
sr | integer | Sample rate in Hz
samples | integer | Number of samples in this chunk

Streaming Stats

Field | Type | Description
final | boolean | Indicates generation complete
chunks | integer | Number of chunks generated
total_samples | integer | Total audio samples generated
dur_ms | number | Total audio duration in ms
gen_ms | number | Total generation time in ms
rtf | number | Real-time factor (gen_ms / dur_ms)

Error Responses

Validation Error

{
  "error": {
    "code": "invalid_request",
    "message": "Text exceeds maximum length",
    "details": {
      "max_length": 4096,
      "provided_length": 5000
    }
  }
}

Voice Not Found

{
  "error": {
    "code": "not_found",
    "message": "Voice not found",
    "details": {
      "voice_id": 999
    }
  }
}

Rate Limited

{
  "error": {
    "code": "rate_limited",
    "message": "Too many requests",
    "details": {
      "retry_after": 60
    }
  }
}
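Clients can branch on error.code from the shapes above. A minimal sketch of the dispatch; the retry policy here is an assumption, not API guidance:

```python
def next_action(error_body: dict):
    """Map a parsed error response to a (action, detail) pair."""
    err = error_body.get("error", {})
    code = err.get("code")
    if code == "rate_limited":
        # Honor the server-suggested wait before retrying
        return ("retry_after", err.get("details", {}).get("retry_after", 1))
    if code in ("invalid_request", "not_found"):
        # Client-side problem: fix the request rather than retrying
        return ("fail", err.get("message"))
    return ("fail", "unknown error")

print(next_action({"error": {"code": "rate_limited", "message": "Too many requests",
                             "details": {"retry_after": 60}}}))
# ('retry_after', 60)
```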