Generate Speech
Generate audio from text. Returns the complete audio after generation.

POST
Request Body

- **Text**: The text to convert to speech. Maximum 10,000 characters.
- **Model**: The model to use. Options: `kugel-1-turbo`, `kugel-1`.
- **Voice ID**: The voice ID to use. If not specified, the default voice is used.
- **CFG scale**: Classifier-free guidance scale. Range: 0.0-10.0. Higher values produce more expressive speech.
- **Max tokens**: Maximum tokens to generate. Range: 1-8192. Limits output length.
- **Sample rate**: Output sample rate in Hz. Options: 8000, 16000, 22050, 24000. Audio is generated natively at 24 kHz; lower rates use server-side resampling with minimal latency impact (~0.1 ms per chunk).
- **Normalize**: Enable text normalization (converts numbers, dates, etc. to spoken words). Always specify the `language` parameter to ensure correct normalization; auto-detection may produce incorrect results for short texts.
- **Language**: ISO 639-1 language code for text normalization (e.g., `de`, `en`, `fr`). Supported: de, en, fr, es, it, pt, nl, pl, sv, da, no, fi, cs, hu, ro, el, uk, bg, tr, vi, ar, hi, zh, ja, ko, sk, sl, hr, sr, ru, he, fa, ur, bn, ta, yue, th, id, ms. If not provided and `normalize` is true, the language is auto-detected; auto-detection may produce incorrect normalizations for short texts or for languages that share similar vocabulary.
- **Speed**: Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster). Uses pitch-preserving time-stretching (WSOLA) so the voice pitch stays natural at any speed.
Inline `<prosody rate="slow|fast|...">` tags can be used for per-segment speed control within a single request.

Spell Tags
Use `<spell>` tags to spell out text letter by letter. This is useful for:
- Email addresses
- Acronyms and abbreviations
- Serial numbers or codes
- Any text that should be pronounced character by character
Spell tags require `normalize: true`. Special characters are translated to language-specific words:

- English: `@` → "at", `.` → "dot"
- German: `@` → "ät", `.` → "Punkt"
- French: `@` → "arobase", `.` → "point"
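As a sketch of how the tags combine with normalization, the snippet below builds a request payload containing a spelled-out email address. The JSON field names (`text`, `normalize`, `language`) are assumptions based on the parameters described above, not a confirmed schema.

```python
# Sketch of a request body using <spell> tags. The field names
# ("text", "normalize", "language") are assumptions, not a
# confirmed schema.
import json

def spell(fragment: str) -> str:
    """Wrap a fragment in <spell> tags so it is read letter by letter."""
    return f"<spell>{fragment}</spell>"

payload = {
    # The address is spelled out: "s-u-p-p-o-r-t at example dot com"
    "text": f"Please write to {spell('support@example.com')}.",
    "normalize": True,   # required for <spell> tags to work
    "language": "en",    # "@" -> "at", "." -> "dot"
}

print(json.dumps(payload, indent=2))
```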
Response
Returns raw PCM16 audio as a streaming binary response (audio/pcm).
AI watermark (EU AI Act Art. 50): All generated audio is automatically watermarked using AudioSeal, an imperceptible neural watermark. This is required under EU AI Act Article 50 for AI-generated audio content. The watermark is inaudible and does not affect audio quality.
| Header | Value | Description |
|---|---|---|
| Content-Type | audio/pcm | Raw PCM audio stream |
| X-Sample-Rate | 24000 | Sample rate of the audio |
| X-Audio-Format | pcm_s16le | Audio encoding format |
Example
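A minimal client sketch follows. The endpoint URL, auth header, and JSON field names are assumptions; substitute the real values for your account. Wrapping the raw `pcm_s16le` response in a WAV container matches the documented response format and headers.

```python
# Hypothetical client for the generate endpoint. The URL, the auth
# header, and the JSON field names are assumptions; replace them with
# the real values.
import io
import json
import wave

def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw mono pcm_s16le audio in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

def generate_speech(api_key: str, text: str) -> bytes:
    """POST the text and return the audio wrapped as WAV bytes."""
    import urllib.request
    req = urllib.request.Request(
        "https://api.example.com/v1/tts/generate",  # assumed URL
        data=json.dumps({"text": text, "normalize": True,
                         "language": "en"}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Respect the sample rate reported in the response headers.
        sr = int(resp.headers.get("X-Sample-Rate", "24000"))
        return pcm16_to_wav(resp.read(), sample_rate=sr)
```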
Stream Speech (WebSocket)
Stream audio chunks as they're generated for lower latency.

Audio generated via WebSocket endpoints is also watermarked (EU AI Act Art. 50). See the Generate Speech section for details.
WebSocket
Connection
Connect with your API key:

Request Message
Send a JSON message to start generation:

- **Word timestamps** (`word_timestamps`): Enable word-level timestamp alignment. When enabled, a `word_timestamps` message is sent after the audio chunks with per-word timing data.
- **Speed**: Playback speed multiplier. Range: 0.8 (20% slower) to 1.2 (20% faster). Uses pitch-preserving WSOLA.
- **Speaker prefix**: Prepend an internal speaker prefix to the text for better voice consistency.

Text Normalization: Set `normalize: true` to convert numbers, dates, and symbols to spoken words. Always specify `language` to ensure correct normalization; auto-detection may produce incorrect results for short texts.

Response Messages
Audio Chunk
Word Timestamps (when word_timestamps: true)
Final Message
Example
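A client sketch using the third-party `websockets` package. The endpoint URL, the `api_key` query parameter, and the request field names other than `word_timestamps` are assumptions based on the parameters above.

```python
# Streaming client sketch. The URL scheme, the api_key query
# parameter, and the "speed" field name are assumptions.
import asyncio
import base64
import json

def build_request(text: str) -> str:
    """Assemble the JSON message that starts generation."""
    return json.dumps({
        "text": text,
        "normalize": True,
        "language": "en",
        "word_timestamps": True,
        "speed": 1.0,  # assumed name for the playback-speed multiplier
    })

async def stream(url: str, api_key: str, text: str) -> bytes:
    """Collect base64 PCM chunks until the final message arrives."""
    import websockets  # pip install websockets
    pcm = bytearray()
    async with websockets.connect(f"{url}?api_key={api_key}") as ws:
        await ws.send(build_request(text))
        async for raw in ws:
            msg = json.loads(raw)
            if "audio" in msg:                 # audio chunk
                pcm.extend(base64.b64decode(msg["audio"]))
            elif msg.get("word_timestamps"):   # per-word timing data
                pass
            elif msg.get("final"):             # generation complete
                break
    return bytes(pcm)
```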
Stream Input (WebSocket)
Stream text input token-by-token for LLM integration.

WebSocket
Connection
Protocol
- Send config: Initial configuration message
- Send text: Text chunks as they arrive
- Send flush: Force generation of buffered text
- Send close: End the session
- Receive audio: Audio chunks as they’re generated
Messages
Config Message
Text Message
Flush Message
Close Message
Response Messages
Generation Started
Audio Chunk
Word Timestamps (when word_timestamps: true)
Chunk Complete
Session Closed
Example
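The protocol steps above can be sketched as a message sequence. The `{"type": ...}` envelope and field names are assumptions inferred from the message list (config, text, flush, close); only the message *kinds* come from the source.

```python
# Stream-input protocol sketch. The {"type": ...} envelope is an
# assumption; the message kinds (config/text/flush/close) come from
# the protocol list above.
import json

def config_msg(**options) -> str:
    return json.dumps({"type": "config", **options})

def text_msg(chunk: str) -> str:
    return json.dumps({"type": "text", "text": chunk})

def flush_msg() -> str:
    return json.dumps({"type": "flush"})

def close_msg() -> str:
    return json.dumps({"type": "close"})

# Typical session: configure, stream LLM tokens as they arrive,
# flush at sentence boundaries, then close.
session = [
    config_msg(model_id="kugel-1-turbo", normalize=True, language="en"),
    text_msg("Hello, "),
    text_msg("world."),
    flush_msg(),   # force generation of the buffered text
    close_msg(),   # end the session
]
```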
Multi-Context Streaming (WebSocket)
Manage up to 5 independent audio streams over a single WebSocket connection. Useful for multi-speaker conversations, pre-buffering, and interleaved audio.

WebSocket
Connection
Client → Server Messages
| Message | Description |
|---|---|
| `{"text": " ", "context_id": "ctx1", "voice_settings": {"voice_id": 123}}` | Initialize context with voice |
| `{"text": "Hello", "context_id": "ctx1"}` | Send text to context |
| `{"text": "...", "context_id": "ctx1", "flush": true}` | Send text and flush buffer |
| `{"flush": true, "context_id": "ctx1"}` | Flush context buffer |
| `{"close_context": true, "context_id": "ctx1"}` | Close specific context |
| `{"close_socket": true}` | Close all contexts and connection |
Server → Client Messages
| Message | Description |
|---|---|
| `{"context_created": true, "context_id": "ctx1"}` | Context created |
| `{"generation_started": true, "context_id": "ctx1", "chunk_id": 0, "text": "..."}` | Generation started |
| `{"audio": "base64...", "enc": "pcm_s16le", "context_id": "ctx1", "idx": 0, "sr": 24000, "samples": 4800, "chunk_id": 0}` | Audio chunk |
| `{"chunk_complete": true, "context_id": "ctx1", "chunk_id": 0, "audio_seconds": 1.2, "gen_ms": 150}` | Chunk complete |
| `{"word_timestamps": [...], "context_id": "ctx1", "chunk_id": 0}` | Word-level time alignments (when enabled) |
| `{"is_final": true, "context_id": "ctx1"}` | All generation complete for context |
| `{"context_closed": true, "context_id": "ctx1"}` | Context closed |
| `{"session_closed": true, "total_audio_seconds": 5.4}` | Session ended |
Voice Settings
When creating a context, pass voice settings as a nested object:

Session-Level Config
These options can be set on any message and apply to the entire session:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_id | string | kugel-1-turbo | Model to use for generation |
| sample_rate | integer | 24000 | Output sample rate in Hz. Options: 8000, 16000, 22050, 24000 |
| normalize | boolean | true | Enable text normalization |
| language | string | - | ISO 639-1 language code for normalization |
| word_timestamps | boolean | false | Enable word-level timestamp alignment |
Example
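A two-speaker sketch built from the message shapes in the tables above. The message fields (`text`, `context_id`, `voice_settings`, `flush`, `close_context`, `close_socket`) come directly from the client-to-server table; only the transport code around them is omitted, and the voice IDs are made up.

```python
# Multi-context message builders. Field names match the
# client-to-server table; the voice IDs are placeholders.
import json

def init_context(context_id: str, voice_id: int) -> str:
    """Initialize a context with a voice (note the single-space text)."""
    return json.dumps({"text": " ", "context_id": context_id,
                       "voice_settings": {"voice_id": voice_id}})

def send_text(context_id: str, text: str, flush: bool = False) -> str:
    msg = {"text": text, "context_id": context_id}
    if flush:
        msg["flush"] = True  # send text and flush buffer in one message
    return json.dumps(msg)

def close_context(context_id: str) -> str:
    return json.dumps({"close_context": True, "context_id": context_id})

# Two speakers interleaved over one connection (limit: 5 contexts):
outbound = [
    init_context("agent", voice_id=123),
    init_context("caller", voice_id=456),
    send_text("agent", "How can I help you today?", flush=True),
    send_text("caller", "I'd like to book a table.", flush=True),
    close_context("agent"),
    close_context("caller"),
    json.dumps({"close_socket": True}),  # end the whole session
]
```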
Limits
- Maximum 5 concurrent contexts per connection
- Contexts auto-close after 20 seconds of inactivity
Response Fields
Audio Chunk Fields (WebSocket)
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded PCM16 audio data |
| enc | string | Audio encoding (always pcm_s16le) |
| idx | integer | Chunk index (0-based) |
| sr | integer | Sample rate in Hz |
| samples | integer | Number of samples in this chunk |
| chunk_id | integer | Text chunk ID (present on /ws/tts/stream and /ws/tts/multi) |
| context_id | string | Context identifier (present on /ws/tts/multi) |
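To show how the chunk fields fit together, the snippet below decodes a synthetic audio-chunk message. The field names match the table above; the message itself is fabricated for illustration.

```python
# Decoding an audio-chunk message. Field names match the table above;
# the message is synthetic, built here only for illustration.
import base64
import json
import struct

raw_pcm = b"\x00\x00\x01\x00\xff\xff"   # three little-endian s16 samples
message = json.dumps({
    "audio": base64.b64encode(raw_pcm).decode(),
    "enc": "pcm_s16le",
    "idx": 0,
    "sr": 24000,
    "samples": 3,
})

chunk = json.loads(message)
pcm = base64.b64decode(chunk["audio"])
assert len(pcm) == chunk["samples"] * 2     # 2 bytes per s16 sample

samples = struct.unpack("<%dh" % chunk["samples"], pcm)
# samples == (0, 1, -1)
```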
Streaming Stats
| Field | Type | Description |
|---|---|---|
| final | boolean | Indicates generation complete |
| chunks | integer | Number of chunks generated |
| total_samples | integer | Total audio samples generated |
| dur_ms | number | Total audio duration in ms |
| gen_ms | number | Total generation time in ms |
| rtf | number | Real-time factor (gen_ms / dur_ms) |
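Worked example of the stats fields: the `rtf` formula (`gen_ms / dur_ms`) and the field names come from the table above; the numeric values are made up for illustration.

```python
# Computing the real-time factor from a final stats message.
# Field names and the rtf formula come from the table above;
# the numbers are illustrative.
stats = {"final": True, "chunks": 3, "total_samples": 120000,
         "dur_ms": 5000.0, "gen_ms": 750.0}

rtf = stats["gen_ms"] / stats["dur_ms"]
# rtf == 0.15: audio was generated well below real time

# dur_ms is consistent with total_samples at the native 24 kHz rate:
dur_check_ms = stats["total_samples"] / 24000 * 1000
# dur_check_ms == 5000.0
```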