for chunk in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-1-turbo",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
await client.tts.stream(
  { text: 'Hello, this is streaming audio.', modelId: 'kugel-1-turbo' },
  { onChunk: (chunk) => playAudio(chunk.audio) },
);
client.tts().stream(
    GenerateRequest.builder("Hello, this is streaming audio.")
        .modelId("kugel-1-turbo")
        .language("en")
        .build(),
    new StreamCallbacks() {
        @Override
        public void onChunk(AudioChunk chunk) {
            playAudio(chunk.getAudio());
        }
    });
# Stream audio and pipe to ffplay for real-time playback
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is streaming audio.",
    "model_id": "kugel-1-turbo"
  }' \
  --no-buffer | ffplay -f s16le -ar 24000 -ac 1 -nodisp -

# Or save to file
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is streaming audio.",
    "model_id": "kugel-1-turbo"
  }' \
  --output output.pcm
Stream text token-by-token as it arrives from an LLM. Let the server handle chunking at sentence boundaries — do not flush on every sentence from the client.
Python
JavaScript
Java
cURL
async def stream_from_llm(llm_response):
    async with client.tts.streaming_session(
        voice_id=123,
        model_id="kugel-1-turbo",
        auto_mode=True,  # start at first sentence boundary
        chunk_length_schedule=[50, 100, 150, 250],  # low-latency schedule
    ) as session:
        async for token in llm_response:
            async for chunk in session.send(token):
                play_audio(chunk.audio)
        # Flush remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)
try (StreamingSession session = client.streamingSession(
        StreamConfig.builder()
            .voiceId(123)
            .modelId("kugel-1-turbo")
            .autoMode(true)
            .chunkLengthSchedule(List.of(50, 100, 150, 250))
            .language("en")
            .build())) {
    // Stream tokens as they arrive from your LLM
    for (String token : llmResponse) {
        session.send(token, false);
    }
    // Close flushes the remaining buffer automatically
}
Token-by-token LLM streaming requires a persistent WebSocket connection,
which is not supported by cURL. Use an SDK for this pattern, or connect
to the raw WebSocket API with a WebSocket client
like websocat.
Do not flush on every sentence from the client. Calling send(token, flush=True) per sentence bypasses the server’s semantic chunking, forces a cold model prefill on every segment (adding 150–400 ms), and makes latency worse, not better. Use autoMode / chunkLengthSchedule and let the server decide boundaries.
Request word-level time alignments alongside streaming audio:
Python
JavaScript
Java
cURL
for chunk in client.tts.stream(
    text="Hello, this is streaming with timestamps.",
    model_id="kugel-1-turbo",
    word_timestamps=True,
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
    elif isinstance(chunk, list):
        # Word timestamps arrive as a list of WordTimestamp objects
        for ts in chunk:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")
await client.tts.stream(
  {
    text: 'Hello, this is streaming with timestamps.',
    modelId: 'kugel-1-turbo',
    wordTimestamps: true,
  },
  {
    onChunk: (chunk) => playAudio(chunk.audio),
    onWordTimestamps: (timestamps) => {
      for (const ts of timestamps) {
        console.log(`${ts.word}: ${ts.startMs}-${ts.endMs}ms`);
      }
    },
  },
);
client.tts().stream(
    GenerateRequest.builder("Hello, this is streaming with timestamps.")
        .modelId("kugel-1-turbo")
        .language("en")
        .wordTimestamps(true)
        .build(),
    new StreamCallbacks() {
        @Override
        public void onChunk(AudioChunk chunk) {
            playAudio(chunk.getAudio());
        }

        @Override
        public void onWordTimestamps(List<WordTimestamp> timestamps) {
            for (WordTimestamp ts : timestamps) {
                System.out.printf("%s: %d-%dms%n",
                    ts.getWord(), ts.getStartMs(), ts.getEndMs());
            }
        }
    });
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is streaming with timestamps.",
    "model_id": "kugel-1-turbo",
    "word_timestamps": true
  }' \
  --output output.pcm
For real-time word timestamp events during streaming, use the
WebSocket API or an SDK.
The REST endpoint returns timestamps in the response alongside the audio.
Word timestamps arrive ~50-200ms after the corresponding audio chunk. They add no extra audio latency.
Use <spell> tags to spell out text letter by letter (requires normalize: true):
Python
JavaScript
Java
cURL
# Spell out email addresses, codes, or abbreviations
text = "Contact us at <spell>hello@kugelaudio.com</spell> for help."
for chunk in client.tts.stream(
    text=text,
    model_id="kugel-1-turbo",
    normalize=True,
    language="en",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
# Output: "Contact us at H, E, L, L, O, at, K, U, G, E, L..."
// Spell out email addresses, codes, or abbreviations
const text = 'Contact us at <spell>hello@kugelaudio.com</spell> for help.';
await client.tts.stream(
  {
    text,
    modelId: 'kugel-1-turbo',
    normalize: true,
    language: 'en',
  },
  { onChunk: (chunk) => playAudio(chunk.audio) },
);
client.tts().stream(
    GenerateRequest.builder(
            "Contact us at <spell>hello@kugelaudio.com</spell> for help.")
        .modelId("kugel-1-turbo")
        .normalize(true)
        .language("en")
        .build(),
    new StreamCallbacks() {
        @Override
        public void onChunk(AudioChunk chunk) {
            playAudio(chunk.getAudio());
        }
    });
curl -X POST https://api.kugelaudio.com/v1/tts/generate \
  -H "Authorization: Bearer $KUGELAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Contact us at <spell>hello@kugelaudio.com</spell> for help.",
    "model_id": "kugel-1-turbo",
    "normalize": true,
    "language": "en"
  }' \
  --output output.pcm
Streaming with Spell Tags: When streaming text token-by-token, spell tags that span multiple
chunks are automatically handled. The system buffers text until the closing </spell> tag arrives
before generating audio. If the stream ends unexpectedly, incomplete tags are auto-closed.
Model recommendation: For clearer letter-by-letter pronunciation, use modelId: "kugel-1" instead of kugel-1-turbo.
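The server already auto-closes incomplete tags at end of stream, but if you assemble text client-side before a final flush, a defensive sanitizer is easy to sketch (an illustrative helper, not part of the SDK):

```python
def close_spell_tags(text: str) -> str:
    """Append a closing </spell> for any <spell> left open in the buffer."""
    open_count = text.count("<spell>") - text.count("</spell>")
    return text + "</spell>" * max(0, open_count)

close_spell_tags("Code: <spell>AB12")          # → 'Code: <spell>AB12</spell>'
close_spell_tags("Done: <spell>X</spell>")     # already balanced, unchanged
```

This only guards against a stream that ends mid-tag; tags split across `send()` calls need no client-side handling, since the server buffers until the closing tag arrives.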
The server auto-chunks text at sentence boundaries. Two parameters let you control how eagerly it starts:
| Parameter | Type | Effect |
|---|---|---|
| `chunk_length_schedule` / `chunkLengthSchedule` | `list[int]` | Minimum characters buffered before each successive chunk is emitted. Smaller = faster TTFA. Default: `[5, 80, 150, 250]` |
| `auto_mode` / `autoMode` | `bool` | Start at the very first clean sentence boundary. Equivalent to ElevenLabs `auto_mode=true`. |
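To build intuition for how the schedule gates emission, here is a client-side simulation of the thresholding alone (a sketch, not the server's actual algorithm, which additionally snaps chunks to sentence boundaries; the function name is illustrative):

```python
def simulate_schedule(tokens, schedule=(5, 80, 150, 250)):
    """Emit a chunk once the buffer reaches the current schedule threshold.

    The last schedule value repeats for every subsequent chunk.
    """
    buffer, chunks, i = "", [], 0
    for token in tokens:
        buffer += token
        threshold = schedule[min(i, len(schedule) - 1)]
        if len(buffer) >= threshold:
            chunks.append(buffer)
            buffer, i = "", i + 1
    if buffer:  # flush whatever remains at end of stream
        chunks.append(buffer)
    return chunks

chunks = simulate_schedule(["Hello! "] * 20, schedule=(5, 80, 150, 250))
# The first chunk is emitted after only ~5 characters for fast TTFA;
# later chunks wait for progressively larger buffers.
```

Running this shows why the schedule front-loads small values: the first audio can start almost immediately, while later chunks grow large enough for natural prosody.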
# Low-latency preset (voice assistants, chatbots)
async with client.tts.streaming_session(
    voice_id=123,
    auto_mode=True,
    chunk_length_schedule=[50, 100, 150, 250],
) as session:
    async for token in llm_stream:
        async for chunk in session.send(token):
            play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

# High-quality preset (narration, long-form)
async with client.tts.streaming_session(
    voice_id=123,
    chunk_length_schedule=[120, 200, 300],
) as session:
    ...
For real-time voice agents, per-segment latency (time from sentence boundary to first audio of the next sentence) matters as much as initial TTFA. Two parameters let you trade audio quality for speed:
| Parameter | Type | Effect |
|---|---|---|
| `optimize_streaming_latency` / `optimizeStreamingLatency` | `bool` | Halve the default diffusion steps for faster per-segment audio. Default: `false` |
| `num_diffusion_steps` / `numDiffusionSteps` | `int` | Explicit override for diffusion denoising steps (1-50). Lower = faster but lower quality. |
# Fastest per-segment latency (voice agents, real-time conversations)
async with client.tts.streaming_session(
    voice_id=123,
    auto_mode=True,
    optimize_streaming_latency=True,
) as session:
    async for token in llm_stream:
        async for chunk in session.send(token):
            play_audio(chunk.audio)
    async for chunk in session.flush():
        play_audio(chunk.audio)

# Fine-tuned control: explicit diffusion steps
async with client.tts.streaming_session(
    voice_id=123,
    auto_mode=True,
    num_diffusion_steps=5,  # fewer steps = lower latency
) as session:
    ...
optimize_streaming_latency typically reduces per-segment latency by ~40-50% with a modest quality trade-off that is acceptable for real-time voice conversations. For maximum quality (narration, podcasts), leave it disabled.
For advanced use cases like multi-speaker conversations or pre-buffering audio, use the multi-context WebSocket endpoint. This allows managing up to 5 independent audio streams over a single connection.
When word_timestamps: true is set in the initial configuration, the server performs forced alignment on each generated audio chunk and sends a word_timestamps message shortly after the corresponding audio. Each timestamp contains:
| Field | Type | Description |
|---|---|---|
| `word` | `string` | The word the timing refers to |
| `start_ms` | `int` | Start time in milliseconds (relative to chunk start) |
| `end_ms` | `int` | End time in milliseconds (relative to chunk start) |
| `char_start` | `int` | Start character offset in the original text |
| `char_end` | `int` | End character offset in the original text |
| `score` | `float` | Alignment confidence score (0.0 - 1.0) |
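Because times are relative to the chunk, a player typically adds the chunk's position on the playback timeline before using them (for captions, highlighting, etc.). A minimal sketch, assuming a dict-shaped payload with the fields above (the SDKs deliver typed objects instead):

```python
def to_absolute(timestamps, chunk_offset_ms):
    """Convert chunk-relative word timings to absolute timeline positions."""
    return [
        {
            "word": ts["word"],
            "start_ms": chunk_offset_ms + ts["start_ms"],
            "end_ms": chunk_offset_ms + ts["end_ms"],
        }
        for ts in timestamps
    ]

# Example: a chunk that begins 5 seconds into the stream
words = to_absolute(
    [{"word": "Hello", "start_ms": 0, "end_ms": 320},
     {"word": "world", "start_ms": 350, "end_ms": 610}],
    chunk_offset_ms=5000,
)
# words[0]["start_ms"] == 5000, words[1]["end_ms"] == 5610
```

Track `chunk_offset_ms` yourself by accumulating the duration of each audio chunk you enqueue for playback.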
Word timestamps add no extra audio latency. The alignment model runs on the same GPU as TTS and timestamps arrive ~50-200ms after the corresponding audio chunk.
For latency-critical applications, pre-establish WebSocket connections:
# Pre-warm the connection
session = await client.tts.streaming_session(voice_id=123).__aenter__()

# Later, when you need to generate
async for chunk in session.send("Hello!"):
    play_audio(chunk.audio)
The server automatically chunks text at natural boundaries. For custom control:
# Let the server handle chunking (recommended)
await session.send(long_text)

# Or chunk manually for more control
for sentence in split_sentences(long_text):
    await session.send(sentence)
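The `split_sentences` helper above is left to you; a minimal regex-based version is shown below (naive about abbreviations such as "Dr.", so treat it as a starting point rather than a production splitter):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

split_sentences("Hi there! How are you? Fine.")
# → ['Hi there!', 'How are you?', 'Fine.']
```

For production use, a proper sentence tokenizer (or simply letting the server chunk, as recommended) avoids mis-splits on abbreviations, decimals, and ellipses.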