```python
for chunk in client.tts.stream(
    text="Hello, this is streaming audio.",
    model_id="kugel-1-turbo",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)
```
```javascript
await client.tts.stream(
  { text: 'Hello, this is streaming audio.', modelId: 'kugel-1-turbo' },
  { onChunk: (chunk) => playAudio(chunk.audio) },
);
```
Stream text token-by-token as it arrives from an LLM:
Python
JavaScript
```python
async def stream_from_llm(llm_response):
    async with client.tts.streaming_session(
        voice_id=123,
        cfg_scale=2.0,
        flush_timeout_ms=500,  # Auto-flush after 500ms of silence
    ) as session:
        # Stream tokens as they arrive
        async for token in llm_response:
            async for chunk in session.send(token):
                play_audio(chunk.audio)
        # Flush remaining text
        async for chunk in session.flush():
            play_audio(chunk.audio)
```
```javascript
// Using WebSocket directly for LLM streaming
const ws = new WebSocket('wss://api.kugelaudio.com/ws/tts/stream');

ws.onopen = () => {
  // Initial config
  ws.send(JSON.stringify({
    voice_id: 123,
    cfg_scale: 2.0,
  }));
};

// Send tokens as they arrive from LLM
for await (const token of llmResponse) {
  ws.send(JSON.stringify({ text: token }));
}

// Signal end
ws.send(JSON.stringify({ flush: true }));
ws.send(JSON.stringify({ close: true }));
```
Use <spell> tags to spell out text letter by letter (requires normalize: true):
Python
JavaScript
```python
# Spell out email addresses, codes, or abbreviations
text = "Contact us at <spell>[email protected]</spell> for help."

for chunk in client.tts.stream(
    text=text,
    model_id="kugel-1-turbo",
    normalize=True,
    language="en",
):
    if hasattr(chunk, 'audio'):
        play_audio(chunk.audio)

# Output: "Contact us at S, U, P, P, O, R, T, at, K, U, G, E, L..."
```
```javascript
// Spell out email addresses, codes, or abbreviations
const text = 'Contact us at <spell>[email protected]</spell> for help.';

await client.tts.stream(
  {
    text,
    modelId: 'kugel-1-turbo',
    normalize: true,
    language: 'en',
  },
  { onChunk: (chunk) => playAudio(chunk.audio) },
);
```
Streaming with Spell Tags: When streaming text token-by-token, spell tags that span multiple
chunks are automatically handled. The system buffers text until the closing </spell> tag arrives
before generating audio. If the stream ends unexpectedly, incomplete tags are auto-closed.
Model recommendation: For clearer letter-by-letter pronunciation, use modelId: "kugel-1" instead of kugel-1-turbo.
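The buffering behavior described above can be sketched in a few lines. This is a client-side illustration only (the server implements this internally), and it assumes that the literal tag markers arrive intact within a single token rather than split across tokens:

```python
# Illustration of spell-tag buffering: hold text from an opening <spell>
# tag until the closing </spell> arrives, and auto-close an unterminated
# tag when the stream ends. Client-side sketch, not the server's code.
def buffer_spell_tags(tokens):
    """Yield text spans that are safe to synthesize."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Inside an open <spell> tag with no closing tag yet? Keep buffering.
        if "<spell>" in buffer and "</spell>" not in buffer:
            continue
        yield buffer
        buffer = ""
    if buffer:
        # Stream ended mid-tag: auto-close the incomplete tag.
        if "<spell>" in buffer and "</spell>" not in buffer:
            buffer += "</spell>"
        yield buffer
```

For example, the tokens `["Email: ", "<spell>", "abc", "</spell>", " ok"]` yield three spans, with `<spell>abc</spell>` emitted only once the closing tag has arrived.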
Buffer text until sentence boundaries for more natural speech:
```python
import re

def split_sentences(text: str) -> list[str]:
    """Split text into sentences."""
    return re.split(r'(?<=[.!?])\s+', text)

async def stream_by_sentence(llm_response):
    buffer = ""
    async with client.tts.streaming_session(voice_id=123) as session:
        async for token in llm_response:
            buffer += token
            # Check for complete sentences
            sentences = split_sentences(buffer)
            # Process all complete sentences
            for sentence in sentences[:-1]:
                async for chunk in session.send(sentence + " "):
                    play_audio(chunk.audio)
            # Keep incomplete sentence in buffer
            buffer = sentences[-1] if sentences else ""
        # Flush remaining buffer
        if buffer:
            async for chunk in session.send(buffer):
                play_audio(chunk.audio)
        async for chunk in session.flush():
            play_audio(chunk.audio)
```
For advanced use cases like multi-speaker conversations or pre-buffering audio, use the multi-context WebSocket endpoint. This allows managing up to 5 independent audio streams over a single connection.
When word_timestamps: true is set in the initial configuration, the server performs forced alignment on each generated audio chunk and sends a word_timestamps message shortly after the corresponding audio. Each timestamp contains:
start_ms
int
Start time in milliseconds (relative to chunk start)
end_ms
int
End time in milliseconds (relative to chunk start)
char_start
int
Start character offset in the original text
char_end
int
End character offset in the original text
score
float
Alignment confidence score (0.0 - 1.0)
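The character offsets let you recover the spoken word strings by slicing the original input text. A small helper (the message shape is assumed from the fields listed above):

```python
# Slice the original input text using the char_start/char_end offsets from
# a word_timestamps message. The dict shape is assumed from the field list
# above.
def words_with_timing(text: str, timestamps: list[dict]) -> list[tuple[str, int, int]]:
    return [
        (text[ts["char_start"]:ts["char_end"]], ts["start_ms"], ts["end_ms"])
        for ts in timestamps
    ]

text = "Hello world"
timestamps = [
    {"start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98},
    {"start_ms": 360, "end_ms": 700, "char_start": 6, "char_end": 11, "score": 0.95},
]
print(words_with_timing(text, timestamps))
# [('Hello', 0, 320), ('world', 360, 700)]
```

This is useful for driving captions or karaoke-style highlighting in sync with playback.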
Word timestamps add no extra audio latency. The alignment model runs on the same GPU as TTS and timestamps arrive ~50-200ms after the corresponding audio chunk.
For latency-critical applications, pre-establish WebSocket connections:
```python
# Pre-warm the connection
session = await client.tts.streaming_session(voice_id=123).__aenter__()

# Later, when you need to generate
async for chunk in session.send("Hello!"):
    play_audio(chunk.audio)
```
The server automatically chunks text at natural boundaries. For custom control:
```python
# Let the server handle chunking (recommended)
await session.send(long_text)

# Or chunk manually for more control
for sentence in split_sentences(long_text):
    await session.send(sentence)
```