When word_timestamps: true is set, the server performs forced alignment on
each generated audio chunk and sends a word_timestamps message shortly after
the corresponding audio. This is useful for barge-in handling (“which word was
the agent on when the user interrupted?”), subtitle synchronization, and lip-sync.
Word timestamps add no extra audio latency. The alignment model runs on the
same GPU as TTS and timestamps arrive ~50–200 ms after the corresponding
audio chunk.
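As a rough illustration of the barge-in case, the sketch below looks up which word was playing at a given playback position. It assumes the word entries have already been placed on a single playback timeline and are sorted by start time. `TimedWord` and `word_at` are hypothetical names for this example, not part of the SDK.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedWord:
    """Hypothetical container: one word on the global playback timeline."""
    word: str
    start_ms: int
    end_ms: int


def word_at(words: Sequence[TimedWord], position_ms: int) -> Optional[TimedWord]:
    """Return the word that started most recently at or before position_ms.

    If the position falls in a gap between two words, the earlier word is
    returned. Returns None if playback has not reached the first word.
    Assumes `words` is sorted by start_ms.
    """
    starts = [w.start_ms for w in words]
    # Index of the last word whose start is <= position_ms.
    i = bisect_right(starts, position_ms) - 1
    return words[i] if i >= 0 else None


words = [
    TimedWord("Hello", 0, 320),
    TimedWord("world", 350, 680),
]
interrupted_on = word_at(words, 400)
print(interrupted_on.word if interrupted_on else "(before first word)")
```

Because timestamps can arrive slightly after their audio, a lookup made at the moment of interruption may not yet see the alignment for the most recent chunk, so a caller may want to repeat the lookup once that message arrives.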
Streaming with word timestamps
Python

    for chunk in client.tts.stream(
        text="Hello, this is streaming with timestamps.",
        model_id="kugel-3",
        word_timestamps=True,
    ):
        if hasattr(chunk, 'audio'):
            play_audio(chunk.audio)
        elif isinstance(chunk, list):
            # Word timestamps arrive as a list of WordTimestamp objects
            for ts in chunk:
                print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")

JavaScript

    await client.tts.stream(
      {
        text: 'Hello, this is streaming with timestamps.',
        modelId: 'kugel-3',
        wordTimestamps: true,
      },
      {
        onChunk: (chunk) => playAudio(chunk.audio),
        onWordTimestamps: (timestamps) => {
          for (const ts of timestamps) {
            console.log(`${ts.word}: ${ts.startMs}-${ts.endMs}ms`);
          }
        },
      }
    );

Java

    client.tts().stream(
        GenerateRequest.builder("Hello, this is streaming with timestamps.")
            .modelId("kugel-3")
            .language("en")
            .wordTimestamps(true)
            .build(),
        new StreamCallbacks() {
            @Override
            public void onChunk(AudioChunk chunk) {
                playAudio(chunk.getAudio());
            }

            @Override
            public void onWordTimestamps(List<WordTimestamp> timestamps) {
                for (WordTimestamp ts : timestamps) {
                    System.out.printf("%s: %d-%dms%n",
                        ts.getWord(), ts.getStartMs(), ts.getEndMs());
                }
            }
        }
    );

WebSocket (raw)

Word timestamps are only available on the WebSocket endpoints. The REST
/v1/tts/generate endpoint does not accept word_timestamps (strict
request validation returns 422 Unprocessable Entity). Without an SDK,
connect to a WebSocket endpoint directly:

    wscat -c "wss://api.kugelaudio.com/ws/tts?api_key=YOUR_API_KEY"
    > {"text": "Hello, this is streaming with timestamps.", "voice_id": 978, "model_id": "kugel-3", "word_timestamps": true}

word_timestamps frames arrive interleaved with the audio frames; see
the /ws/tts reference and
/ws/tts/stream reference.
The timestamp payload
Each word_timestamps message carries the alignments for one audio chunk:
    {
      "word_timestamps": [
        {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98},
        {"word": "world", "start_ms": 350, "end_ms": 680, "char_start": 7, "char_end": 12, "score": 0.95}
      ],
      "chunk_id": 0
    }
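
For clients that read the WebSocket directly, a message like the one above can be turned into typed objects with a small parser. This is a minimal sketch based only on the fields shown in the example payload; `WordAlignment` and `parse_word_timestamps` are hypothetical names, and the handling of other frame types (such as audio) is left out because their format is not described here.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class WordAlignment:
    """Hypothetical local type mirroring one entry of the payload."""
    word: str
    start_ms: int
    end_ms: int
    char_start: int
    char_end: int
    score: float


def parse_word_timestamps(raw: str) -> Optional[Tuple[Optional[int], List[WordAlignment]]]:
    """Return (chunk_id, alignments) if raw is a word_timestamps message.

    Returns None for anything else (non-JSON text or other message types).
    """
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(msg, dict) or "word_timestamps" not in msg:
        return None
    alignments = [
        WordAlignment(
            word=entry["word"],
            start_ms=int(entry["start_ms"]),
            end_ms=int(entry["end_ms"]),
            char_start=int(entry["char_start"]),
            char_end=int(entry["char_end"]),
            score=float(entry["score"]),
        )
        for entry in msg["word_timestamps"]
    ]
    return msg.get("chunk_id"), alignments


sample = (
    '{"word_timestamps": ['
    '{"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98},'
    '{"word": "world", "start_ms": 350, "end_ms": 680, "char_start": 7, "char_end": 12, "score": 0.95}'
    '], "chunk_id": 0}'
)
parsed = parse_word_timestamps(sample)
if parsed is not None:
    chunk_id, alignments = parsed
    for a in alignments:
        print(f"chunk {chunk_id}: {a.word} {a.start_ms}-{a.end_ms}ms")
```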
| Field | Type | Description |
|---|---|---|
| word | string | The aligned word |
| start_ms | int | Start time in milliseconds (relative to chunk start) |
| end_ms | int | End time in milliseconds (relative to chunk start) |
| char_start | int | Start character offset in the original text |
| char_end | int | End character offset in the original text |
| score | float | Alignment confidence score (0.0 - 1.0) |
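
The character offsets and score can be combined to point back at the source text, for example to highlight words the aligner was unsure about. The sketch below assumes `char_end` is exclusive, which is what the example payload suggests for the text "Hello, world" but is not stated explicitly. The function name and the threshold value are illustrative choices, not recommendations from the API.

```python
from typing import Dict, List, Tuple


def flag_low_confidence(
    text: str, entries: List[Dict], threshold: float
) -> List[Tuple[str, float]]:
    """Return (source substring, score) for entries scoring below threshold.

    Assumes char_start/char_end index into `text` with an exclusive end.
    """
    flagged = []
    for entry in entries:
        if entry["score"] < threshold:
            span = text[entry["char_start"]:entry["char_end"]]
            flagged.append((span, entry["score"]))
    return flagged


text = "Hello, world"
entries = [
    {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98},
    {"word": "world", "start_ms": 350, "end_ms": 680, "char_start": 7, "char_end": 12, "score": 0.95},
]
# 0.96 is an arbitrary threshold chosen so the example flags one word.
print(flag_low_confidence(text, entries, threshold=0.96))
```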
Timestamps are relative to the start of their chunk — to place words on a
global timeline, accumulate the duration of previous chunks.
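
The accumulation can be sketched as follows. The code takes each chunk's duration and its chunk-relative entries in playback order and shifts the entries by the total duration of all earlier chunks. How a chunk's duration is obtained depends on the audio format: `pcm16_duration_ms` assumes raw 16-bit PCM and a 24000 Hz mono default, which is an assumption for this example and not something stated above. Both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def pcm16_duration_ms(num_bytes: int, sample_rate_hz: int = 24000, channels: int = 1) -> int:
    """Duration of a raw 16-bit PCM buffer in whole milliseconds.

    The format and default sample rate are assumptions for this sketch.
    """
    bytes_per_frame = 2 * channels  # 2 bytes per 16-bit sample
    return num_bytes * 1000 // (bytes_per_frame * sample_rate_hz)


def to_global_timeline(chunks: Iterable[Tuple[int, List[Dict]]]) -> List[Dict]:
    """Shift chunk-relative timestamps onto one timeline.

    `chunks` yields (chunk_duration_ms, entries) in playback order.
    """
    offset_ms = 0
    timeline = []
    for duration_ms, entries in chunks:
        for entry in entries:
            timeline.append({
                **entry,
                "start_ms": entry["start_ms"] + offset_ms,
                "end_ms": entry["end_ms"] + offset_ms,
            })
        # Later chunks start after everything played so far.
        offset_ms += duration_ms
    return timeline


chunks = [
    (pcm16_duration_ms(33600), [  # 33600 bytes is 700 ms under the assumed format
        {"word": "Hello", "start_ms": 0, "end_ms": 320},
        {"word": "world", "start_ms": 350, "end_ms": 680},
    ]),
    (500, [
        {"word": "again", "start_ms": 20, "end_ms": 400},
    ]),
]
for w in to_global_timeline(chunks):
    print(f"{w['word']}: {w['start_ms']}-{w['end_ms']}ms")
```

Rounding each chunk to whole milliseconds can accumulate small drift over long streams; keeping the running offset in samples and converting once per word avoids that.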
Where timestamps are available
| Surface | How they arrive |
|---|---|
| SDK streaming (stream, streamingSession) | onWordTimestamps callback (JS/Java) or timestamp items in the chunk iterator (Python) |
| SDK generate() (Python) | AudioResponse.word_timestamps; the SDK streams over WebSocket internally |
| /ws/tts, /ws/tts/stream, and /ws/tts/multi | word_timestamps frames interleaved with audio frames (reference) |
| REST /v1/tts/generate | Not supported; the field is rejected with 422. Use a WebSocket endpoint or an SDK |
LiveKit uses these alignments natively for transcript sync — see the
LiveKit integration.