When word_timestamps: true is set, the server performs forced alignment on each generated audio chunk and sends a word_timestamps message shortly after the corresponding audio. This is useful for barge-in handling (“which word was the agent on when the user interrupted?”), subtitle synchronization, and lip-sync.
Word timestamps add no extra audio latency. The alignment model runs on the same GPU as TTS, and timestamps arrive ~50–200 ms after the corresponding audio chunk.
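For the barge-in case, a minimal sketch of the lookup might look like the following. It assumes the word timings have already been placed on a single playback timeline and are sorted by start time; `TimedWord` and `word_at` are hypothetical names for illustration, not part of the SDK. Because timestamps trail their audio by up to ~200 ms, the most recent chunk may not have alignments yet when the interruption happens, so the function falls back to the last word known to have started.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedWord:
    """Hypothetical container: one word with times on the playback timeline."""
    word: str
    start_ms: int
    end_ms: int


def word_at(words: Sequence[TimedWord], position_ms: int) -> Optional[TimedWord]:
    """Return the last word that started at or before position_ms.

    Assumes `words` is sorted by start_ms. Returns None if playback
    had not yet reached the first word.
    """
    starts = [w.start_ms for w in words]
    idx = bisect_right(starts, position_ms) - 1
    return words[idx] if idx >= 0 else None


words = [TimedWord("Hello", 0, 320), TimedWord("world", 350, 680)]
interrupted_on = word_at(words, 400)  # user interrupted 400 ms into playback
print(interrupted_on.word if interrupted_on else "(nothing spoken yet)")
```

A position that falls in the gap between two words resolves to the earlier word, which is usually what a transcript-truncation step wants.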

Streaming with word timestamps

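# `client` is an already-configured SDK client and `play_audio` is your own
# playback function; both are assumed to be defined elsewhere.
for chunk in client.tts.stream(
    text="Hello, this is streaming with timestamps.",
    model_id="kugel-3",
    word_timestamps=True,
):
    if hasattr(chunk, "audio"):
        # Audio chunk: hand the audio to the player
        play_audio(chunk.audio)
    elif isinstance(chunk, list):
        # Word timestamps arrive as a list of WordTimestamp objects
        for ts in chunk:
            print(f"{ts.word}: {ts.start_ms}-{ts.end_ms}ms")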

The timestamp payload

Each word_timestamps message carries the alignments for one audio chunk:
{
  "word_timestamps": [
    {"word": "Hello", "start_ms": 0, "end_ms": 320, "char_start": 0, "char_end": 5, "score": 0.98},
    {"word": "world", "start_ms": 350, "end_ms": 680, "char_start": 7, "char_end": 12, "score": 0.95}
  ],
  "chunk_id": 0
}
| Field | Type | Description |
| --- | --- | --- |
| word | string | The aligned word |
| start_ms | int | Start time in milliseconds (relative to chunk start) |
| end_ms | int | End time in milliseconds (relative to chunk start) |
| char_start | int | Start character offset in the original text |
| char_end | int | End character offset in the original text |
| score | float | Alignment confidence score (0.0 - 1.0) |
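The character offsets make it possible to highlight the current word in the original text, for example in a subtitle view. The sketch below assumes the sample payload was generated from the text "Hello, world" and that char_end is exclusive; both are inferred from the sample values rather than stated by the API. `highlight` is a hypothetical helper.

```python
def highlight(text: str, char_start: int, char_end: int) -> str:
    """Wrap text[char_start:char_end] in brackets (char_end treated as exclusive)."""
    return f"{text[:char_start]}[{text[char_start:char_end]}]{text[char_end:]}"


# Assumed original text for the sample payload shown above.
text = "Hello, world"
payload_words = [
    {"word": "Hello", "char_start": 0, "char_end": 5},
    {"word": "world", "char_start": 7, "char_end": 12},
]

for w in payload_words:
    print(highlight(text, w["char_start"], w["char_end"]))
# prints "[Hello], world" then "Hello, [world]"
```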
Timestamps are relative to the start of their chunk. To place words on a global timeline, add the total duration of all preceding chunks to each start_ms and end_ms.
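The accumulation can be sketched as below. The chunk duration is derived here from the raw audio length, assuming 16-bit mono PCM at 24 kHz; that audio format is an assumption for illustration only, so substitute whatever format your stream actually uses. `pcm_duration_ms`, `to_global_timeline`, and `GlobalWord` are hypothetical helpers, not SDK functions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class GlobalWord:
    word: str
    start_ms: int
    end_ms: int


def pcm_duration_ms(audio: bytes, sample_rate: int = 24000,
                    bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio. The defaults are assumptions, not API facts."""
    return 1000 * len(audio) / (sample_rate * bytes_per_sample * channels)


def to_global_timeline(
    chunks: Iterable[Tuple[float, List[Dict]]],
) -> Iterator[GlobalWord]:
    """Shift chunk-relative timestamps onto one timeline.

    `chunks` yields (chunk_duration_ms, word dicts) in playback order.
    """
    offset_ms = 0.0
    for duration_ms, words in chunks:
        for w in words:
            yield GlobalWord(
                w["word"],
                round(offset_ms + w["start_ms"]),
                round(offset_ms + w["end_ms"]),
            )
        # Every later chunk starts after all the audio played so far.
        offset_ms += duration_ms


chunk0_audio = bytes(48000)  # 1000 ms under the assumed format
chunk1_audio = bytes(24000)  # 500 ms under the assumed format
timeline = list(to_global_timeline([
    (pcm_duration_ms(chunk0_audio), [
        {"word": "Hello", "start_ms": 0, "end_ms": 320},
        {"word": "world", "start_ms": 350, "end_ms": 680},
    ]),
    (pcm_duration_ms(chunk1_audio), [
        {"word": "again", "start_ms": 40, "end_ms": 300},
    ]),
]))
for w in timeline:
    print(f"{w.word}: {w.start_ms}-{w.end_ms}ms")
```

Keeping the running offset as a float and rounding only when emitting each word avoids drift when chunk durations are not whole milliseconds.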

Where timestamps are available

| Surface | How they arrive |
| --- | --- |
| SDK streaming (stream, streamingSession) | onWordTimestamps callback (JS/Java) or timestamp items in the chunk iterator (Python) |
| SDK generate() (Python) | AudioResponse.word_timestamps (the SDK streams over WebSocket internally) |
| /ws/tts, /ws/tts/stream, and /ws/tts/multi | word_timestamps frames interleaved with audio frames (see the reference) |
| REST /v1/tts/generate | Not supported: the field is rejected with 422; use a WebSocket endpoint or an SDK |
LiveKit uses these alignments natively for transcript sync — see the LiveKit integration.