Streaming Sessions - KugelAudio

For real-time LLM pipelines, use client.tts.streamingSession() instead of client.tts.stream(). The session endpoint (/ws/tts/stream) keeps a persistent WebSocket connection and accumulates LLM tokens server-side, starting generation at natural sentence boundaries.

Why not flush per sentence?

Calling send(token, flush=true) on every sentence feels intuitive, but it actually increases latency:

Each flush triggers a full model prefill (the fixed cost of loading context into the model).
The server’s KV cache cannot be reused across separate flushes, so each segment is cold.
Flushing this often adds avoidable latency on every sentence compared to letting the server batch (see Latency).

Let the server handle chunking via chunkLengthSchedule and autoMode.

Basic usage

const session = client.tts.streamingSession(
  {
    voiceId: 1071,
    modelId: 'kugel-3',
    // autoMode: emit at first sentence boundary (lowest TTFA)
    autoMode: true,
    chunkLengthSchedule: [50, 100, 150, 250],
  },
  {
    onChunk: (chunk) => playAudio(chunk.audio),
    onChunkComplete: (chunkId, audioSecs, genMs) => {
      console.log(`Chunk ${chunkId}: ${audioSecs.toFixed(2)}s audio in ${genMs}ms`);
    },
    onSessionClosed: (totalSecs) => {
      console.log(`Session complete: ${totalSecs.toFixed(2)}s total audio`);
    },
    onError: (err) => console.error('TTS error:', err),
  }
);

await session.connect(); // connect() returns a Promise; wait before sending

// Feed LLM tokens as they arrive
for await (const delta of openai.chat.completions.stream(...)) {
  const text = delta.choices[0]?.delta?.content;
  if (text) session.send(text);
}

// Flush remaining buffer and close
await session.close();
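
The token loop above can be wrapped in a small helper so the session is always closed, even if the LLM stream throws partway through. This is a minimal sketch, not part of the SDK: `TokenSink` and `pipeTokens` are hypothetical names, and the interface assumes only the `send` and `close` methods listed under Session methods.

```typescript
// Hypothetical structural type: the subset of StreamingSession used here.
interface TokenSink {
  send(text: string, flush?: boolean): void;
  close(): Promise<void>;
}

// Forward every non-empty token without flushing, then close the session.
// close() runs in `finally`, so the socket is released even when the
// token source throws midway. Returns the number of tokens forwarded.
async function pipeTokens(
  tokens: AsyncIterable<string | null | undefined>,
  sink: TokenSink,
): Promise<number> {
  let sent = 0;
  try {
    for await (const token of tokens) {
      if (!token) continue; // skip empty deltas
      sink.send(token); // no flush: the server picks the chunk boundaries
      sent += 1;
    }
  } finally {
    await sink.close();
  }
  return sent;
}
```

Because the helper only depends on a structural type, it can be exercised in unit tests with a fake sink instead of a live connection.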

Session Reuse

End a session without closing the WebSocket to avoid reconnection overhead (see Latency):

const session = client.tts.streamingSession(
  { voiceId: 1071 },
  { onChunk: (chunk) => playAudio(chunk.audio) }
);
await session.connect();

// Session 1
session.send('Hello from voice one.');
await session.endSession(); // Keeps WebSocket open

// Session 2 — no reconnection needed
session.updateConfig({ voiceId: 1072 });
session.send('Hello from voice two.');

await session.close(); // Closes session + WebSocket
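
The end, reconfigure, send sequence from the example can be folded into one helper for agents that switch voices between turns. This is a sketch under stated assumptions: `ReusableSession` and `startNextTurn` are hypothetical names, and the interface lists only the three methods the example above already uses.

```typescript
// Hypothetical structural type: the subset of StreamingSession used here.
interface ReusableSession {
  send(text: string, flush?: boolean): void;
  endSession(): Promise<void>;
  updateConfig(config: { voiceId?: number }): void;
}

// End the previous turn, optionally switch voice, then start the next turn
// on the same WebSocket. The order mirrors the example: endSession() first,
// updateConfig() second, send() last.
async function startNextTurn(
  session: ReusableSession,
  text: string,
  voiceId?: number,
): Promise<void> {
  await session.endSession(); // keeps the WebSocket open
  if (voiceId !== undefined) session.updateConfig({ voiceId });
  session.send(text);
}
```

Omitting `voiceId` keeps whatever configuration the session already has.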

Barge-in (interrupt the current turn)

When the end user speaks over the agent, call cancelCurrent() to stop generating the current turn immediately and drop any buffered/queued text — without closing the WebSocket. Unlike endSession(), no remaining text is flushed; the turn is abandoned. The socket stays open so you can send() the next turn right away.

const session = client.tts.streamingSession(
  { voiceId: 1071 },
  {
    onChunk: (chunk) => playAudio(chunk.audio),
    onInterrupted: () => stopLocalPlayback(),
  }
);
await session.connect();

session.send('This is a very long answer the user talks over');

// VAD detected the user speaking — barge in:
await session.cancelCurrent();

// Socket still open — next turn starts immediately:
session.send('Sure, what would you like instead?', true);

cancelCurrent() resolves once the server acknowledges (onInterrupted fires), or after a short quiet timeout if the server goes silent. Stop local playback as soon as you call it — a few in-flight frames may arrive before the acknowledgement. See Barge-in for the full protocol.
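
Since a few frames can still arrive after `cancelCurrent()` is called, one option is to gate playback on the client while the cancellation is pending. The sketch below is an illustration, not SDK code: `PlaybackGate`, `bargeIn`, and `Interruptible` are hypothetical names, and only `cancelCurrent()` comes from the documented API.

```typescript
// Hypothetical structural type: only cancelCurrent() is needed here.
interface Interruptible {
  cancelCurrent(): Promise<void>;
}

// Tracks whether incoming audio should be played or dropped.
class PlaybackGate {
  private open = true;
  accepts(): boolean { return this.open; }
  shut(): void { this.open = false; }
  reopen(): void { this.open = true; }
}

// Stop playback immediately, drop late frames while the server has not yet
// acknowledged, and reopen the gate once cancelCurrent() settles.
async function bargeIn(
  session: Interruptible,
  gate: PlaybackGate,
  stopLocalPlayback: () => void,
): Promise<void> {
  gate.shut(); // frames arriving from now on are ignored
  stopLocalPlayback(); // silence audio that is already queued locally
  try {
    await session.cancelCurrent();
  } finally {
    gate.reopen(); // audio for the next turn may play again
  }
}
```

In `onChunk`, the handler would then check `gate.accepts()` before calling `playAudio`. If `cancelCurrent()` resolves through the quiet timeout rather than an acknowledgement, stale frames could in principle still arrive afterwards, so this gate is a best-effort filter.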

Chunking presets

Preset	Config	Best for
Low-latency	`autoMode: true, chunkLengthSchedule: [50, 100, 150, 250]`	Voice assistants, chat bots
Balanced	`chunkLengthSchedule: [80, 150, 250]` (default)	General LLM streaming
High-quality	`chunkLengthSchedule: [120, 200, 300]`	Narration, long-form audio

autoMode: true and small chunkLengthSchedule values minimise time-to-first-audio. Use larger values when prosody quality matters more than TTFA.

Avoid calling send(text, true) (flush=true) on every sentence. This bypasses server-side semantic chunking, forces a cold model prefill per segment, and degrades both latency and audio quality.
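
The three presets in the table can be kept as named constants so call sites do not repeat the numbers. This is a sketch only: `CHUNKING_PRESETS`, `PresetName`, and `withPreset` are hypothetical names, not SDK exports. The values are copied from the table above.

```typescript
interface ChunkingOptions {
  autoMode?: boolean;
  chunkLengthSchedule: number[];
}

type PresetName = 'lowLatency' | 'balanced' | 'highQuality';

// Values taken from the Chunking presets table.
const CHUNKING_PRESETS: Record<PresetName, ChunkingOptions> = {
  lowLatency: { autoMode: true, chunkLengthSchedule: [50, 100, 150, 250] },
  balanced: { chunkLengthSchedule: [80, 150, 250] },
  highQuality: { chunkLengthSchedule: [120, 200, 300] },
};

// Merge a preset into a base config. The schedule array is copied so that
// callers cannot mutate the shared preset by accident.
function withPreset<T extends object>(base: T, preset: PresetName): T & ChunkingOptions {
  const options = CHUNKING_PRESETS[preset];
  return {
    ...base,
    ...options,
    chunkLengthSchedule: [...options.chunkLengthSchedule],
  };
}
```

For example, `withPreset({ voiceId: 1071 }, 'lowLatency')` yields a config object that could be passed as the first argument to `streamingSession`.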

Session methods

streamingSession(config, callbacks) returns a StreamingSession:

Method	Returns	Description
`session.connect()`	`Promise<void>`	Open and authenticate the WebSocket.
`session.send(text, flush?)`	`void`	Buffer `text`; `flush=true` forces synthesis of whatever is buffered.
`session.cancelCurrent()`	`Promise<void>`	Barge-in: abandon the current turn and drop buffered/queued text, keeping the socket open.
`session.endSession()`	`Promise<void>`	End the current turn (flushing remaining text) but keep the WebSocket open for reuse.
`session.updateConfig(config)`	`void`	Update configuration (e.g. `voiceId`) for the next session after `endSession()`.
`session.close()`	`Promise<void>`	Close the session and the WebSocket.
`session.isConnected`	`boolean`	Whether the underlying WebSocket is open.
`session.lastUsage`	`SessionUsage | null`	Per-session usage (audio time + amount charged) from the most recently closed session, for billing your own customers per conversation. `null` before the first session closes. See SessionUsage.

Multi-Context Sessions

A multi-context session manages up to 20 independent audio-generation contexts over a single WebSocket. Each context has its own text buffer, voice settings, and generation queue — useful for multi-speaker conversations, pre-buffering one stream while another plays, or interleaving audio for dynamic dialogue.

const session = client.tts.createMultiContextSession({
  defaultVoiceId: 1071,
  language: 'en',
});

await session.connect({
  onChunk: (chunk) => {
    // chunk.contextId tells you which speaker this audio belongs to
    playAudio(chunk.contextId, chunk.audio);
  },
  onContextClosed: (contextId, usage) =>
    // `usage` carries this conversation's audio time + amount charged (EUR cents)
    console.log(`${contextId} finished`, usage),
  onError: (err, contextId) => console.error(contextId, err),
});

// Create contexts, optionally with different voices
session.createContext('narrator', { voiceId: 1071 });
session.createContext('character', { voiceId: 1072 });

// Send text to a specific context
session.send('narrator', 'The story begins.');
session.send('character', 'Hello there!', true); // flush

// Close one context, then the whole session
session.closeContext('narrator');
session.close();

Create the session with createMultiContextSession(config?):

Config field	Type	Default	Description
`defaultVoiceId`	`number`	–	Default voice for contexts that don’t override it.
`sampleRate`	`number`	`24000`	Output sample rate.
`outputFormat`	`string`	–	Combined codec + rate token (`pcm_8000`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `ulaw_8000`, `alaw_8000`).
`cfgScale`	`number`	`2.0`	Guidance scale.
`temperature`	`number`	`0.5`	Sampling variance (0.0–1.0).
`maxNewTokens`	`number`	`2048`	Maximum tokens per generation.
`normalize`	`boolean`	`true`	Enable text normalization.
`language`	`string`	–	Normalization language.
`inactivityTimeout`	`number`	`20.0`	Seconds before an idle context auto-closes.

MultiContextSession methods and properties:

Member	Returns	Description
`connect(callbacks)`	`Promise<void>`	Open the WebSocket with `MultiContextCallbacks`.
`createContext(contextId, { voiceId?, voiceSettings? })`	`void`	Create a context with an optional voice override.
`send(contextId, text, flush?)`	`void`	Send text to a context.
`flush(contextId)`	`void`	Flush a context’s buffer.
`closeContext(contextId, immediate?)`	`void`	Close a context. `immediate=true` barges in, discarding buffered/queued text.
`keepAlive(contextId)`	`void`	Reset a context’s inactivity timeout.
`close()`	`void`	Close the session.
`usageFor(contextId)`	`SessionUsage | null`	Per-context usage (audio time + amount charged) for a closed context; each context is its own conversation. `null` until that context closes. Also delivered as the second arg to `onContextClosed`. See SessionUsage.
`contextUsage`	`Map<string, SessionUsage>`	Map of `contextId` → usage for every context closed so far.
`sessionId`	`string | null`	Server-assigned session ID.
`activeContexts`	`string[]`	Currently active context IDs.
`isConnected`	`boolean`	Whether the WebSocket is open.
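
A context that is pre-buffered and then left idle can reach `inactivityTimeout` (20 seconds by default, per the config table). One way to hold it open is to call `keepAlive` on a timer. The sketch below assumes that pinging at half the timeout is a safe margin; that ratio is an arbitrary choice, not a documented recommendation. `KeepAliveTarget` and `startKeepAlive` are hypothetical names.

```typescript
// Hypothetical structural type: only keepAlive() is needed here.
interface KeepAliveTarget {
  keepAlive(contextId: string): void;
}

// Ping one context at half its inactivity timeout until the returned
// stop function is called.
function startKeepAlive(
  session: KeepAliveTarget,
  contextId: string,
  inactivityTimeoutSec = 20,
): () => void {
  const intervalMs = Math.max(1, Math.floor((inactivityTimeoutSec * 1000) / 2));
  const timer = setInterval(() => session.keepAlive(contextId), intervalMs);
  return () => clearInterval(timer);
}
```

Call the returned function once the context starts receiving text again or is closed, otherwise the timer keeps running.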

Audio arrives via the onChunk callback as a MultiContextAudioChunk — an AudioChunk plus a contextId field identifying its context.
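
Because every chunk carries a `contextId`, a single `onChunk` handler can sort audio into per-context queues, for instance to pre-buffer one speaker while another plays. The sketch below is illustrative: `ContextAudioRouter` and `ContextChunk` are hypothetical names, and the audio payload type is left generic because `AudioChunk` is documented elsewhere.

```typescript
// Hypothetical minimal shape: the two chunk fields this sketch relies on.
interface ContextChunk<A> {
  contextId: string;
  audio: A;
}

// Buffers audio per context in arrival order.
class ContextAudioRouter<A> {
  private readonly queues = new Map<string, A[]>();

  // Call from onChunk.
  push(chunk: ContextChunk<A>): void {
    const queue = this.queues.get(chunk.contextId);
    if (queue) {
      queue.push(chunk.audio);
    } else {
      this.queues.set(chunk.contextId, [chunk.audio]);
    }
  }

  // Remove and return everything buffered for one context.
  drain(contextId: string): A[] {
    const queue = this.queues.get(contextId) ?? [];
    this.queues.delete(contextId);
    return queue;
  }
}
```

A player could call `drain('character')` when the narrator finishes, so the second speaker starts from audio that was generated in the background.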

Multi-context types

interface MultiContextConfig {
  defaultVoiceId?: number;
  sampleRate?: number;
  outputFormat?: string;  // e.g. 'pcm_24000', 'ulaw_8000' (see config table)
  cfgScale?: number;
  temperature?: number;
  maxNewTokens?: number;
  normalize?: boolean;
  language?: string;
  inactivityTimeout?: number;
}

interface MultiContextAudioChunk extends AudioChunk {
  contextId: string;  // which context this audio belongs to
}

interface ContextVoiceSettings {
  stability?: number;
  similarityBoost?: number;
  style?: number;
  useSpeakerBoost?: boolean;
  speed?: number;
}

interface MultiContextCallbacks {
  onSessionStarted?: (sessionId: string) => void;
  onContextCreated?: (contextId: string) => void;
  onChunk?: (chunk: MultiContextAudioChunk) => void;
  // All audio admitted before a flush has been delivered for this context
  // (ElevenLabs is_final equivalent); also fires before a graceful
  // onContextClosed.
  onFinal?: (contextId: string) => void;
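  // `usage` is the per-context SessionUsage also returned by usageFor(contextId)
  onContextClosed?: (contextId: string, usage: SessionUsage) => void;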
  onContextTimeout?: (contextId: string) => void;
  onSessionClosed?: (stats: Record<string, unknown>) => void;
  onError?: (error: Error, contextId?: string) => void;
}

Shared interfaces (StreamConfig, StreamingSessionCallbacks, SessionUsage, AudioChunk) are documented in Types & Errors.
