Audio Formats - KugelAudio

This page is the single reference for what the TTS endpoints emit: the default PCM encoding, the opt-in output_format codecs, the audio chunk wire format, and the watermark.

Default format

Encoding: PCM 16-bit signed little-endian (pcm_s16le)
Channels: Mono (1 channel)
Sample rate: 24000 Hz (default; native generation rate)
Byte order: Little-endian

Lower sample rates (8000, 16000, 22050) use server-side resampling with negligible latency impact (see Latency).
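Because the default payload is headerless PCM, most players need a container around it before they can open it. The following is a minimal sketch, using only Python's standard `wave` module, of wrapping raw mono `pcm_s16le` bytes in a WAV header. The helper name `pcm_to_wav` is illustrative and not part of the API.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw mono pcm_s16le bytes in a WAV container (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples = 2 bytes each
        wav.setframerate(sample_rate)  # must match the rate you requested
        wav.writeframes(pcm)           # WAV stores PCM little-endian, like the API
    return buf.getvalue()


# One second of silence at the default rate: 24000 samples * 2 bytes.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence)
```

If you request a lower `sample_rate`, pass the same value to the helper; otherwise playback speed and pitch will be wrong.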

AI watermark (EU AI Act Art. 50): All generated audio, on both REST and WebSocket endpoints, is automatically watermarked with AudioSeal, a neural watermark. EU AI Act Article 50 requires this for AI-generated audio content. The watermark is inaudible and does not affect audio quality.

Output formats (`output_format`)

By default the API emits linear PCM16 at sample_rate. To request a different codec — for example G.711 µ-law/a-law for telephony — send the combined output_format token instead of (or in addition to) sample_rate. The token carries codec and rate as one value, so impossible combinations like “µ-law at 24 kHz” cannot be expressed.

| `output_format` | Codec | Rate (Hz) | `enc` in audio frames | Bytes/sample |
| --- | --- | --- | --- | --- |
| `pcm_8000` | Linear PCM16 | 8000 | `pcm_s16le` | 2 |
| `pcm_16000` | Linear PCM16 | 16000 | `pcm_s16le` | 2 |
| `pcm_22050` | Linear PCM16 | 22050 | `pcm_s16le` | 2 |
| `pcm_24000` | Linear PCM16 | 24000 | `pcm_s16le` | 2 |
| `ulaw_8000` | G.711 µ-law | 8000 | `mulaw` | 1 |
| `alaw_8000` | G.711 a-law | 8000 | `alaw` | 1 |
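The rate and bytes-per-sample columns are enough to size buffers and convert byte counts to playback time. As a rough sketch, the table can be transcribed into a lookup; the names `OUTPUT_FORMATS`, `bytes_per_second`, and `chunk_duration_seconds` are illustrative, and the arithmetic assumes mono audio as stated under Default format.

```python
# Rows transcribed from the table: token -> (enc, rate_hz, bytes_per_sample)
OUTPUT_FORMATS = {
    "pcm_8000": ("pcm_s16le", 8000, 2),
    "pcm_16000": ("pcm_s16le", 16000, 2),
    "pcm_22050": ("pcm_s16le", 22050, 2),
    "pcm_24000": ("pcm_s16le", 24000, 2),
    "ulaw_8000": ("mulaw", 8000, 1),
    "alaw_8000": ("alaw", 8000, 1),
}


def bytes_per_second(output_format: str) -> int:
    """Raw audio bandwidth for one mono stream in the given format."""
    _enc, rate, width = OUTPUT_FORMATS[output_format]
    return rate * width


def chunk_duration_seconds(output_format: str, num_bytes: int) -> float:
    """Playback time represented by num_bytes of decoded (non-base64) audio."""
    return num_bytes / bytes_per_second(output_format)


# G.711 at 8 kHz needs 8000 bytes per second; PCM16 at 24 kHz needs 48000.
telephony_rate = bytes_per_second("ulaw_8000")
default_rate = bytes_per_second("pcm_24000")
```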

Notes:

Backwards compatible. Omitting output_format leaves the existing behavior unchanged: you get pcm_s16le frames at the requested sample_rate. The strict checks below only apply to requests that send output_format.
Conflicts are rejected. Sending both output_format and a sample_rate that disagrees with the token’s rate returns a VALIDATION_ERROR (HTTP 400 / WS error frame). Send one, or matching values.
Set-once per session/context. On the Stream Input and Multi-Context endpoints the codec is locked on first use; a later mid-stream codec change is ignored.
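G.711 frame semantics. For ulaw_8000 / alaw_8000, audio frames carry enc: "mulaw" / "alaw" respectively, with sr: 8000, and samples equals the payload byte length (1 byte/sample). Decode with the standard G.711 tables, e.g. Python audioop.ulaw2lin(payload, 2) for µ-law or audioop.alaw2lin(payload, 2) for a-law (audioop was removed from the standard library in Python 3.13, so newer interpreters need another decoder). On REST, the response uses Content-Type: audio/basic, and X-Audio-Format is mulaw or alaw to match the requested codec.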
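Where `audioop` is unavailable, µ-law payloads can be expanded with a small lookup table. The sketch below implements the standard G.711 µ-law expansion in pure Python and returns `pcm_s16le` bytes; the function names are illustrative, and a-law needs its own table, which is not shown here.

```python
import struct


def _ulaw_to_linear(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF                    # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84  # 0x84 is the bias
    return -sample if sign else sample


# Precompute all 256 values once; decoding is then a table lookup per byte.
_ULAW_TABLE = [_ulaw_to_linear(b) for b in range(256)]


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode mu-law bytes (1 byte/sample) to pcm_s16le (2 bytes/sample)."""
    samples = [_ULAW_TABLE[b] for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


# 0xFF is mu-law silence; 0x00 and 0x80 are the negative and positive extremes.
decoded = struct.unpack("<3h", ulaw_to_pcm16(b"\xff\x00\x80"))
```

The output is twice the size of the input, which matches the Bytes/sample column: 1 for `mulaw`, 2 for `pcm_s16le`.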

Telephony example (µ-law 8 kHz)

{
  "text": "Your verification code is 4 8 1 5.",
  "voice_id": 1071,
  "output_format": "ulaw_8000",
  "language": "en"
}
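Since a request that sends both output_format and a disagreeing sample_rate is rejected, it can be convenient to catch the mismatch before sending. The following is a client-side sketch only: the function name `check_format_fields` is hypothetical, the server remains the authority, and rejecting unknown tokens locally is an assumption of this sketch rather than documented server behavior.

```python
KNOWN_FORMATS = {
    "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000",
    "ulaw_8000", "alaw_8000",
}


def check_format_fields(payload: dict) -> None:
    """Raise ValueError if output_format and sample_rate disagree."""
    fmt = payload.get("output_format")
    if fmt is None:
        return  # no output_format: the strict checks do not apply
    if fmt not in KNOWN_FORMATS:
        # Assumption: treat tokens outside the documented table as invalid.
        raise ValueError(f"unknown output_format: {fmt!r}")
    token_rate = int(fmt.rsplit("_", 1)[1])  # "ulaw_8000" -> 8000
    rate = payload.get("sample_rate")
    if rate is not None and rate != token_rate:
        raise ValueError(
            f"sample_rate {rate} conflicts with {fmt!r} ({token_rate} Hz)"
        )


request = {
    "text": "Your verification code is 4 8 1 5.",
    "voice_id": 1071,
    "output_format": "ulaw_8000",
    "language": "en",
}
check_format_fields(request)  # passes: no sample_rate is sent

try:
    check_format_fields({**request, "sample_rate": 24000})
    conflict_detected = False
except ValueError:
    conflict_detected = True  # 24000 disagrees with the token's 8000
```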

Audio chunk fields

Every WebSocket endpoint streams audio as JSON frames with these fields:

| Field | Type | Description |
| --- | --- | --- |
| `audio` | string | Base64-encoded audio data (encoding per `enc`) |
| `enc` | string | Audio encoding (`pcm_s16le`, `mulaw`, or `alaw`) |
| `idx` | integer | Chunk index (0-based) |
| `sr` | integer | Sample rate in Hz |
| `samples` | integer | Number of samples in this chunk |
| `chunk_id` | integer | Text chunk ID (present on `/ws/tts/stream` and `/ws/tts/multi`) |
| `context_id` | string | Context identifier (present on `/ws/tts/multi`) |
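A received frame can be unpacked with the standard library alone. The sketch below parses one JSON frame, base64-decodes `audio`, and checks the payload length against `samples`. The helper name `decode_audio_frame` is illustrative, and the 2-bytes-per-sample check for `pcm_s16le` is inferred from the output formats table and the mono default, so treat it as an assumption.

```python
import base64
import json

# Bytes per sample for each documented `enc` value (mono audio).
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "mulaw": 1, "alaw": 1}


def decode_audio_frame(message: str) -> tuple[bytes, dict]:
    """Return (raw audio bytes, frame metadata) for one JSON audio frame."""
    frame = json.loads(message)
    payload = base64.b64decode(frame["audio"])
    width = BYTES_PER_SAMPLE[frame["enc"]]
    if len(payload) != frame["samples"] * width:
        raise ValueError(
            f"chunk {frame['idx']}: {len(payload)} bytes does not match "
            f"{frame['samples']} samples of {frame['enc']}"
        )
    return payload, frame


# Synthetic frame for illustration: four silent PCM16 samples.
example = json.dumps({
    "audio": base64.b64encode(b"\x00\x00" * 4).decode("ascii"),
    "enc": "pcm_s16le",
    "idx": 0,
    "sr": 24000,
    "samples": 4,
})
pcm, meta = decode_audio_frame(example)
```

Appending each payload in `idx` order reconstructs the stream; `sr` and `enc` tell the player how to interpret the bytes.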

Related

Generate Speech: the canonical request parameter reference.
ElevenLabs-compatible output: MP3 and ElevenLabs-shaped responses via the proxy.