output_format codecs, the audio chunk wire format,
and the watermark.
Default format
- Encoding: PCM 16-bit signed little-endian (pcm_s16le)
- Channels: Mono (1 channel)
- Sample rate: 24000 Hz (default; native generation rate)
- Byte order: Little-endian
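
Because the default stream is headerless PCM, most players need it wrapped in a container first. The following is a minimal sketch using Python's standard `wave` module; the helper name `pcm_to_wav` is hypothetical and not part of the API.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw mono pcm_s16le bytes in a WAV container (hypothetical helper)."""
    if len(pcm) % 2:
        raise ValueError("pcm_s16le data must contain whole 2-byte samples")
    buf = io.BytesIO()
    # Note: the wave module expects samples in host byte order, so this
    # sketch assumes a little-endian host (the common case).
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # 24000 Hz unless another rate was requested
        wav.writeframes(pcm)
    return buf.getvalue()
```

At the default settings one second of audio is 24000 samples × 2 bytes = 48000 bytes of payload.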
AI watermark (EU AI Act Art. 50): All generated audio — REST and
WebSocket — is automatically watermarked using
AudioSeal, an imperceptible
neural watermark. This is required under EU AI Act Article 50 for
AI-generated audio content. The watermark is inaudible and does not affect
audio quality.
Output formats (output_format)
By default the API emits linear PCM16 at sample_rate. To request a different
codec — for example G.711 µ-law/a-law for telephony — send the combined
output_format token instead of (or in addition to) sample_rate. The token
carries codec and rate as one value, so impossible combinations like
“µ-law at 24 kHz” cannot be expressed.
| output_format | Codec | Rate (Hz) | enc in audio frames | Bytes/sample |
|---|---|---|---|---|
| pcm_8000 | Linear PCM16 | 8000 | pcm_s16le | 2 |
| pcm_16000 | Linear PCM16 | 16000 | pcm_s16le | 2 |
| pcm_22050 | Linear PCM16 | 22050 | pcm_s16le | 2 |
| pcm_24000 | Linear PCM16 | 24000 | pcm_s16le | 2 |
| ulaw_8000 | G.711 µ-law | 8000 | mulaw | 1 |
| alaw_8000 | G.711 a-law | 8000 | alaw | 1 |
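
A client may find it convenient to mirror this table locally, for instance to size buffers or to catch a mismatched `sample_rate` before sending a request. The sketch below is one way to do that; `OutputFormat`, `resolve_format`, and `bytes_per_second` are hypothetical names, and the byte-rate calculation assumes mono output as in the default format.

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    rate: int              # sample rate in Hz
    enc: str               # value of `enc` in audio frames
    bytes_per_sample: int


# Mirrors the output_format table; hypothetical client-side copy.
OUTPUT_FORMATS = {
    "pcm_8000": OutputFormat("Linear PCM16", 8000, "pcm_s16le", 2),
    "pcm_16000": OutputFormat("Linear PCM16", 16000, "pcm_s16le", 2),
    "pcm_22050": OutputFormat("Linear PCM16", 22050, "pcm_s16le", 2),
    "pcm_24000": OutputFormat("Linear PCM16", 24000, "pcm_s16le", 2),
    "ulaw_8000": OutputFormat("G.711 µ-law", 8000, "mulaw", 1),
    "alaw_8000": OutputFormat("G.711 a-law", 8000, "alaw", 1),
}


def resolve_format(output_format: str,
                   sample_rate: Optional[int] = None) -> OutputFormat:
    """Look up a token and flag a sample_rate that disagrees with it."""
    try:
        fmt = OUTPUT_FORMATS[output_format]
    except KeyError:
        raise ValueError(f"unknown output_format: {output_format!r}") from None
    if sample_rate is not None and sample_rate != fmt.rate:
        raise ValueError(
            f"sample_rate {sample_rate} conflicts with {output_format} ({fmt.rate} Hz)"
        )
    return fmt


def bytes_per_second(fmt: OutputFormat) -> int:
    """Payload byte rate, assuming a single (mono) channel."""
    return fmt.rate * fmt.bytes_per_sample
```

For example, `bytes_per_second(resolve_format("ulaw_8000"))` gives 8000, while `resolve_format("ulaw_8000", 24000)` raises locally instead of waiting for the server to reject the request.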
- Backwards compatible. Omitting output_format is identical to the default behavior: you get pcm_s16le frames. The strict checks below only apply to requests that send output_format.
- Conflicts are rejected. Sending both output_format and a sample_rate that disagrees with the token's rate returns a VALIDATION_ERROR (HTTP 400 / WS error frame). Send one, or matching values.
- Set-once per session/context. On the Stream Input and Multi-Context endpoints the codec is locked on first use; a later mid-stream codec change is ignored.
- G.711 frame semantics. For ulaw_8000/alaw_8000, audio frames carry enc: "mulaw"/"alaw", sr: 8000, and samples equals the byte length (1 byte/sample). Decode with the standard G.711 tables (e.g. Python audioop.ulaw2lin(payload, 2); note that audioop was removed from the standard library in Python 3.13). On REST, the response uses Content-Type: audio/basic and X-Audio-Format: mulaw/alaw.
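
On Python versions where `audioop` is unavailable, the µ-law expansion can be done with a small lookup table. The sketch below follows the standard G.711 µ-law decoding arithmetic; the function names are hypothetical, and a-law would need its own, analogous table.

```python
import struct


def _ulaw_byte_to_linear(u: int) -> int:
    """Expand one G.711 µ-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                      # µ-law bytes are stored bit-inverted
    t = ((u & 0x0F) << 3) + 0x84       # mantissa plus bias (0x84 = 132)
    t <<= (u & 0x70) >> 4              # scale by the 3-bit segment number
    return (0x84 - t) if (u & 0x80) else (t - 0x84)


# Precompute all 256 possible values once.
_ULAW_TABLE = [_ulaw_byte_to_linear(b) for b in range(256)]


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode µ-law bytes to pcm_s16le bytes (2 output bytes per input byte)."""
    samples = [_ULAW_TABLE[b] for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)
```

The output is twice the length of the input, so a frame with `samples: 160` decodes to 320 bytes of linear PCM.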
Telephony example (µ-law 8 kHz)
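
On the receiving side, telephony stacks usually expect fixed-size packets, while the API's chunks may be of arbitrary length. The sketch below re-chunks incoming µ-law payloads into 20 ms packets (160 bytes at 8000 Hz, 1 byte/sample). The 20 ms interval is an assumption (a common but not universal choice), and `UlawPacketizer` is a hypothetical helper, not part of the API.

```python
from typing import List, Optional


class UlawPacketizer:
    """Re-chunk a µ-law byte stream into fixed-duration packets."""

    def __init__(self, sample_rate: int = 8000, packet_ms: int = 20) -> None:
        # G.711 uses 1 byte per sample, so bytes per packet == samples per packet.
        self.packet_bytes = sample_rate * packet_ms // 1000
        self._buf = bytearray()

    def feed(self, payload: bytes) -> List[bytes]:
        """Add decoded-from-base64 frame bytes; return all complete packets."""
        self._buf.extend(payload)
        packets = []
        while len(self._buf) >= self.packet_bytes:
            packets.append(bytes(self._buf[: self.packet_bytes]))
            del self._buf[: self.packet_bytes]
        return packets

    def flush(self) -> Optional[bytes]:
        """Return the final partial packet padded with µ-law silence (0xFF)."""
        if not self._buf:
            return None
        pad = self.packet_bytes - len(self._buf)
        packet = bytes(self._buf) + b"\xff" * pad
        self._buf.clear()
        return packet
```

Call `feed()` with each frame's decoded `audio` payload and `flush()` once at the end of the stream so the last few milliseconds are not dropped.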
Audio chunk fields
Every WebSocket endpoint streams audio as JSON frames with these fields:

| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded audio data (encoding per enc) |
| enc | string | Audio encoding (pcm_s16le, mulaw, or alaw) |
| idx | integer | Chunk index (0-based) |
| sr | integer | Sample rate in Hz |
| samples | integer | Number of samples in this chunk |
| chunk_id | integer | Text chunk ID (present on /ws/tts/stream and /ws/tts/multi) |
| context_id | string | Context identifier (present on /ws/tts/multi) |
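
A client-side parser for these frames might look like the sketch below. It assumes the incoming message is already known to be an audio frame and that the audio is mono, so the payload length should equal `samples` times the bytes per sample for the given `enc`. `AudioChunk` and `parse_audio_chunk` are hypothetical names.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional

# Bytes per sample for each `enc` value listed in the field table.
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "mulaw": 1, "alaw": 1}


@dataclass
class AudioChunk:
    payload: bytes                    # decoded audio bytes
    enc: str
    idx: int
    sr: int
    samples: int
    chunk_id: Optional[int] = None    # only on /ws/tts/stream and /ws/tts/multi
    context_id: Optional[str] = None  # only on /ws/tts/multi

    @property
    def duration_s(self) -> float:
        return self.samples / self.sr


def parse_audio_chunk(message: str) -> AudioChunk:
    """Decode one JSON audio frame and sanity-check its length."""
    frame = json.loads(message)
    payload = base64.b64decode(frame["audio"])
    enc = frame["enc"]
    expected = frame["samples"] * BYTES_PER_SAMPLE[enc]  # assumes mono
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes for {enc}, got {len(payload)}")
    return AudioChunk(
        payload=payload,
        enc=enc,
        idx=frame["idx"],
        sr=frame["sr"],
        samples=frame["samples"],
        chunk_id=frame.get("chunk_id"),
        context_id=frame.get("context_id"),
    )
```

Using `.get()` for `chunk_id` and `context_id` lets the same parser serve endpoints that omit those fields.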
Related
- Generate Speech: The canonical request parameter reference
- ElevenLabs-compatible output: MP3 and ElevenLabs-shaped responses via the proxy