This page is the single reference for what the TTS endpoints emit: the default PCM encoding, the opt-in output_format codecs, the audio chunk wire format, and the watermark.

Default format

  • Encoding: PCM 16-bit signed little-endian (pcm_s16le)
  • Channels: Mono (1 channel)
  • Sample rate: 24000 Hz (default; native generation rate)
  • Byte order: Little-endian
Lower sample rates (8000, 16000, 22050) use server-side resampling with negligible latency impact (see Latency).
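Because the default output is headerless PCM, most players need a container around it before they can open it. The following is a minimal sketch using only the Python standard library; the helper names `pcm_to_wav` and `pcm_duration_seconds` are illustrative and not part of the API.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw pcm_s16le mono audio in a WAV container (hypothetical helper)."""
    if len(pcm) % 2:
        raise ValueError("pcm_s16le data must have an even byte length")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples = 2 bytes each
        wav.setframerate(sample_rate)  # 24000 Hz unless a lower rate was requested
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes, sample_rate: int = 24000) -> float:
    """Duration of mono 16-bit PCM: bytes / 2 bytes per sample / samples per second."""
    return len(pcm) / 2 / sample_rate
```

Pass the same rate you requested from the API. A mismatched `sample_rate` produces audio that plays too fast or too slow.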
AI watermark (EU AI Act Art. 50): All generated audio, from both REST and WebSocket endpoints, is automatically watermarked using AudioSeal, a neural watermark, as required under EU AI Act Article 50 for AI-generated audio content. The watermark is inaudible and does not affect audio quality.

Output formats (output_format)

By default the API emits linear PCM16 at sample_rate. To request a different codec — for example G.711 µ-law/a-law for telephony — send the combined output_format token instead of (or in addition to) sample_rate. The token carries codec and rate as one value, so impossible combinations like “µ-law at 24 kHz” cannot be expressed.
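| output_format | Codec | Rate (Hz) | enc in audio frames | Bytes/sample |
|---|---|---|---|---|
| pcm_8000 | Linear PCM16 | 8000 | pcm_s16le | 2 |
| pcm_16000 | Linear PCM16 | 16000 | pcm_s16le | 2 |
| pcm_22050 | Linear PCM16 | 22050 | pcm_s16le | 2 |
| pcm_24000 | Linear PCM16 | 24000 | pcm_s16le | 2 |
| ulaw_8000 | G.711 µ-law | 8000 | mulaw | 1 |
| alaw_8000 | G.711 a-law | 8000 | alaw | 1 |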
Notes:
  • Backwards compatible. Omitting output_format is identical to the default behavior — you get pcm_s16le frames. The strict checks below only apply to requests that send output_format.
  • Conflicts are rejected. Sending both output_format and a sample_rate that disagrees with the token’s rate returns a VALIDATION_ERROR (HTTP 400 / WS error frame). Send one, or matching values.
  • Set-once per session/context. On the Stream Input and Multi-Context endpoints the codec is locked on first use; a later mid-stream codec change is ignored.
  • G.711 frame semantics. For ulaw_8000 and alaw_8000, audio frames carry enc: "mulaw" or "alaw" respectively, sr: 8000, and samples equals the byte length (1 byte/sample). Decode with the standard G.711 tables (e.g. Python audioop.ulaw2lin(payload, 2) on Python 3.12 and earlier; the audioop module was removed in Python 3.13). On REST, the response uses Content-Type: audio/basic and an X-Audio-Format header set to mulaw or alaw.
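The token table and the conflict rule above can be mirrored in a client-side pre-check so that a mismatched request fails before it is sent. This is only a sketch: `OUTPUT_FORMATS` and `resolve_format` are hypothetical names, and the server remains the authority on what it accepts.

```python
from typing import Optional, Tuple

# output_format token -> (enc in audio frames, rate in Hz, bytes per sample),
# copied from the table in this section.
OUTPUT_FORMATS = {
    "pcm_8000": ("pcm_s16le", 8000, 2),
    "pcm_16000": ("pcm_s16le", 16000, 2),
    "pcm_22050": ("pcm_s16le", 22050, 2),
    "pcm_24000": ("pcm_s16le", 24000, 2),
    "ulaw_8000": ("mulaw", 8000, 1),
    "alaw_8000": ("alaw", 8000, 1),
}


def resolve_format(
    output_format: Optional[str] = None,
    sample_rate: Optional[int] = None,
) -> Tuple[str, int, int]:
    """Return (enc, rate, bytes_per_sample) the request should produce."""
    if output_format is None:
        # Backwards-compatible path: PCM16 at sample_rate (24000 Hz by default).
        # The rate itself is not validated here; that is left to the server.
        return ("pcm_s16le", 24000 if sample_rate is None else sample_rate, 2)
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")
    enc, rate, width = OUTPUT_FORMATS[output_format]
    if sample_rate is not None and sample_rate != rate:
        # The API would answer this with a VALIDATION_ERROR.
        raise ValueError(
            f"sample_rate {sample_rate} conflicts with {output_format} ({rate} Hz)"
        )
    return enc, rate, width
```

For example, `resolve_format("ulaw_8000", 24000)` raises locally, while `resolve_format("ulaw_8000", 8000)` and `resolve_format("ulaw_8000")` both resolve to µ-law at 8000 Hz.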

Telephony example (µ-law 8 kHz)

{
  "text": "Your verification code is 4 8 1 5.",
  "voice_id": 1071,
  "output_format": "ulaw_8000",
  "language": "en"
}
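A request like this returns µ-law bytes, one byte per sample. If the audio later needs to be processed as linear PCM, it can be expanded with the standard G.711 µ-law table. The sketch below builds that table in pure Python so it does not depend on audioop; the function names are illustrative, and a-law needs a different table that is not shown here.

```python
import struct

_BIAS = 0x84  # G.711 µ-law bias (132)


def _ulaw_byte_to_linear(u: int) -> int:
    """Expand one µ-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                        # µ-law bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07           # 3-bit segment number
    mantissa = u & 0x0F                  # 4-bit step within the segment
    magnitude = (((mantissa << 3) + _BIAS) << exponent) - _BIAS
    return -magnitude if u & 0x80 else magnitude


# Precompute all 256 values once; decoding is then a table lookup per byte.
_ULAW_TABLE = [_ulaw_byte_to_linear(b) for b in range(256)]


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Convert µ-law bytes to pcm_s16le bytes (2 output bytes per input byte)."""
    return struct.pack(f"<{len(payload)}h", *(_ULAW_TABLE[b] for b in payload))
```

The output is mono pcm_s16le at the same 8000 Hz rate, twice the byte length of the input.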

Audio chunk fields

Every WebSocket endpoint streams audio as JSON frames with these fields:
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded audio data (encoding per enc) |
| enc | string | Audio encoding (pcm_s16le, mulaw, or alaw) |
| idx | integer | Chunk index (0-based) |
| sr | integer | Sample rate in Hz |
| samples | integer | Number of samples in this chunk |
| chunk_id | integer | Text chunk ID (present on /ws/tts/stream and /ws/tts/multi) |
| context_id | string | Context identifier (present on /ws/tts/multi) |
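A client can turn each frame into raw audio bytes by parsing the JSON and base64-decoding the audio field. The sketch below is one way to do that. `AudioChunk` and `parse_audio_frame` are hypothetical names, the input is assumed to already be an audio frame (not an error or control message), and the length check assumes mono audio with 2 bytes per sample for pcm_s16le and 1 byte per sample for mulaw and alaw.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional

# Bytes per sample for each documented enc value.
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "mulaw": 1, "alaw": 1}


@dataclass
class AudioChunk:
    data: bytes                       # decoded audio payload
    enc: str
    idx: int
    sr: int
    samples: int
    chunk_id: Optional[int] = None    # only on /ws/tts/stream and /ws/tts/multi
    context_id: Optional[str] = None  # only on /ws/tts/multi


def parse_audio_frame(raw: str) -> AudioChunk:
    """Decode one JSON audio frame and sanity-check its length."""
    msg = json.loads(raw)
    enc = msg["enc"]
    if enc not in BYTES_PER_SAMPLE:
        raise ValueError(f"unexpected enc: {enc!r}")
    data = base64.b64decode(msg["audio"])
    expected = msg["samples"] * BYTES_PER_SAMPLE[enc]
    if len(data) != expected:
        raise ValueError(f"got {len(data)} bytes, expected {expected}")
    return AudioChunk(
        data=data,
        enc=enc,
        idx=msg["idx"],
        sr=msg["sr"],
        samples=msg["samples"],
        chunk_id=msg.get("chunk_id"),
        context_id=msg.get("context_id"),
    )
```

Ordering chunks by idx and concatenating their data fields then yields a continuous stream in the encoding named by enc.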

Generate Speech

The canonical request parameter reference

ElevenLabs-compatible output

MP3 and ElevenLabs-shaped responses via the proxy