Text-to-Speech
Basic Generation
Generate complete audio and receive it all at once:
import com.kugelaudio.sdk.GenerateRequest;
import com.kugelaudio.sdk.AudioResponse;
AudioResponse audio = client.tts().generate(
GenerateRequest.builder("Hello, this is a test of the KugelAudio text-to-speech system.")
.modelId("kugel-3") // Canonical production model (see /models)
.voiceId(1071) // Optional: specific voice ID
.cfgScale(2.0) // Guidance scale (1.0-5.0)
.temperature(0.5) // Sampling variance 0.0-1.0 (omit for server default)
.maxNewTokens(2048)
.sampleRate(24000)
.normalize(true) // Enable text normalization (default)
.language("en") // Language for normalization (see below)
.wordTimestamps(false)
.speed(1.0) // Playback speed 0.8-1.2 (pitch-preserving WSOLA)
// Optional: per-request dictionary selection. Omit = all active
// dictionaries; List.of() = none; a list = exactly those (incl.
// inactive), bypassing the language filter.
// .dictionaryIds(List.of(7, 9))
.build()
);
// Audio properties
System.out.printf("Duration: %.2fms%n", audio.getDurationMs());
System.out.println("Samples: " + audio.getTotalSamples());
System.out.println("Sample rate: " + audio.getSampleRate() + " Hz");
System.out.printf("Generation time: %.0fms%n", audio.getGenerationMs());
System.out.printf("RTF: %.2f%n", audio.getRtf());
// Save to WAV file
audio.saveWav(java.nio.file.Path.of("output.wav"));
// Get raw PCM16 bytes (signed 16-bit little-endian, mono)
byte[] pcmData = audio.getAudio();
// Get WAV bytes with header (in-memory)
byte[] wavBytes = audio.toWavBytes();
// Get normalized float samples [-1.0, 1.0]
float[] floatData = audio.toFloat32();
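To make the relationship between the raw PCM16 bytes, the float samples, and the reported duration concrete, here is a minimal standalone sketch. It does not use the SDK: `Pcm16Sketch` and its methods are hypothetical names, and the scale factor of 32768 is an assumption (the SDK's `toFloat32()` may use a slightly different one).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Pcm16Sketch {
    // Converts signed 16-bit little-endian mono PCM into floats in [-1.0, 1.0).
    // Assumption: samples are scaled by 1/32768; the SDK may scale differently.
    static float[] toFloat32(byte[] pcm) {
        ByteBuffer buf = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[pcm.length / 2]; // 2 bytes per sample
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getShort() / 32768f;
        }
        return out;
    }

    // Duration in milliseconds for a given sample count and sample rate.
    static double durationMs(int totalSamples, int sampleRate) {
        return totalSamples * 1000.0 / sampleRate;
    }

    public static void main(String[] args) {
        // Three samples: 0, 16384, -32768
        byte[] pcm = {0x00, 0x00, 0x00, 0x40, 0x00, (byte) 0x80};
        float[] samples = toFloat32(pcm);
        System.out.println(samples[1]);                // prints 0.5
        System.out.println(durationMs(24000, 24000));  // prints 1000.0
    }
}
```

At the 24000 Hz sample rate used above, one second of mono PCM16 audio is 48000 bytes.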
Streaming Audio
Receive audio chunks as they are generated for lower latency:
import com.kugelaudio.sdk.GenerateRequest;
import com.kugelaudio.sdk.StreamCallbacks;
import com.kugelaudio.sdk.AudioChunk;
import com.kugelaudio.sdk.AudioResponse;
client.tts().stream(
GenerateRequest.builder("Hello, this is streaming audio.")
.modelId("kugel-3")
.language("en")
.build(),
new StreamCallbacks() {
@Override
public void onChunk(AudioChunk chunk) {
System.out.printf("Chunk %d: %d bytes, %d samples%n",
chunk.getIndex(), chunk.getAudio().length, chunk.getSamples());
// playAudio(chunk.getAudio());
}
@Override
public void onComplete(AudioResponse response) {
System.out.printf("Total duration: %.0fms%n", response.getDurationMs());
System.out.printf("Generation time: %.0fms%n", response.getGenerationMs());
}
@Override
public void onError(com.kugelaudio.sdk.KugelAudioException error) {
System.err.println("TTS error: " + error.getMessage());
}
}
);
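If you want to keep the streamed audio as well as play it, one option is to append each chunk to a buffer and wrap the result in a WAV header once streaming completes. The sketch below is standalone and illustrative: `ChunkCollector` is a hypothetical helper, not part of the SDK, and it assumes the chunks carry the same PCM16 little-endian mono format described for `getAudio()` above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class ChunkCollector {
    private final ByteArrayOutputStream pcm = new ByteArrayOutputStream();

    // Call from onChunk with the chunk's raw bytes, e.g. append(chunk.getAudio()).
    void append(byte[] chunkAudio) {
        pcm.write(chunkAudio, 0, chunkAudio.length);
    }

    // Returns a 44-byte canonical WAV header followed by the collected PCM16 mono data.
    byte[] toWav(int sampleRate) {
        byte[] data = pcm.toByteArray();
        ByteBuffer buf = ByteBuffer.allocate(44 + data.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(36 + data.length);          // remaining file size after this field
        buf.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        buf.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(16);                        // fmt chunk size for PCM
        buf.putShort((short) 1);               // audio format: 1 = PCM
        buf.putShort((short) 1);               // channels: mono
        buf.putInt(sampleRate);
        buf.putInt(sampleRate * 2);            // byte rate = sampleRate * channels * 2 bytes
        buf.putShort((short) 2);               // block align = channels * 2 bytes
        buf.putShort((short) 16);              // bits per sample
        buf.put("data".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(data.length);
        buf.put(data);
        return buf.array();
    }
}
```

In a callback you would call `append(...)` from `onChunk` and `toWav(...)` from `onComplete`. If the callbacks can run on a different thread than the reader of the buffer, add your own synchronization.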
Text Normalization
Text normalization converts numbers, dates, times, and other non-verbal text into spoken words:
- “I have 3 apples” → “I have three apples”
- “The meeting is at 2:30 PM” → “The meeting is at two thirty PM”
- “€50.99” → “fifty euros and ninety-nine cents”
// With explicit language (recommended - fastest)
AudioResponse audio = client.tts().generate(
GenerateRequest.builder("I bought 3 items for €50.99 on 01/15/2024.")
.normalize(true)
.language("en") // Specify language for best performance
.build()
);
// With auto-detection (may cause incorrect normalizations)
AudioResponse autoDetected = client.tts().generate(
GenerateRequest.builder("Ich habe 3 Artikel für 50,99€ gekauft.")
.normalize(true)
// language not set - will auto-detect
.build()
);
Supported Languages
| Code | Language | Code | Language |
|---|---|---|---|
| de | German | nl | Dutch |
| en | English | pl | Polish |
| fr | French | sv | Swedish |
| es | Spanish | da | Danish |
| it | Italian | no | Norwegian |
| pt | Portuguese | fi | Finnish |
| cs | Czech | hu | Hungarian |
| ro | Romanian | el | Greek |
| uk | Ukrainian | bg | Bulgarian |
| tr | Turkish | vi | Vietnamese |
| ar | Arabic | hi | Hindi |
| zh | Chinese | ja | Japanese |
| ko | Korean | | |
Using .normalize(true) without .language(...) may cause incorrect normalizations, especially for short texts or languages that share similar vocabulary. Always specify language when you know it.
Use <spell> tags to spell out text letter by letter — useful for email addresses, codes, and acronyms:
// Spell out an email address
AudioResponse audio = client.tts().generate(
GenerateRequest.builder("Contact me at <spell>kajo@kugelaudio.com</spell>")
.normalize(true)
.language("en")
.build()
);
// Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"
Special Characters: Characters like @, ., - are translated to language-specific words.
For example, @ becomes “at” in English, “ät” in German, and “arobase” in French.
Model recommendation: use kugel-3 for the cleanest letter-by-letter pronunciation of spelled-out text.
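When the input text comes from users or an LLM, you may want to add the <spell> tags automatically before sending the request. The following is a minimal sketch that wraps anything resembling an email address. `SpellTagSketch` and `wrapEmails` are hypothetical names, the regular expression is a deliberately simple approximation rather than a full email grammar, and the helper does not check whether an address is already wrapped.

```java
import java.util.regex.Pattern;

final class SpellTagSketch {
    // Simplified email pattern: local part, "@", domain, and a TLD of 2+ letters.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Wraps every email-like match in <spell>...</spell>; other text is unchanged.
    static String wrapEmails(String text) {
        return EMAIL.matcher(text).replaceAll("<spell>$0</spell>");
    }

    public static void main(String[] args) {
        System.out.println(wrapEmails("Contact me at kajo@kugelaudio.com."));
        // prints: Contact me at <spell>kajo@kugelaudio.com</spell>.
    }
}
```

The result can be passed to `GenerateRequest.builder(...)` in place of hand-tagged text.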
Word Timestamps
Request word-level time alignments alongside audio for subtitle synchronization, lip-sync, or barge-in handling.
With Generate
import com.kugelaudio.sdk.WordTimestamp;
AudioResponse audio = client.tts().generate(
GenerateRequest.builder("Hello, how are you today?")
.modelId("kugel-3")
.language("en")
.wordTimestamps(true)
.build()
);
for (WordTimestamp ts : audio.getWordTimestamps()) {
System.out.printf("%s: %dms - %dms (score: %.2f)%n",
ts.getWord(), ts.getStartMs(), ts.getEndMs(), ts.getScore());
}
// Hello: 0ms - 320ms (score: 0.98)
// how: 350ms - 480ms (score: 0.95)
With Streaming
client.tts().stream(
GenerateRequest.builder("Hello, how are you today?")
.modelId("kugel-3")
.language("en")
.wordTimestamps(true)
.build(),
new StreamCallbacks() {
@Override
public void onChunk(AudioChunk chunk) {
// playAudio(...) is a placeholder for your own playback code
playAudio(chunk.getAudio());
}
@Override
public void onWordTimestamps(java.util.List<WordTimestamp> timestamps) {
for (WordTimestamp ts : timestamps) {
System.out.printf("%s: %dms-%dms%n",
ts.getWord(), ts.getStartMs(), ts.getEndMs());
}
}
}
);
Word timestamps add no extra audio latency. They arrive shortly after the
corresponding audio chunk — see Latency for typical numbers.
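For the subtitle use case, word timings can be turned into SRT cues with a few lines of formatting. This is a standalone sketch: the nested `Word` class is a hypothetical stand-in for the SDK's `WordTimestamp` (so the example runs without the SDK), and emitting one cue per word is a simplification, since real subtitles usually group several words.

```java
import java.util.List;
import java.util.Locale;

final class SrtSketch {
    // Hypothetical stand-in for WordTimestamp: a word with start/end times in ms.
    static final class Word {
        final String text;
        final long startMs;
        final long endMs;

        Word(String text, long startMs, long endMs) {
            this.text = text;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    // Formats milliseconds as the SRT timecode HH:MM:SS,mmm.
    static String timecode(long ms) {
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d",
            ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000);
    }

    // Emits one numbered SRT cue per word, each followed by a blank line.
    static String toSrt(List<Word> words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            Word w = words.get(i);
            sb.append(i + 1).append('\n')
              .append(timecode(w.startMs)).append(" --> ").append(timecode(w.endMs)).append('\n')
              .append(w.text).append("\n\n");
        }
        return sb.toString();
    }
}
```

With the SDK, you would build the cues from `ts.getWord()`, `ts.getStartMs()`, and `ts.getEndMs()` instead of the stand-in class.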
Models
List Available Models
import java.util.List;
import com.kugelaudio.sdk.Model;
List<Model> models = client.models().list();
for (Model model : models) {
System.out.println(model.getId() + ": " + model.getName());
System.out.println(" Description: " + model.getDescription());
System.out.println(" Parameters: " + model.getParameters());
System.out.println(" Max Input: " + model.getMaxInputLength() + " characters");
System.out.println(" Sample Rate: " + model.getSampleRate() + " Hz");
}
Error Handling
import com.kugelaudio.sdk.*;
try {
AudioResponse audio = client.tts().generate(
GenerateRequest.builder("Hello!").language("en").build()
);
} catch (AuthenticationException e) {
System.err.println("Invalid API key");
} catch (RateLimitException e) {
System.err.println("Rate limit exceeded, please wait");
} catch (InsufficientCreditsException e) {
System.err.println("Not enough credits, please top up");
} catch (ValidationException e) {
System.err.println("Invalid request: " + e.getMessage());
} catch (ConnectionException e) {
System.err.println("Failed to connect to server");
} catch (KugelAudioException e) {
System.err.println("API error: " + e.getMessage());
}
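Rate-limit errors are usually transient, so a common pattern is to retry with exponential backoff instead of failing immediately. The sketch below is standalone: `RetrySketch`, `withBackoff`, and the nested `RateLimited` exception are hypothetical stand-ins (the latter for the SDK's `RateLimitException`), and the delays are illustrative rather than values recommended by the API.

```java
import java.util.function.Supplier;

final class RetrySketch {
    // Hypothetical stand-in for the SDK's RateLimitException.
    static final class RateLimited extends RuntimeException {}

    // Runs the call, retrying on RateLimited with delays of base, 2x base, 4x base, ...
    // Rethrows the last RateLimited once maxAttempts calls have been made.
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long baseDelayMs) {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RateLimited e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(baseDelayMs << (attempt - 1));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
            }
        }
    }
}
```

To use this with the SDK, catch `RateLimitException` in place of the stand-in and wrap the `client.tts().generate(...)` call in the supplier. If the SDK's exceptions are checked, the functional interface would need adapting. Authentication and validation errors are not worth retrying, since the same request will fail again.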
Next: LLM Sessions — real-time TTS for LLM token streams, barge-in, and multi-context sessions.