Text-to-Speech

Basic Generation

Generate complete audio and receive it all at once:
import com.kugelaudio.sdk.GenerateRequest;
import com.kugelaudio.sdk.AudioResponse;

AudioResponse audio = client.tts().generate(
    GenerateRequest.builder("Hello, this is a test of the KugelAudio text-to-speech system.")
        .modelId("kugel-3")        // Canonical production model (see /models)
        .voiceId(1071)             // Optional: specific voice ID
        .cfgScale(2.0)             // Guidance scale (1.0-5.0)
        .temperature(0.5)          // Sampling variance 0.0-1.0 (omit for server default)
        .maxNewTokens(2048)
        .sampleRate(24000)
        .normalize(true)           // Enable text normalization (default)
        .language("en")            // Language for normalization (see below)
        .wordTimestamps(false)
        .speed(1.0)                // Playback speed 0.8-1.2 (pitch-preserving WSOLA)
        // Optional: per-request dictionary selection. Omit = all active
        // dictionaries; List.of() = none; a list = exactly those (incl.
        // inactive), bypassing the language filter.
        // .dictionaryIds(List.of(7, 9))
        .build()
);

// Audio properties
System.out.printf("Duration: %.2fms%n", audio.getDurationMs());
System.out.println("Samples: " + audio.getTotalSamples());
System.out.println("Sample rate: " + audio.getSampleRate() + " Hz");
System.out.printf("Generation time: %.0fms%n", audio.getGenerationMs());
System.out.printf("RTF: %.2f%n", audio.getRtf());

// Save to WAV file
audio.saveWav(java.nio.file.Path.of("output.wav"));

// Get raw PCM16 bytes (signed 16-bit little-endian, mono)
byte[] pcmData = audio.getAudio();

// Get WAV bytes with header (in-memory)
byte[] wavBytes = audio.toWavBytes();

// Get normalised float samples [-1.0, 1.0]
float[] floatData = audio.toFloat32();
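
To make the PCM16 layout above concrete, here is a minimal sketch of how raw signed 16-bit little-endian mono bytes map to float samples and to a duration. The `Pcm16` class and its methods are hypothetical helpers, not part of the SDK, and the divide-by-32768 scaling is an assumption about how `toFloat32()` normalises samples.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the KugelAudio SDK.
final class Pcm16 {
    /** Decodes signed 16-bit little-endian mono PCM into floats in [-1.0, 1.0). */
    static float[] toFloat(byte[] pcm) {
        if (pcm.length % 2 != 0) {
            throw new IllegalArgumentException("PCM16 data must have an even byte count");
        }
        ByteBuffer buf = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[pcm.length / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getShort() / 32768f; // assumed scaling factor
        }
        return out;
    }

    /** Duration in milliseconds of mono PCM16 audio at the given sample rate. */
    static double durationMs(byte[] pcm, int sampleRate) {
        return (pcm.length / 2) * 1000.0 / sampleRate;
    }

    public static void main(String[] args) {
        // Three samples: 0, 32767 (0x7FFF), -32768 (0x8000), each stored low byte first.
        byte[] pcm = {0x00, 0x00, (byte) 0xFF, 0x7F, 0x00, (byte) 0x80};
        for (float sample : toFloat(pcm)) {
            System.out.println(sample);
        }
        // 48000 bytes = 24000 samples, which is one second at 24 kHz.
        System.out.println(durationMs(new byte[48000], 24000) + " ms");
    }
}
```

With the response from the example above, `Pcm16.toFloat(audio.getAudio())` would be expected to give values close to `audio.toFloat32()`, if the scaling assumption holds.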

Streaming Audio

Receive audio chunks as they are generated for lower latency:
import com.kugelaudio.sdk.GenerateRequest;
import com.kugelaudio.sdk.StreamCallbacks;
import com.kugelaudio.sdk.AudioChunk;

client.tts().stream(
    GenerateRequest.builder("Hello, this is streaming audio.")
        .modelId("kugel-3")
        .language("en")
        .build(),
    new StreamCallbacks() {
        @Override
        public void onChunk(AudioChunk chunk) {
            System.out.printf("Chunk %d: %d bytes, %d samples%n",
                chunk.getIndex(), chunk.getAudio().length, chunk.getSamples());
            // playAudio(chunk.getAudio());
        }

        @Override
        public void onComplete(AudioResponse response) {
            System.out.printf("Total duration: %.0fms%n", response.getDurationMs());
            System.out.printf("Generation time: %.0fms%n", response.getGenerationMs());
        }

        @Override
        public void onError(com.kugelaudio.sdk.KugelAudioException error) {
            System.err.println("TTS error: " + error.getMessage());
        }
    }
);
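
If the full audio is still needed after streaming, the chunks can be collected as they arrive. The sketch below is one possible approach; `ChunkCollector` is a hypothetical helper, not an SDK class, and it assumes each chunk carries mono PCM16 bytes as described for `getAudio()`.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of the KugelAudio SDK.
final class ChunkCollector {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final int sampleRate;
    private int chunks;

    ChunkCollector(int sampleRate) {
        this.sampleRate = sampleRate;
    }

    /** Appends one chunk of mono PCM16 bytes; synchronized in case callbacks run on another thread. */
    synchronized void add(byte[] pcm16) {
        buffer.write(pcm16, 0, pcm16.length);
        chunks++;
    }

    synchronized int chunkCount() {
        return chunks;
    }

    /** Milliseconds of audio buffered so far (two bytes per sample, one channel). */
    synchronized double bufferedMs() {
        return (buffer.size() / 2) * 1000.0 / sampleRate;
    }

    /** Returns a copy of all PCM16 bytes received so far. */
    synchronized byte[] toPcm() {
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        ChunkCollector collector = new ChunkCollector(24000);
        collector.add(new byte[2400]); // stand-in for chunk.getAudio()
        collector.add(new byte[4800]);
        System.out.println(collector.chunkCount() + " chunks, " + collector.bufferedMs() + " ms");
    }
}
```

In the streaming example above, `onChunk` could call `collector.add(chunk.getAudio())` and `onComplete` could read `collector.toPcm()`.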

Text Normalization

Text normalization converts numbers, dates, times, and other non-verbal text into spoken words:
  • “I have 3 apples” → “I have three apples”
  • “The meeting is at 2:30 PM” → “The meeting is at two thirty PM”
  • “€50.99” → “fifty euros and ninety-nine cents”
// With explicit language (recommended - fastest)
AudioResponse audio = client.tts().generate(
    GenerateRequest.builder("I bought 3 items for €50.99 on 01/15/2024.")
        .normalize(true)
        .language("en")  // Specify language for best performance
        .build()
);

// With auto-detection (may cause incorrect normalizations)
AudioResponse autoDetected = client.tts().generate(
    GenerateRequest.builder("Ich habe 3 Artikel für 50,99€ gekauft.")
        .normalize(true)
        // language not set - will auto-detect
        .build()
);

Supported Languages

Code  Language     Code  Language
de    German       nl    Dutch
en    English      pl    Polish
fr    French       sv    Swedish
es    Spanish      da    Danish
it    Italian      no    Norwegian
pt    Portuguese   fi    Finnish
cs    Czech        hu    Hungarian
ro    Romanian     el    Greek
uk    Ukrainian    bg    Bulgarian
tr    Turkish      vi    Vietnamese
ar    Arabic       hi    Hindi
zh    Chinese      ja    Japanese
ko    Korean

Using .normalize(true) without .language(...) may cause incorrect normalizations, especially for short texts or languages that share similar vocabulary. Always specify language when you know it.
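
Because a missing or unsupported language code falls back to less reliable behavior, it can help to check the code on the client before building a request. The following is a sketch only: `NormalizationLanguages` is a hypothetical helper, the set is copied from the table above, and stripping a region suffix such as `-US` is an assumption about what callers might pass, not documented SDK behavior.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the KugelAudio SDK.
final class NormalizationLanguages {
    /** Language codes copied from the Supported Languages table. */
    static final Set<String> SUPPORTED = Set.of(
        "de", "nl", "en", "pl", "fr", "sv", "es", "da", "it", "no",
        "pt", "fi", "cs", "hu", "ro", "el", "uk", "bg", "tr", "vi",
        "ar", "hi", "zh", "ja", "ko");

    /** Reduces tags such as "en-US" or "pt_BR" to their lower-cased primary subtag. */
    static String primarySubtag(String tag) {
        return tag.trim().toLowerCase(Locale.ROOT).split("[-_]", 2)[0];
    }

    /** True if the tag's primary subtag appears in the supported set. */
    static boolean isSupported(String tag) {
        return SUPPORTED.contains(primarySubtag(tag));
    }

    public static void main(String[] args) {
        System.out.println("en-US supported: " + isSupported("en-US"));
        System.out.println("xx supported: " + isSupported("xx"));
    }
}
```

A caller might pass `NormalizationLanguages.primarySubtag(userLocale)` to `.language(...)` only when `isSupported` returns true, and otherwise decide explicitly whether to rely on auto-detection.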

Spell Tags

Use <spell> tags to spell out text letter by letter — useful for email addresses, codes, and acronyms:
// Spell out an email address
AudioResponse audio = client.tts().generate(
    GenerateRequest.builder("Contact me at <spell>kajo@kugelaudio.com</spell>")
        .normalize(true)
        .language("en")
        .build()
);
// Output: "Contact me at K, A, J, O, at, K, U, G, E, L, A, U, D, I, O, dot, C, O, M"
Special characters: characters such as @, ., and - are translated to language-specific words. For example, @ becomes “at” in English, “ät” in German, and “arobase” in French.
Model recommendation: use kugel-3 for the cleanest letter-by-letter pronunciation of spelled-out text.
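
To preview roughly what a spelled-out string will sound like, the expansion can be approximated locally. This is an illustrative sketch, not the service's implementation: `SpellPreview` is a hypothetical helper, it handles English only, and it maps just `@` and `.` (the two symbols whose English words appear in the example above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the KugelAudio SDK; the real expansion happens server-side.
final class SpellPreview {
    /** Wraps text in spell tags for use inside a request string. */
    static String wrap(String text) {
        return "<spell>" + text + "</spell>";
    }

    /** Approximates the English letter-by-letter reading of the given text. */
    static String spellOutEnglish(String text) {
        List<String> parts = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (c == '@') {
                parts.add("at");
            } else if (c == '.') {
                parts.add("dot");
            } else {
                // Other characters are passed through upper-cased (assumption for this sketch).
                parts.add(String.valueOf(c).toUpperCase(Locale.ROOT));
            }
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        System.out.println("Contact me at " + wrap("kajo@kugelaudio.com"));
        System.out.println(spellOutEnglish("kajo@kugelaudio.com"));
    }
}
```

For the address in the example above, `spellOutEnglish` reproduces the documented output string; other symbols and other languages would need their own mappings.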

Word Timestamps

Request word-level time alignments alongside audio for subtitle synchronization, lip-sync, or barge-in handling.

With Generate

import com.kugelaudio.sdk.WordTimestamp;

AudioResponse audio = client.tts().generate(
    GenerateRequest.builder("Hello, how are you today?")
        .modelId("kugel-3")
        .language("en")
        .wordTimestamps(true)
        .build()
);

for (WordTimestamp ts : audio.getWordTimestamps()) {
    System.out.printf("%s: %dms - %dms (score: %.2f)%n",
        ts.getWord(), ts.getStartMs(), ts.getEndMs(), ts.getScore());
}
// Hello: 0ms - 320ms (score: 0.98)
// how: 350ms - 480ms (score: 0.95)

With Streaming

client.tts().stream(
    GenerateRequest.builder("Hello, how are you today?")
        .modelId("kugel-3")
        .language("en")
        .wordTimestamps(true)
        .build(),
    new StreamCallbacks() {
        @Override
        public void onChunk(AudioChunk chunk) {
            playAudio(chunk.getAudio());  // playAudio is your own playback routine
        }

        @Override
        public void onWordTimestamps(java.util.List<WordTimestamp> timestamps) {
            for (WordTimestamp ts : timestamps) {
                System.out.printf("%s: %dms-%dms%n",
                    ts.getWord(), ts.getStartMs(), ts.getEndMs());
            }
        }
    }
);
Word timestamps add no extra audio latency. They arrive shortly after the corresponding audio chunk — see Latency for typical numbers.
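
For barge-in handling, streamed timestamps make it possible to work out which words the listener had already heard when playback was interrupted. A minimal sketch follows, assuming timestamps arrive in order and are measured from the start of the audio; `SpokenTracker` is a hypothetical helper, not an SDK class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the KugelAudio SDK.
final class SpokenTracker {
    private final List<String> words = new ArrayList<>();
    private final List<Long> endTimes = new ArrayList<>();

    /** Records one word and the time (ms from audio start) at which it finishes. */
    synchronized void add(String word, long endMs) {
        words.add(word);
        endTimes.add(endMs);
    }

    /** Returns the words that had fully played by the given playback position. */
    synchronized String spokenUpTo(long playbackMs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            if (endTimes.get(i) > playbackMs) {
                break; // assumes timestamps were added in chronological order
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(words.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SpokenTracker tracker = new SpokenTracker();
        tracker.add("Hello", 320);
        tracker.add("how", 480);
        tracker.add("are", 600);
        System.out.println(tracker.spokenUpTo(500));
    }
}
```

In the streaming example above, `onWordTimestamps` could call `tracker.add(ts.getWord(), ts.getEndMs())` for each entry, and the playback layer could query `spokenUpTo` when the user interrupts.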

Models

List Available Models

import java.util.List;

import com.kugelaudio.sdk.Model;

List<Model> models = client.models().list();

for (Model model : models) {
    System.out.println(model.getId() + ": " + model.getName());
    System.out.println("  Description: " + model.getDescription());
    System.out.println("  Parameters: " + model.getParameters());
    System.out.println("  Max Input: " + model.getMaxInputLength() + " characters");
    System.out.println("  Sample Rate: " + model.getSampleRate() + " Hz");
}

Error Handling

import com.kugelaudio.sdk.*;

try {
    AudioResponse audio = client.tts().generate(
        GenerateRequest.builder("Hello!").language("en").build()
    );
} catch (AuthenticationException e) {
    System.err.println("Invalid API key");
} catch (RateLimitException e) {
    System.err.println("Rate limit exceeded, please wait");
} catch (InsufficientCreditsException e) {
    System.err.println("Not enough credits, please top up");
} catch (ValidationException e) {
    System.err.println("Invalid request: " + e.getMessage());
} catch (ConnectionException e) {
    System.err.println("Failed to connect to server");
} catch (KugelAudioException e) {
    System.err.println("API error: " + e.getMessage());
}
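
Rate-limit and connection errors are often transient, so a caller may want to retry with a growing delay rather than fail at once. The sketch below shows one generic way to do that. `Retry` and `TransientException` are hypothetical names introduced for illustration, the doubling delay is a design choice rather than anything the service prescribes, and real code would catch the SDK's own exception types instead.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the KugelAudio SDK.
final class Retry {
    /** Stand-in for a retryable failure such as a rate limit or a dropped connection. */
    static final class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /** Runs the call, retrying on TransientException with a delay that doubles each time. */
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long baseDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = baseDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (TransientException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
                delay *= 2;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new TransientException("busy");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Errors such as invalid credentials, insufficient credits, or validation failures are not worth retrying, which is why only one exception type is caught here.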

Next: LLM Sessions — real-time TTS for LLM token streams, barge-in, and multi-context sessions.