Text to Speech

Provider Disclosure: VoidAI offers text-to-speech services powered by multiple providers, including OpenAI and ElevenLabs. The specific provider used depends on the model you select in your API call.

Learn how to turn text into lifelike spoken audio with VoidAI's Text-to-Speech API.

Overview

The Audio API exposes a speech endpoint backed by text-to-speech (TTS) models from our partner providers. It offers a range of natural-sounding voices and can be used to:

Create narration for content like articles, stories, or educational materials
Generate spoken audio in multiple languages for global applications
Provide real-time audio feedback in applications through streaming
Build accessible interfaces for users who prefer audio over text

Quickstart

The speech endpoint takes three key inputs: the model, the text to convert to audio, and the voice to use. A simple request looks like this:

from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

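# Save the audio next to this script
speech_file_path = Path(__file__).parent / "speech.mp3"

# with_streaming_response writes the audio to disk as chunks arrive;
# calling stream_to_file on a plain create() response is deprecated in the SDK
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!",
) as response:
    response.stream_to_file(speech_file_path)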

By default, the endpoint returns an MP3 file of the spoken audio. Set the response_format parameter to request another format.

Audio Quality Options

VoidAI offers two quality tiers for text-to-speech:

Standard Quality (tts-1): Optimized for real-time applications with lower latency
High Definition (tts-1-hd): Enhanced audio quality with more natural sound, ideal for production-ready content

The standard model provides faster responses, while the HD model delivers superior audio quality at the cost of slightly higher latency.

Voice Options

VoidAI supports a comprehensive range of voices to match your specific use case and audience preferences:

alloy: A versatile, neutral voice with a balanced tone
ash: A warm, mature voice with remarkable clarity
ballad: A smooth, melodic voice with gentle inflection
coral: A bright, friendly voice with a naturally upbeat sound
echo: A deeper, authoritative voice with excellent projection
fable: A soft, soothing voice with a comfortable pace
onyx: A deep, resonant voice with exceptional warmth
nova: A professional, clear voice with precise articulation
sage: A thoughtful, measured voice with a contemplative tone
shimmer: A light, energetic voice with an engaging delivery
verse: A lyrical, expressive voice with dynamic range

Each voice has unique characteristics that may be better suited for specific content. We recommend experimenting with different voices to find the one that best matches your desired tone and audience.
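
One way to compare voices is to render the same sentence once per voice and listen to the results side by side. The sketch below assumes every voice listed above is accepted by the chosen model and that `client` is configured as in the Quickstart. The helper names and the `sample_<voice>.mp3` naming scheme are illustrative, not part of the API.

```python
from pathlib import Path

# Voice names copied from the list above.
VOICES = [
    "alloy", "ash", "ballad", "coral", "echo", "fable",
    "onyx", "nova", "sage", "shimmer", "verse",
]


def sample_path(out_dir, voice, fmt="mp3"):
    """Return the file path for one voice sample (hypothetical naming scheme)."""
    return Path(out_dir) / f"sample_{voice}.{fmt}"


def render_voice_samples(client, text, out_dir="voice_samples", model="tts-1"):
    """Generate one audio file per voice so they can be compared by ear.

    `client` is an OpenAI client pointed at the VoidAI base URL.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for voice in VOICES:
        response = client.audio.speech.create(model=model, voice=voice, input=text)
        path = sample_path(out_dir, voice)
        path.write_bytes(response.content)  # full audio as bytes
        written.append(path)
    return written
```

Calling `render_voice_samples(client, "Welcome to our product tour.")` would make one request per voice, so a short test sentence keeps the comparison quick.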

Code Examples

Basic Text-to-Speech

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Welcome to VoidAI's Text-to-Speech service. This is an example of the Nova voice."
)

# response.content holds the complete audio as bytes
with open("welcome_message.mp3", "wb") as file:
    file.write(response.content)

Streaming Real-Time Audio

For applications that require immediate audio feedback, you can stream the audio as it's being generated:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

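# with_streaming_response delivers audio as it is generated instead of
# buffering the whole file in memory first
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="shimmer",
    input="This is a streaming test to demonstrate real-time audio generation. The audio begins playing before the entire file is generated.",
) as response:
    # Write each chunk to disk as it is received
    response.stream_to_file("streaming_output.mp3")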

Supported Output Formats

VoidAI's TTS API supports multiple audio formats to suit your specific needs:

MP3 (default): Balanced quality and file size for most applications
Opus: Optimized for internet streaming and communication with low latency
AAC: Widely supported format for digital audio compression
FLAC: Lossless audio compression for high-quality audio preservation
WAV: Uncompressed audio suitable for professional audio editing
PCM: Raw audio samples (24 kHz, 16-bit signed, little-endian)

Example of specifying an output format:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="coral",
    input="This is a FLAC audio file with lossless compression.",
    response_format="flac"
)

with open("high_quality_speech.flac", "wb") as file:
    file.write(response.content)
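
Raw PCM output has no container, so most players cannot open it directly. A minimal sketch of wrapping it in a WAV file with Python's standard `wave` module follows. The sample rate and sample width come from the format list above, while the mono channel count is an assumption, and `pcm_to_wav` is an illustrative helper name.

```python
import wave


def pcm_to_wav(pcm_bytes, wav_path, sample_rate=24000, channels=1):
    """Wrap raw 16-bit signed little-endian PCM in a WAV container.

    channels=1 (mono) is an assumption; the format list does not state it.
    """
    with wave.open(str(wav_path), "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)            # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)  # 24 kHz per the format list
        wav_file.writeframes(pcm_bytes)
```

With a response requested using `response_format="pcm"`, calling `pcm_to_wav(response.content, "speech.wav")` should produce a playable file if the channel assumption holds.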

Multi-Language Support

VoidAI's TTS system can generate natural-sounding speech in many languages. It is optimized for English but performs well across the following languages:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

To generate speech in a specific language, simply provide the input text in that language:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input="Bonjour! Comment allez-vous aujourd'hui? J'espère que vous passez une excellente journée."
)

with open("french_greeting.mp3", "wb") as file:
    file.write(response.content)

Advanced Usage Tips

Speech Pacing and Pauses

You can use punctuation and formatting to control the pacing of generated speech:

Use commas for short pauses
Use periods for longer pauses
Use line breaks for paragraph breaks
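Use SSML tags like <break time="1s"/> for precise timing only where the selected provider and model support them; test first, since unsupported tags may be ignored or read aloud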
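
The punctuation rules above can be applied in code when a script is assembled from separate sentences. The sketch below is one possible approach: `build_script` is an illustrative helper that ends every sentence with terminal punctuation and separates paragraphs with a blank line. How long each pause actually sounds depends on the model and voice.

```python
def build_script(paragraphs):
    """Join sentences into paragraphs and paragraphs into one input string.

    `paragraphs` is a list of lists of sentences. A sentence without
    terminal punctuation gets a period, and paragraphs are separated
    by a blank line.
    """
    rendered = []
    for sentences in paragraphs:
        cleaned = []
        for sentence in sentences:
            sentence = sentence.strip()
            if not sentence:
                continue  # skip empty entries
            if sentence[-1] not in ".!?":
                sentence += "."
            cleaned.append(sentence)
        if cleaned:
            rendered.append(" ".join(cleaned))
    return "\n\n".join(rendered)
```

The returned string can be passed as the `input` argument of a speech request.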

Pronunciation Control

For difficult words or names, consider these approaches:

Use phonetic spelling: "The patient's myocardial infarction (pronounced my-oh-KAR-dee-ul in-FARK-shun) required immediate attention."
Break words into syllables with hyphens: "Please contact Dr. Ng (pronounced En-Gee) for more information."
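
When the same difficult terms recur, the respellings can be applied automatically before the text is sent, so the text shown to users keeps the original spelling. This is a minimal sketch: the `RESPELLINGS` table and `apply_respellings` helper are illustrative, and the respellings themselves should be tuned by listening to the output.

```python
import re

# Illustrative respellings taken from the examples above.
RESPELLINGS = {
    "myocardial": "my-oh-KAR-dee-ul",
    "infarction": "in-FARK-shun",
    "Ng": "En-Gee",
}


def apply_respellings(text, respellings=RESPELLINGS):
    """Replace whole-word, case-sensitive matches with their respelling."""
    if not respellings:
        return text
    # Longest keys first so longer terms win over shorter ones.
    keys = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: respellings[match.group(1)], text)
```

Matching on word boundaries keeps the substitution from touching words that merely contain one of the terms.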

Optimizing for Production Use

For production applications:

Cache commonly used phrases rather than generating them repeatedly
Use the streaming API for real-time applications
Consider using the standard model for faster generation when latency is critical
Use the HD model for content that will be reused or requires higher quality
