Text to Speech
Provider Disclosure: VoidAI offers text-to-speech services powered by multiple providers, including OpenAI and ElevenLabs. The specific provider used depends on the model you select in your API call.
Learn how to turn text into lifelike spoken audio using VoidAI's Text-to-Speech API, which leverages advanced TTS technology from various providers.
Overview
The Audio API provides a speech endpoint based on TTS (text-to-speech) technology from our partners. It supports a variety of natural-sounding voices and can be used to:
- Create narration for content like articles, stories, or educational materials
- Generate spoken audio in multiple languages for global applications
- Provide real-time audio feedback in applications through streaming
- Build accessible interfaces for users who prefer audio over text
Quickstart
The speech endpoint takes three key inputs: the model, the text to convert to audio, and the voice to use. A simple request looks like this:
from pathlib import Path
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!"
)
response.stream_to_file(speech_file_path)
By default, the endpoint returns an MP3 file of the spoken audio, but it can be configured to output other formats.
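If you prefer not to depend on the SDK, the same three inputs can be sent as a plain JSON POST. The sketch below only builds the request and does not send it. It assumes VoidAI exposes the OpenAI-compatible `/audio/speech` route under the base URL used above and accepts a Bearer token; `build_speech_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: the OpenAI-compatible route lives under the documented base URL.
BASE_URL = "https://api.voidai.app/v1"


def build_speech_request(api_key: str, model: str, voice: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech endpoint."""
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` and reading the response body should yield the audio bytes, provided the assumptions above hold.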
Audio Quality Options
VoidAI offers two quality tiers for text-to-speech:
- Standard Quality (tts-1): Optimized for real-time applications with lower latency
- High Definition (tts-1-hd): Enhanced audio quality with more natural sound, ideal for production-ready content
The standard model provides faster responses, while the HD model delivers superior audio quality at the cost of slightly higher latency.
Voice Options
VoidAI supports a comprehensive range of voices to match your specific use case and audience preferences:
- alloy: A versatile, neutral voice with a balanced tone
- ash: A warm, mature voice with remarkable clarity
- ballad: A smooth, melodic voice with gentle inflection
- coral: A bright, friendly voice with a naturally upbeat sound
- echo: A deeper, authoritative voice with excellent projection
- fable: A soft, soothing voice with a comfortable pace
- onyx: A deep, resonant voice with exceptional warmth
- nova: A professional, clear voice with precise articulation
- sage: A thoughtful, measured voice with a contemplative tone
- shimmer: A light, energetic voice with an engaging delivery
- verse: A lyrical, expressive voice with dynamic range
Each voice has unique characteristics that may be better suited for specific content. We recommend experimenting with different voices to find the one that best matches your desired tone and audience.
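One way to compare voices is to render the same sentence once per voice. The sketch below only prepares the request arguments and output filenames; `audition_plan` and the `sample_<voice>.mp3` naming are illustrative choices, not part of the API.

```python
from typing import Dict, List

# Voice names as listed in this guide.
VOICES: List[str] = [
    "alloy", "ash", "ballad", "coral", "echo", "fable",
    "onyx", "nova", "sage", "shimmer", "verse",
]


def audition_plan(text: str, model: str = "tts-1") -> Dict[str, dict]:
    """Map an output filename to the keyword arguments for one request per voice."""
    return {
        f"sample_{voice}.mp3": {"model": model, "voice": voice, "input": text}
        for voice in VOICES
    }
```

Each value can be unpacked into `client.audio.speech.create(**kwargs)`, with the returned audio written to the matching filename.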
Code Examples
Basic Text-to-Speech
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Welcome to VoidAI's Text-to-Speech service. This is an example of the Nova voice."
)
with open("welcome_message.mp3", "wb") as file:
    file.write(response.content)
Streaming Real-Time Audio
For applications that require immediate audio feedback, you can stream the audio as it's being generated:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
# Open the response as a stream so bytes are written to the file as they arrive
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="shimmer",
    input="This is a streaming test to demonstrate real-time audio generation. The audio begins playing before the entire file is generated.",
) as response:
    response.stream_to_file("streaming_output.mp3")
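When the audio should go somewhere other than a file on disk (a socket, a player buffer), the chunks can be forwarded by hand. The helper below is a minimal sketch that accepts any iterable of byte chunks; with the OpenAI SDK these would typically come from `response.iter_bytes()` inside a `with_streaming_response` block. `write_chunks` is a hypothetical name.

```python
from typing import BinaryIO, Iterable


def write_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write audio chunks to sink as they arrive and return the total bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:  # ignore empty chunks
            continue
        sink.write(chunk)
        total += len(chunk)
    return total
```

Because the sink only needs a `write` method, the same function works with an open file, an `io.BytesIO` buffer, or a pipe to an audio player.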
Supported Output Formats
VoidAI's TTS API supports multiple audio formats to suit your specific needs:
- MP3 (default): Balanced quality and file size for most applications
- Opus: Optimized for internet streaming and communication with low latency
- AAC: Widely supported format for digital audio compression
- FLAC: Lossless audio compression for high-quality audio preservation
- WAV: Uncompressed audio suitable for professional audio editing
- PCM: Raw audio samples (24kHz, 16-bit signed, little-endian)
Example of specifying an output format:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="coral",
    input="This is a FLAC audio file with lossless compression.",
    response_format="flac"
)
with open("high_quality_speech.flac", "wb") as file:
    file.write(response.content)
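The PCM format listed above is headerless, so most players will not open it directly. A minimal sketch for wrapping the raw samples in a WAV container with the standard library follows. The 24 kHz rate and 16-bit width come from the format list; single-channel (mono) output is an assumption, and `pcm_to_wav` is a hypothetical helper.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit signed little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)     # assumption: mono output
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)  # 24 kHz per the format list
        wav.writeframes(pcm)
    return buffer.getvalue()
```

The result can be written to a `.wav` file or handed to any library that expects WAV bytes.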
Multi-Language Support
VoidAI's TTS system can generate natural-sounding speech in a wide range of languages. It is optimized for English but performs well in the following languages:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
To generate speech in a specific language, simply provide the input text in that language:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input="Bonjour! Comment allez-vous aujourd'hui? J'espère que vous passez une excellente journée."
)
with open("french_greeting.mp3", "wb") as file:
    file.write(response.content)
Advanced Usage Tips
Speech Pacing and Pauses
You can use punctuation and formatting to control the pacing of generated speech:
- Use commas for short pauses
- Use periods for longer pauses
- Use line breaks for paragraph breaks
- Use SSML tags like <break time="1s"/> for precise timing (if supported)
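The punctuation rules above can be applied automatically before text is sent. The sketch below collapses stray whitespace, makes sure each paragraph ends with terminal punctuation, and separates paragraphs with line breaks. `paragraphs_to_input` is a hypothetical helper, and how strongly a given model responds to these cues may vary.

```python
from typing import Iterable


def paragraphs_to_input(paragraphs: Iterable[str]) -> str:
    """Join paragraphs with blank lines, ending each with terminal punctuation."""
    cleaned = []
    for paragraph in paragraphs:
        text = " ".join(paragraph.split())  # collapse stray whitespace
        if not text:
            continue  # drop empty paragraphs
        if text[-1] not in ".!?":
            text += "."  # end with a period so the paragraph closes on a longer pause
        cleaned.append(text)
    return "\n\n".join(cleaned)
```

The returned string can be passed directly as the `input` argument of a speech request.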
Pronunciation Control
For difficult words or names, consider these approaches:
- Use phonetic spelling: "The patient's myocardial infarction (pronounced my-oh-KAR-dee-ul in-FARK-shun) required immediate attention."
- Break words into syllables with hyphens: "Please contact Dr. Ng (pronounced En-Gee) for more information."
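For names that recur across many requests, a small respelling table keeps pronunciation consistent. This sketch is a variation on the approaches above: it swaps the word for its phonetic spelling instead of adding a parenthetical. `apply_pronunciations` and the example table are hypothetical, and matching is case-sensitive.

```python
import re
from typing import Dict


def apply_pronunciations(text: str, respellings: Dict[str, str]) -> str:
    """Replace whole-word matches with their phonetic respellings."""
    if not respellings:
        return text
    # Longest keys first so longer entries win over shorter ones they contain.
    keys = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: respellings[m.group(1)], text)
```

For example, `apply_pronunciations("Please contact Dr. Ng.", {"Ng": "En-Gee"})` rewrites the name while leaving the rest of the sentence untouched.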
Optimizing for Production Use
For production applications:
- Cache commonly used phrases rather than generating them repeatedly
- Use the streaming API for real-time applications
- Consider using the standard model for faster generation when latency is critical
- Use the HD model for content that will be reused or requires higher quality
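The caching tip can be sketched as a small on-disk cache keyed by model, voice, and text. The `synthesize` callable stands in for whatever function actually calls the API; it, `cached_speech`, and the `.mp3` cache filename are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_speech(
    text: str,
    voice: str,
    model: str,
    synthesize: Callable[[str, str, str], bytes],
    cache_dir: Path,
) -> bytes:
    """Return audio for (model, voice, text), calling synthesize only on a cache miss."""
    key = hashlib.sha256(f"{model}\n{voice}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.mp3"
    if path.exists():
        return path.read_bytes()  # cache hit: no API call
    audio = synthesize(text, voice, model)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

A `synthesize` implementation could wrap `client.audio.speech.create(...)` and return `response.content`. If you vary the output format, include it in the cache key as well.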