Text to Speech

Provider Disclosure: VoidAI offers text-to-speech services powered by multiple providers, including OpenAI and ElevenLabs. The specific provider used depends on the model you select in your API call.

Learn how to turn text into lifelike spoken audio with VoidAI's Text-to-Speech API.

Overview

The Audio API provides a speech endpoint backed by text-to-speech (TTS) models from our partner providers. It supports a variety of natural-sounding voices and can be used to:

  • Create narration for content like articles, stories, or educational materials
  • Generate spoken audio in multiple languages for global applications
  • Provide real-time audio feedback in applications through streaming
  • Build accessible interfaces for users who prefer audio over text

Quickstart

The speech endpoint takes three key inputs: the model, the text to convert to audio, and the voice to use. A simple request looks like this:

from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!",
)

response.stream_to_file(speech_file_path)

By default, the endpoint returns the spoken audio as MP3. Set the response_format parameter to request another format (see Supported Output Formats).

Audio Quality Options

VoidAI offers two quality tiers for text-to-speech:

  • Standard Quality (tts-1): Optimized for real-time applications with lower latency
  • High Definition (tts-1-hd): Enhanced audio quality with more natural sound, ideal for production-ready content

The standard model provides faster responses, while the HD model delivers superior audio quality at the cost of slightly higher latency.

Voice Options

VoidAI supports a comprehensive range of voices to match your specific use case and audience preferences:

  • alloy: A versatile, neutral voice with a balanced tone
  • ash: A warm, mature voice with remarkable clarity
  • ballad: A smooth, melodic voice with gentle inflection
  • coral: A bright, friendly voice with a naturally upbeat sound
  • echo: A deeper, authoritative voice with excellent projection
  • fable: A soft, soothing voice with a comfortable pace
  • onyx: A deep, resonant voice with exceptional warmth
  • nova: A professional, clear voice with precise articulation
  • sage: A thoughtful, measured voice with a contemplative tone
  • shimmer: A light, energetic voice with an engaging delivery
  • verse: A lyrical, expressive voice with dynamic range

Each voice has unique characteristics that may be better suited for specific content. We recommend experimenting with different voices to find the one that best matches your desired tone and audience.
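
One way to run that comparison is to render the same sentence with every voice and listen to the results side by side. The sketch below is a minimal illustration, not an official utility: audition_voices and its synthesize parameter are hypothetical names, and synthesize stands for any callable that takes a voice and a text and returns audio bytes.

```python
from pathlib import Path
from typing import Callable

# Voice names as listed in this guide.
VOICES = ["alloy", "ash", "ballad", "coral", "echo", "fable",
          "onyx", "nova", "sage", "shimmer", "verse"]


def audition_voices(
    synthesize: Callable[[str, str], bytes],  # hypothetical: (voice, text) -> audio bytes
    text: str,
    out_dir: Path,
) -> list[Path]:
    """Render `text` once per voice and return the written file paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for voice in VOICES:
        path = out_dir / f"sample_{voice}.mp3"
        path.write_bytes(synthesize(voice, text))
        paths.append(path)
    return paths
```

With the client from the Quickstart, synthesize could be something like lambda voice, text: client.audio.speech.create(model="tts-1", voice=voice, input=text).content. Keeping the network call outside the helper makes it easy to try the loop with a stub first.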

Code Examples

Basic Text-to-Speech

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="Welcome to VoidAI's Text-to-Speech service. This is an example of the Nova voice.",
)

# Write the returned audio bytes to disk
with open("welcome_message.mp3", "wb") as file:
    file.write(response.content)

Streaming Real-Time Audio

For applications that require immediate audio feedback, you can stream the audio as it's being generated:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

# with_streaming_response defers reading the body, so chunks are
# written to the file as they arrive instead of after the full download
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="shimmer",
    input="This is a streaming test to demonstrate real-time audio generation. The audio begins playing before the entire file is generated.",
) as response:
    response.stream_to_file("streaming_output.mp3")

Supported Output Formats

VoidAI's TTS API supports multiple audio formats to suit your specific needs:

  • MP3 (default): Balanced quality and file size for most applications
  • Opus: Optimized for internet streaming and communication with low latency
  • AAC: Widely supported format for digital audio compression
  • FLAC: Lossless audio compression for high-quality audio preservation
  • WAV: Uncompressed audio suitable for professional audio editing
  • PCM: Raw audio samples (24kHz, 16-bit signed, little-endian)

Example of specifying an output format:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="coral",
    input="This is a FLAC audio file with lossless compression.",
    response_format="flac",
)

# Write the returned FLAC bytes to disk
with open("high_quality_speech.flac", "wb") as file:
    file.write(response.content)
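
Because PCM output is headerless, most players cannot open it directly. The sketch below wraps raw samples in a WAV container using Python's standard wave module. The sample rate and bit depth come from the format list above. The channel count is an assumption (mono), and pcm_to_wav is a hypothetical helper name, so verify both against the audio you actually receive.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000   # 24kHz, per the format list
SAMPLE_WIDTH_BYTES = 2    # 16-bit signed samples
CHANNELS = 1              # assumption: mono output


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Under these assumptions, one second of audio is 48,000 bytes of PCM (24,000 samples at 2 bytes each), which is a quick sanity check on the length of a response requested with response_format="pcm".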

Multi-Language Support

VoidAI's TTS system generates natural-sounding speech in a wide range of languages. It is optimized for English but performs well across many others, including:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

To generate speech in a specific language, simply provide the input text in that language:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input="Bonjour! Comment allez-vous aujourd'hui? J'espère que vous passez une excellente journée.",
)

# Write the returned audio bytes to disk
with open("french_greeting.mp3", "wb") as file:
    file.write(response.content)

Advanced Usage Tips

Speech Pacing and Pauses

You can use punctuation and formatting to control the pacing of generated speech:

  • Use commas for short pauses
  • Use periods for longer pauses
  • Use line breaks for paragraph breaks
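  • Use SSML tags like <break time="1s"/> for precise timing only if the selected model supports SSML; otherwise the tag may be ignored or read aloud, so test before relying on it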

Pronunciation Control

For difficult words or names, consider these approaches:

  • Use phonetic spelling: "The patient's myocardial infarction (pronounced my-oh-KAR-dee-ul in-FARK-shun) required immediate attention."
  • Break words into syllables with hyphens: "Please contact Dr. Ng (pronounced En-Gee) for more information."
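
If the same difficult terms recur across many requests, the respellings can be applied automatically before the text is sent. The sketch below is a variation on the approaches above: it swaps the respelling in rather than adding it in parentheses. LEXICON and apply_pronunciations are hypothetical names, and the entries simply reuse the examples from this section.

```python
import re

# Hypothetical lexicon: written form -> phonetic respelling.
LEXICON = {
    "myocardial infarction": "my-oh-KAR-dee-ul in-FARK-shun",
    "Ng": "En-Gee",
}


def apply_pronunciations(text: str, lexicon: dict[str, str]) -> str:
    """Replace whole-word matches of each lexicon key with its respelling."""
    if not lexicon:
        return text
    # Longest keys first so multi-word entries win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    # A single pass avoids re-replacing text that an earlier substitution produced.
    return pattern.sub(lambda match: lexicon[match.group(1)], text)
```

The result of apply_pronunciations(text, LEXICON) is what would be passed as input to the speech endpoint. The word-boundary match leaves longer words such as "Nguyen" untouched.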

Optimizing for Production Use

For production applications:

  1. Cache commonly used phrases rather than generating them repeatedly
  2. Use the streaming API for real-time applications
  3. Consider using the standard model for faster generation when latency is critical
  4. Use the HD model for content that will be reused or requires higher quality
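
The first tip can be sketched as a small on-disk cache keyed by model, voice, and text. This is one possible design under stated assumptions, not a built-in feature: cached_speech and its synthesize parameter are hypothetical names, and synthesize stands for any callable that takes a model, voice, and text and returns MP3 bytes.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_speech(
    synthesize: Callable[[str, str, str], bytes],  # hypothetical: (model, voice, text) -> bytes
    cache_dir: Path,
    model: str,
    voice: str,
    text: str,
) -> bytes:
    """Return audio for (model, voice, text), calling `synthesize` only on a cache miss."""
    # JSON-encode the triple so different inputs cannot collide on one key.
    raw_key = json.dumps([model, voice, text]).encode("utf-8")
    path = cache_dir / f"{hashlib.sha256(raw_key).hexdigest()}.mp3"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(model, voice, text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Including the model and voice in the key means that switching either one produces a fresh file instead of returning stale audio. With the client from the Quickstart, synthesize could be something like lambda model, voice, text: client.audio.speech.create(model=model, voice=voice, input=text).content.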