Speech to Text
Provider Disclosure: VoidAI offers speech-to-text services powered by multiple providers, primarily OpenAI. The specific provider used depends on the model you select in your API call.
Convert audio recordings into accurate text transcriptions with VoidAI's Speech-to-Text API, which leverages powerful technology from our provider partners.
Overview
VoidAI's Audio API provides two primary speech recognition endpoints powered by advanced technology:
- Transcriptions: Convert speech to text in the original language
- Translations: Convert speech to English text, regardless of the source language
Available Models
We offer a range of models with different capabilities:
Model | Description | Use Case |
---|---|---|
whisper-1 | Versatile baseline model | General transcription and translation with full parameter support |
gpt-4o-mini-transcribe | Improved accuracy model | Higher quality transcriptions with faster processing |
gpt-4o-transcribe | Premium accuracy model | Highest quality transcriptions for professional use |
All models support files up to 25MB in these formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
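Since requests with oversized or unsupported files will fail, it can be worth validating audio before uploading. The snippet below is a minimal sketch based on the limits listed above; the helper name and error handling are illustrative, not part of the SDK:

import os

# Values taken from the limits listed above
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # 25MB

def validate_audio_file(path: str) -> None:
    # Illustrative pre-flight check; raises before any API call is made
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {ext}")
    if os.path.getsize(path) > MAX_FILE_BYTES:
        raise ValueError("File exceeds the 25MB limit; split it first (see Processing Long Recordings below)")

validate_audio_file("recording.mp3")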
Getting Started
Basic Transcription
To transcribe audio in its original language:
from openai import OpenAI
# Initialize the client
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
# Open the audio file
audio_file = open("recording.mp3", "rb")
# Create the transcription
transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file
)
# Print the result
print(transcription.text)
Response Formats
By default, the API returns JSON responses. For whisper-1, you can request various formats:
Format | Description | Use Case |
---|---|---|
json | Simple JSON with text | Default format for all models |
text | Plain text response | Simple integration scenarios |
srt | SubRip subtitle format | Video captioning |
vtt | WebVTT subtitle format | Web video captioning |
verbose_json | Detailed JSON with metadata | Advanced applications needing metadata |
For gpt-4o models, only json and text formats are currently supported.
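Because format support varies by model, you may want to guard against unsupported combinations before calling the API. This is a small sketch derived from the tables above; it is not an official client feature:

# Supported response formats per model family, per the tables above
WHISPER_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}
GPT_4O_FORMATS = {"json", "text"}

def check_response_format(model: str, response_format: str) -> None:
    supported = WHISPER_FORMATS if model == "whisper-1" else GPT_4O_FORMATS
    if response_format not in supported:
        raise ValueError(f"{model} does not support response_format='{response_format}'")

check_response_format("whisper-1", "srt")            # fine
# check_response_format("gpt-4o-transcribe", "srt")  # would raise ValueError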
Example with custom format:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("podcast.mp3", "rb")
# Request SRT format for video subtitles
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)
# Save directly to a subtitle file
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(transcription.text)
Translation to English
To translate foreign-language audio directly to English text:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("spanish_interview.mp3", "rb")
# Translate to English
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)
print(translation.text)
Advanced Features
Word-Level Timestamps
For precise synchronization with video or audio, you can get timestamps for each word:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("interview.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"]
)
# Each entry in transcript.words is an object with word, start, and end attributes
for word in transcript.words:
    print(f"{word.word}: {word.start} to {word.end}")
# Create a simple interactive transcript
html_transcript = "<div class='interactive-transcript'>"
for word in transcript.words:
    html_transcript += f"<span data-start='{word.start}' data-end='{word.end}'>{word.word}</span> "
html_transcript += "</div>"
with open("interactive_transcript.html", "w") as f:
    f.write(html_transcript)
Streaming Transcriptions
For real-time feedback, stream results as they become available:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("lecture.mp3", "rb")
print("Starting transcription...")
stream = client.audio.transcriptions.create(
    model="gpt-4o-mini-transcribe",
    file=audio_file,
    response_format="text",
    stream=True
)
# Process streaming results: delta events carry incremental text,
# and a final done event carries the complete transcript
full_transcript = ""
for event in stream:
    if event.type == "transcript.text.delta":
        print(f"New segment: {event.delta}")
        full_transcript += event.delta
    elif event.type == "transcript.text.done":
        print(f"Full transcript: {event.text}")
print("\nFinal transcript:", full_transcript)
Practical Applications
Processing Long Recordings
For audio files exceeding the 25MB limit, split them into manageable chunks:
from pydub import AudioSegment
import os
from openai import OpenAI
# Configure client
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
# Load and split the audio
long_audio = AudioSegment.from_mp3("long_lecture.mp3")
chunk_length_ms = 10 * 60 * 1000 # 10 minutes
chunks = [long_audio[i:i+chunk_length_ms] for i in range(0, len(long_audio), chunk_length_ms)]
# Process each chunk with context for better continuity
full_transcript = ""
previous_chunk_end = ""
for i, chunk in enumerate(chunks):
    # Export temporary chunk
    temp_filename = f"temp_chunk_{i}.mp3"
    chunk.export(temp_filename, format="mp3")
    try:
        # Use previous chunk ending as context prompt
        with open(temp_filename, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="gpt-4o-transcribe",
                file=audio_file,
                prompt=previous_chunk_end  # Context from previous chunk
            )
        # Store last ~100 characters for context in next chunk
        if len(transcription.text) > 100:
            previous_chunk_end = transcription.text[-100:]
        else:
            previous_chunk_end = transcription.text
        # Add to full transcript
        full_transcript += transcription.text + "\n\n"
        print(f"Chunk {i+1}/{len(chunks)} transcribed")
    finally:
        # Clean up temporary file
        if os.path.exists(temp_filename):
            os.remove(temp_filename)
# Save complete transcript
with open("complete_transcript.txt", "w", encoding="utf-8") as f:
    f.write(full_transcript)
print("Full transcription complete!")
Improving Accuracy with Domain-Specific Prompts
For specialized content with technical terms or jargon:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("medical_lecture.mp3", "rb")
# Medical terminology prompt
medical_terms = """
The following audio contains medical terminology including:
myocardial infarction, atherosclerosis, thrombosis, ischemia,
hypertension, hyperlipidemia, diabetes mellitus, endocrinology,
electrocardiogram (ECG), echocardiogram, angiography, stethoscope,
sphygmomanometer, otoscope, ophthalmoscope, laparoscope.
"""
transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file,
    response_format="text",
    prompt=medical_terms
)
print(transcription.text)
Post-Processing for Maximum Accuracy
For the highest-quality results, especially with technical content:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
# First, get the raw transcription
audio_file = open("technical_presentation.mp3", "rb")
raw_transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file
)
# Then, post-process with another model
correction_prompt = """
You are a specialized transcription editor. Your task is to:
1. Fix any likely misheard technical terms
2. Add appropriate punctuation and paragraph breaks
3. Correct grammatical errors while preserving the original meaning
4. Format speaker transitions with "Speaker 1:", "Speaker 2:", etc. when detected
5. Do not add or remove content beyond these corrections
Here is the raw transcription to correct:
"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": correction_prompt},
        {"role": "user", "content": raw_transcription.text}
    ],
    temperature=0.1  # Low temperature for more deterministic results
)
corrected_transcript = response.choices[0].message.content
# Save both versions for comparison
with open("raw_transcript.txt", "w", encoding="utf-8") as f:
f.write(raw_transcription.text)
with open("corrected_transcript.txt", "w", encoding="utf-8") as f:
f.write(corrected_transcript)
print("Transcription complete with post-processing corrections.")
Best Practices
Audio Quality Tips
For best results:
- Use a high-quality microphone when possible
- Reduce background noise during recording
- Position speakers close to the microphone
- Use a sampling rate of at least 16kHz
- Choose uncompressed formats like WAV for source recordings (a preprocessing sketch follows this list)
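Several of these tips can be applied programmatically with pydub (also used for chunking above). This is a minimal preprocessing sketch; it assumes ffmpeg is installed and the filenames are placeholders:

from pydub import AudioSegment
from pydub.effects import normalize

# Load the source recording (any format ffmpeg can decode)
audio = AudioSegment.from_file("raw_recording.m4a")

# Downmix to mono, resample to 16kHz, and normalize loudness
audio = audio.set_channels(1).set_frame_rate(16000)
audio = normalize(audio)

# Export as uncompressed WAV for transcription
audio.export("prepared_recording.wav", format="wav")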
Model Selection Guidelines
Use Case | Recommended Model |
---|---|
General transcription | whisper-1 |
Subtitle generation | whisper-1 (with srt or vtt formats) |
Multi-speaker content | gpt-4o-transcribe |
Technical/specialized content | gpt-4o-transcribe with domain prompt |
Real-time applications | gpt-4o-mini-transcribe |
Low latency needs | gpt-4o-mini-transcribe |
Highest accuracy needs | gpt-4o-transcribe with post-processing |
Performance Optimization
For production applications:
- Use streaming for real-time user feedback
- Process audio in parallel for large batches
- Implement retry logic for occasional API failures (see the retry sketch after this list)
- Pre-process audio to improve quality (noise reduction, normalization)
- Cache results for frequently used audio files
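As an illustration of the retry recommendation, here is a minimal sketch using exponential backoff; the wrapper function, retry counts, and filenames are illustrative rather than part of the API:

import time
from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

def transcribe_with_retries(path, model="gpt-4o-mini-transcribe", max_attempts=3):
    # Retry transient failures with exponential backoff (1s, 2s, 4s, ...)
    for attempt in range(max_attempts):
        try:
            with open(path, "rb") as audio_file:
                result = client.audio.transcriptions.create(model=model, file=audio_file)
            return result.text
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            wait = 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

print(transcribe_with_retries("meeting.mp3"))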