Speech to Text

Provider Disclosure: VoidAI offers speech-to-text services powered by multiple providers, primarily OpenAI. The specific provider used depends on the model you select in your API call.

Convert audio recordings into accurate text transcriptions with VoidAI's Speech-to-Text API, which leverages powerful technology from our provider partners.

Overview

VoidAI's Audio API provides two primary speech recognition endpoints powered by advanced technology:

  • Transcriptions: Convert speech to text in the original language
  • Translations: Convert speech to English text, regardless of the source language

Available Models

We offer a range of models with different capabilities:

| Model | Description | Use Case |
|---|---|---|
| whisper-1 | Versatile baseline model | General transcription and translation with full parameter support |
| gpt-4o-mini-transcribe | Improved accuracy model | Higher quality transcriptions with faster processing |
| gpt-4o-transcribe | Premium accuracy model | Highest quality transcriptions for professional use |

All models support files up to 25MB in these formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
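
To fail fast before uploading, you can check these limits client-side. Below is a minimal sketch (the helper name and error messages are illustrative, not part of the API; the size cap and extension list come from the constraints above):

from pathlib import Path

MAX_BYTES = 25 * 1024 * 1024  # 25MB upload limit
SUPPORTED_FORMATS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

def validate_audio(path: str) -> Path:
    """Raise ValueError if the file is too large or in an unsupported format."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"File exceeds the 25MB limit: {p.stat().st_size} bytes")
    return p

validate_audio("recording.mp3")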

Getting Started

Basic Transcription

To transcribe audio in its original language:

from openai import OpenAI

# Initialize the client
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

# Open the audio file
audio_file = open("recording.mp3", "rb")

# Create the transcription
transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file
)

# Print the result
print(transcription.text)

Response Formats

By default, the API returns JSON responses. For whisper-1, you can request various formats:

| Format | Description | Use Case |
|---|---|---|
| json | Simple JSON with text | Default format for all models |
| text | Plain text response | Simple integration scenarios |
| srt | SubRip subtitle format | Video captioning |
| vtt | WebVTT subtitle format | Web video captioning |
| verbose_json | Detailed JSON with metadata | Advanced applications needing metadata |

For gpt-4o models, only json and text formats are currently supported.

Example with custom format:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("podcast.mp3", "rb")

# Request SRT format for video subtitles
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)

# Save directly to a subtitle file
# (non-JSON formats are returned as a plain string, not an object with .text)
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(transcription)

Translation to English

To translate foreign-language audio directly to English text:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("spanish_interview.mp3", "rb")

# Translate to English
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)

print(translation.text)

Advanced Features

Word-Level Timestamps

For precise synchronization with video or audio, you can get timestamps for each word:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("interview.mp3", "rb")

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"]
)

# Example of accessing word timestamps
# (each entry exposes word, start, and end attributes)
for word in transcript.words:
    print(f"{word.word}: {word.start} to {word.end}")

# Create a simple interactive transcript
html_transcript = "<div class='interactive-transcript'>"
for word in transcript.words:
    html_transcript += f"<span data-start='{word.start}' data-end='{word.end}'>{word.word}</span> "
html_transcript += "</div>"

with open("interactive_transcript.html", "w") as f:
    f.write(html_transcript)

Streaming Transcriptions

For real-time feedback, stream results as they become available:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("lecture.mp3", "rb")

print("Starting transcription...")
stream = client.audio.transcriptions.create(
    model="gpt-4o-mini-transcribe",
    file=audio_file,
    response_format="text",
    stream=True
)

# Process streaming results: delta events carry incremental text,
# and the final done event carries the complete transcript
full_transcript = ""
for event in stream:
    if event.type == "transcript.text.delta":
        print(f"New segment: {event.delta}")
        full_transcript += event.delta
    elif event.type == "transcript.text.done":
        full_transcript = event.text

print("\nFinal transcript:", full_transcript)

Practical Applications

Processing Long Recordings

For audio files exceeding the 25MB limit, split them into manageable chunks:

from pydub import AudioSegment
import os
from openai import OpenAI

# Configure client
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

# Load and split the audio
long_audio = AudioSegment.from_mp3("long_lecture.mp3")
chunk_length_ms = 10 * 60 * 1000 # 10 minutes
chunks = [long_audio[i:i+chunk_length_ms] for i in range(0, len(long_audio), chunk_length_ms)]

# Process each chunk with context for better continuity
full_transcript = ""
previous_chunk_end = ""

for i, chunk in enumerate(chunks):
    # Export temporary chunk
    temp_filename = f"temp_chunk_{i}.mp3"
    chunk.export(temp_filename, format="mp3")

    try:
        # Use previous chunk ending as context prompt
        with open(temp_filename, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="gpt-4o-transcribe",
                file=audio_file,
                prompt=previous_chunk_end  # Context from previous chunk
            )

        # Store last ~100 characters for context in next chunk
        if len(transcription.text) > 100:
            previous_chunk_end = transcription.text[-100:]
        else:
            previous_chunk_end = transcription.text

        # Add to full transcript
        full_transcript += transcription.text + "\n\n"
        print(f"Chunk {i+1}/{len(chunks)} transcribed")

    finally:
        # Clean up temporary file
        if os.path.exists(temp_filename):
            os.remove(temp_filename)

# Save complete transcript
with open("complete_transcript.txt", "w", encoding="utf-8") as f:
f.write(full_transcript)

print("Full transcription complete!")

Improving Accuracy with Domain-Specific Prompts

For specialized content with technical terms or jargon:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("medical_lecture.mp3", "rb")

# Medical terminology prompt
medical_terms = """
The following audio contains medical terminology including:
myocardial infarction, atherosclerosis, thrombosis, ischemia,
hypertension, hyperlipidemia, diabetes mellitus, endocrinology,
electrocardiogram (ECG), echocardiogram, angiography, stethoscope,
sphygmomanometer, otoscope, ophthalmoscope, laparoscope.
"""

transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file,
    response_format="text",
    prompt=medical_terms
)

# response_format="text" returns a plain string
print(transcription)

Post-Processing for Maximum Accuracy

For the highest-quality results, especially with technical content:

from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

# First, get the raw transcription
audio_file = open("technical_presentation.mp3", "rb")
raw_transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file
)

# Then, post-process with another model
correction_prompt = """
You are a specialized transcription editor. Your task is to:
1. Fix any likely misheard technical terms
2. Add appropriate punctuation and paragraph breaks
3. Correct grammatical errors while preserving the original meaning
4. Format speaker transitions with "Speaker 1:", "Speaker 2:", etc. when detected
5. Do not add or remove content beyond these corrections

Here is the raw transcription to correct:
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": correction_prompt},
        {"role": "user", "content": raw_transcription.text}
    ],
    temperature=0.1  # Low temperature for more deterministic results
)

corrected_transcript = response.choices[0].message.content

# Save both versions for comparison
with open("raw_transcript.txt", "w", encoding="utf-8") as f:
f.write(raw_transcription.text)

with open("corrected_transcript.txt", "w", encoding="utf-8") as f:
f.write(corrected_transcript)

print("Transcription complete with post-processing corrections.")

Best Practices

Audio Quality Tips

For best results:

  • Use a high-quality microphone when possible
  • Reduce background noise during recording
  • Position speakers close to the microphone
  • Use a sampling rate of at least 16kHz
  • Choose uncompressed formats like WAV for source recordings (see the preparation sketch after this list)
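
A minimal pydub sketch of that preparation step (assuming pydub and ffmpeg are installed; the file names are illustrative):

from pydub import AudioSegment, effects

# Load the source recording, downmix to mono, and resample to 16kHz
audio = AudioSegment.from_file("raw_recording.m4a")
audio = audio.set_frame_rate(16000).set_channels(1)

# Normalize loudness so quiet speech is easier to recognize
audio = effects.normalize(audio)

# Export as uncompressed WAV for transcription
audio.export("prepared_recording.wav", format="wav")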

Model Selection Guidelines

| Use Case | Recommended Model |
|---|---|
| General transcription | whisper-1 |
| Subtitle generation | whisper-1 (with srt or vtt formats) |
| Multi-speaker content | gpt-4o-transcribe |
| Technical/specialized content | gpt-4o-transcribe with domain prompt |
| Real-time applications | gpt-4o-mini-transcribe |
| Low latency needs | gpt-4o-mini-transcribe |
| Highest accuracy needs | gpt-4o-transcribe with post-processing |

Performance Optimization

For production applications:

  1. Use streaming for real-time user feedback
  2. Process audio in parallel for large batches
  3. Implement retry logic for occasional API failures (see the sketch after this list)
  4. Pre-process audio to improve quality (noise reduction, normalization)
  5. Cache results for frequently used audio files
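
A minimal sketch of points 2 and 3 combined (the helper name, worker count, and backoff values are our own choices, not API requirements):

import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

def transcribe_with_retry(path: str, retries: int = 3) -> str:
    """Transcribe one file, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            with open(path, "rb") as audio_file:
                result = client.audio.transcriptions.create(
                    model="gpt-4o-mini-transcribe",
                    file=audio_file
                )
            return result.text
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # Back off 1s, 2s, 4s, ...

# Process a batch of files in parallel
files = ["part1.mp3", "part2.mp3", "part3.mp3"]
with ThreadPoolExecutor(max_workers=4) as pool:
    transcripts = list(pool.map(transcribe_with_retry, files))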