Speech to Text
Provider Disclosure: VoidAI offers speech-to-text services powered by multiple providers, primarily OpenAI. The specific provider used depends on the model you select in your API call.
Convert audio recordings into accurate text transcriptions with VoidAI's Speech-to-Text API, which leverages powerful technology from our provider partners.
Overview
VoidAI's Audio API provides two primary speech recognition endpoints powered by advanced technology:
- Transcriptions: Convert speech to text in the original language
- Translations: Convert speech to English text, regardless of the source language
Available Models
We offer a range of models with different capabilities:
Model | Description | Use Case |
---|---|---|
whisper-1 | Versatile baseline model | General transcription and translation with full parameter support |
gpt-4o-mini-transcribe | Improved accuracy model | Higher quality transcriptions with faster processing |
gpt-4o-transcribe | Premium accuracy model | Highest quality transcriptions for professional use |
All models support files up to 25MB in these formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
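Since requests with oversized or unsupported files will fail, it can be worth validating audio before uploading. The snippet below is a minimal sketch based on the limits listed above; the helper name and error handling are illustrative, not part of the SDK:

import os

# Values taken from the limits listed above
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # 25MB

def validate_audio_file(path: str) -> None:
    # Illustrative pre-flight check; raises before any API call is made
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {ext}")
    if os.path.getsize(path) > MAX_FILE_BYTES:
        raise ValueError("File exceeds the 25MB limit; split it first (see Processing Long Recordings below)")

validate_audio_file("recording.mp3")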
Getting Started
Basic Transcription
To transcribe audio in its original language:
from openai import OpenAI
# Initialize the client
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
# Open the audio file
audio_file = open("recording.mp3", "rb")
# Create the transcription
transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file
)
# Print the result
print(transcription.text)
Response Formats
By default, the API returns JSON responses. For whisper-1, you can request various formats:
Format | Description | Use Case |
---|---|---|
json | Simple JSON with text | Default format for all models |
text | Plain text response | Simple integration scenarios |
srt | SubRip subtitle format | Video captioning |
vtt | WebVTT subtitle format | Web video captioning |
verbose_json | Detailed JSON with metadata | Advanced applications needing metadata |
For gpt-4o models, only json and text formats are currently supported.
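Because format support varies by model, you may want to guard against unsupported combinations before calling the API. This is a small sketch derived from the tables above; it is not an official client feature:

# Supported response formats per model family, per the tables above
WHISPER_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}
GPT_4O_FORMATS = {"json", "text"}

def check_response_format(model: str, response_format: str) -> None:
    supported = WHISPER_FORMATS if model == "whisper-1" else GPT_4O_FORMATS
    if response_format not in supported:
        raise ValueError(f"{model} does not support response_format='{response_format}'")

check_response_format("whisper-1", "srt")            # fine
# check_response_format("gpt-4o-transcribe", "srt")  # would raise ValueError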
Example with custom format:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("podcast.mp3", "rb")
# Request SRT format for video subtitles
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)
# Save directly to a subtitle file
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(transcription.text)
Translation to English
To translate foreign-language audio directly to English text:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("spanish_interview.mp3", "rb")
# Translate to English
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)
print(translation.text)
Advanced Features
Word-Level Timestamps
For precise synchronization with video or audio, you can get timestamps for each word:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("interview.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"]
)
# Each entry in transcript.words is an object with word, start, and end attributes
for word in transcript.words:
    print(f"{word.word}: {word.start} to {word.end}")
# Create a simple interactive transcript
html_transcript = "<div class='interactive-transcript'>"
for word in transcript.words:
    html_transcript += f"<span data-start='{word.start}' data-end='{word.end}'>{word.word}</span> "
html_transcript += "</div>"
with open("interactive_transcript.html", "w") as f:
    f.write(html_transcript)
Streaming Transcriptions
For real-time feedback, stream results as they become available:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("lecture.mp3", "rb")
print("Starting transcription...")
stream = client.audio.transcriptions.create(
    model="gpt-4o-mini-transcribe",
    file=audio_file,
    response_format="text",
    stream=True
)
# Process streaming results: delta events carry incremental text,
# and a final done event carries the complete transcript
full_transcript = ""
for event in stream:
    if event.type == "transcript.text.delta":
        print(f"New segment: {event.delta}")
        full_transcript += event.delta
    elif event.type == "transcript.text.done":
        print(f"Full transcript: {event.text}")
print("\nFinal transcript:", full_transcript)
Practical Applications
Processing Long Recordings
For audio files exceeding the 25MB limit, split them into manageable chunks:
from pydub import AudioSegment
import os
from openai import OpenAI
# Configure client
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
# Load and split the audio
long_audio = AudioSegment.from_mp3("long_lecture.mp3")
chunk_length_ms = 10 * 60 * 1000 # 10 minutes
chunks = [long_audio[i:i+chunk_length_ms] for i in range(0, len(long_audio), chunk_length_ms)]
# Process each chunk with context for better continuity
full_transcript = ""
previous_chunk_end = ""
for i, chunk in enumerate(chunks):
    # Export temporary chunk
    temp_filename = f"temp_chunk_{i}.mp3"
    chunk.export(temp_filename, format="mp3")
    try:
        # Use previous chunk ending as context prompt
        with open(temp_filename, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="gpt-4o-transcribe",
                file=audio_file,
                prompt=previous_chunk_end  # Context from previous chunk
            )
        # Store last ~100 characters for context in next chunk
        if len(transcription.text) > 100:
            previous_chunk_end = transcription.text[-100:]
        else:
            previous_chunk_end = transcription.text
        # Add to full transcript
        full_transcript += transcription.text + "\n\n"
        print(f"Chunk {i+1}/{len(chunks)} transcribed")
    finally:
        # Clean up temporary file
        if os.path.exists(temp_filename):
            os.remove(temp_filename)
# Save complete transcript
with open("complete_transcript.txt", "w", encoding="utf-8") as f:
    f.write(full_transcript)
print("Full transcription complete!")
Improving Accuracy with Domain-Specific Prompts
For specialized content with technical terms or jargon:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
audio_file = open("medical_lecture.mp3", "rb")
# Medical terminology prompt
medical_terms = """
The following audio contains medical terminology including:
myocardial infarction, atherosclerosis, thrombosis, ischemia,
hypertension, hyperlipidemia, diabetes mellitus, endocrinology,
electrocardiogram (ECG), echocardiogram, angiography, stethoscope,
sphygmomanometer, otoscope, ophthalmoscope, laparoscope.
"""
transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file,
    response_format="text",
    prompt=medical_terms
)
print(transcription.text)
Post-Processing for Maximum Accuracy
For the highest-quality results, especially with technical content:
from openai import OpenAI
client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")
# First, get the raw transcription
audio_file = open("technical_presentation.mp3", "rb")
raw_transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file
)
# Then, post-process with another model
correction_prompt = """
You are a specialized transcription editor. Your task is to:
1. Fix any likely misheard technical terms
2. Add appropriate punctuation and paragraph breaks
3. Correct grammatical errors while preserving the original meaning
4. Format speaker transitions with "Speaker 1:", "Speaker 2:", etc. when detected
5. Do not add or remove content beyond these corrections
Here is the raw transcription to correct:
"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": correction_prompt},
        {"role": "user", "content": raw_transcription.text}
    ],
    temperature=0.1  # Low temperature for more deterministic results
)
corrected_transcript = response.choices[0].message.content
# Save both versions for comparison
with open("raw_transcript.txt", "w", encoding="utf-8") as f:
f.write(raw_transcription.text)
with open("corrected_transcript.txt", "w", encoding="utf-8") as f:
f.write(corrected_transcript)
print("Transcription complete with post-processing corrections.")
Best Practices
Audio Quality Tips
For best results:
- Use a high-quality microphone when possible
- Reduce background noise during recording
- Position speakers close to the microphone
- Use a sampling rate of at least 16kHz
- Choose uncompressed formats like WAV for source recordings (a preprocessing sketch follows this list)
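Several of these tips can be applied programmatically with pydub (also used for chunking above). This is a minimal preprocessing sketch; it assumes ffmpeg is installed and the filenames are placeholders:

from pydub import AudioSegment
from pydub.effects import normalize

# Load the source recording (any format ffmpeg can decode)
audio = AudioSegment.from_file("raw_recording.m4a")

# Downmix to mono, resample to 16kHz, and normalize loudness
audio = audio.set_channels(1).set_frame_rate(16000)
audio = normalize(audio)

# Export as uncompressed WAV for transcription
audio.export("prepared_recording.wav", format="wav")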
Model Selection Guidelines
Use Case | Recommended Model |
---|---|
General transcription | whisper-1 |
Subtitle generation | whisper-1 (with srt or vtt formats) |
Multi-speaker content | gpt-4o-transcribe |
Technical/specialized content | gpt-4o-transcribe with domain prompt |
Real-time applications | gpt-4o-mini-transcribe |
Low latency needs | gpt-4o-mini-transcribe |
Highest accuracy needs | gpt-4o-transcribe with post-processing |
Performance Optimization
For production applications:
- Use streaming for real-time user feedback
- Process audio in parallel for large batches
- Implement retry logic for occasional API failures (see the retry sketch after this list)
- Pre-process audio to improve quality (noise reduction, normalization)
- Cache results for frequently used audio files
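As an illustration of the retry recommendation, here is a minimal sketch using exponential backoff; the wrapper function, retry counts, and filenames are illustrative rather than part of the API:

import time
from openai import OpenAI

client = OpenAI(api_key="yourapikey", base_url="https://api.voidai.app/v1")

def transcribe_with_retries(path, model="gpt-4o-mini-transcribe", max_attempts=3):
    # Retry transient failures with exponential backoff (1s, 2s, 4s, ...)
    for attempt in range(max_attempts):
        try:
            with open(path, "rb") as audio_file:
                result = client.audio.transcriptions.create(model=model, file=audio_file)
            return result.text
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            wait = 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

print(transcribe_with_retries("meeting.mp3"))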