Quick Use
pip install assemblyai
aai.settings.api_key = ASSEMBLYAI_KEY
aai.Transcriber().transcribe(file_or_url) for batch; RealtimeTranscriber for streaming
Intro
Universal-2 is AssemblyAI's latest production STT model — sub-500ms streaming latency, 99 languages, automatic speaker diarization, smart formatting (currency, dates, addresses, profanity filter), and an OpenAI-compatible audio.transcriptions endpoint for drop-in migration. Best for: voice agents on calls, meeting transcription, accessibility captions, multilingual support flows. Works with: Python, Node, Go SDKs; REST; streaming WebSocket; OpenAI-compatible API. Setup time: 5 minutes.
Batch transcription (file)
import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
    "meeting.mp3",
    config=aai.TranscriptionConfig(
        speaker_labels=True,
        language_detection=True,
        punctuate=True,
        format_text=True,
        speech_model=aai.SpeechModel.universal,  # Universal-2
    ),
)
for u in transcript.utterances:
    print(f"Speaker {u.speaker}: {u.text}")

Real-time streaming (WebSocket)
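The streaming example below passes a `mic_audio_iterator()` helper that is not defined in this snippet. A minimal stand-in sketch (the name, chunk size, and silence buffer are placeholders, assuming 16 kHz 16-bit mono PCM fed in ~100 ms chunks — a real app would read frames from a microphone capture library instead):

```python
def mic_audio_iterator(pcm: bytes = b"\x00" * 32_000, chunk_bytes: int = 3_200):
    """Yield successive ~100 ms chunks of 16 kHz, 16-bit mono PCM.

    `pcm` here is 1 second of silence standing in for live mic frames;
    3_200 bytes = 1_600 samples = 100 ms at 16 kHz, 2 bytes/sample.
    """
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

# 32_000 bytes of audio in 3_200-byte chunks -> 10 chunks
chunks = list(mic_audio_iterator())
```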
import assemblyai as aai
def on_data(transcript: aai.RealtimeTranscript):
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"FINAL: {transcript.text}")
    else:
        print(f"partial: {transcript.text}")
transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda e: print(f"err: {e}"),
)
transcriber.connect()
transcriber.stream(mic_audio_iterator()) # bytes iterator
transcriber.close()

OpenAI-compatible (zero-code migration)
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.assemblyai.com/v1",
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
)
transcript = client.audio.transcriptions.create(
    model="universal-2",
    file=open("audio.mp3", "rb"),
    response_format="verbose_json",
)
print(transcript.text)

Feature flags worth knowing
| Flag | What it does |
|---|---|
| `speaker_labels` | Diarize 2-10 speakers automatically |
| `auto_chapters` | Generate chapter summaries every ~5 min |
| `entity_detection` | Tag PII (person, org, location, card, phone) |
| `pii_redaction` | Replace detected PII with `[REDACTED]` |
| `sentiment_analysis` | Per-sentence sentiment scores |
| `summarization` | Auto-generate transcript summary |
| `language_detection` | Detect spoken language, no need to pre-specify |
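Several of these flags can be combined in a single request. A hedged sketch of a JSON body for a batch request using field names taken from the table above (the exact parameter names and shapes vary by SDK/API version, and `audio_url` is a placeholder — verify against the current API reference before relying on this):

```python
import json

# Combine several add-ons from the table above in one request body.
# Field names mirror the table; confirm against the live API docs.
body = {
    "audio_url": "https://example.com/meeting.mp3",  # placeholder URL
    "speaker_labels": True,        # diarization
    "entity_detection": True,      # tag PII entities
    "sentiment_analysis": True,    # per-sentence sentiment
    "summarization": True,         # auto-generated summary
    "language_detection": True,    # no need to pre-specify language
}
payload = json.dumps(body)
```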
FAQ
Q: Universal-2 vs Whisper-large-v3? A: Universal-2 has better diarization, smart formatting, and per-language tuning — best for production English/Spanish calls. Whisper-large-v3 has broader low-resource language coverage and is open-weight. For voice agents and call centers, Universal-2 typically wins on word error rate and formatting.
Q: How accurate is the speaker diarization? A: On clean two-speaker call audio, ~95% accuracy. Drops to ~85-90% with 4+ speakers, overlapping speech, or heavy background noise. For high-stakes diarization (e.g. legal transcripts), add a human-in-the-loop review of speaker cluster boundaries.
Q: Pricing? A: Streaming: $0.47/hr. Batch async: $0.37/hr (Universal-2 default). Plus add-ons per feature (speaker labels +$0.13/hr, summarization +$0.13/hr, etc). Free $50 trial credit. See assemblyai.com/pricing.
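At those rates, cost is straightforward to estimate; a back-of-envelope helper using the per-hour prices quoted above (the function is purely illustrative — check assemblyai.com/pricing for current rates):

```python
def batch_cost(hours: float, rate_per_hr: float = 0.37,
               addons_per_hr: float = 0.0) -> float:
    """Estimated batch transcription cost in USD.

    rate_per_hr defaults to the Universal-2 batch rate quoted above;
    addons_per_hr is the sum of per-feature add-on rates.
    """
    return round(hours * (rate_per_hr + addons_per_hr), 2)

# 100 hours of batch audio with speaker labels (+$0.13/hr):
cost = batch_cost(100, addons_per_hr=0.13)
# 100 * (0.37 + 0.13) = $50.00
```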
Source & Thanks
Built by AssemblyAI. API docs at assemblyai.com/docs.
AssemblyAI/assemblyai-python-sdk — official SDK