Knowledge · May 11, 2026 · 4 min read

AssemblyAI Universal-2 — Streaming STT for Voice Agents

AssemblyAI Universal-2 is a production STT model with sub-500ms streaming latency, 99 languages, speaker diarization, and smart formatting, plus an OpenAI-compatible audio endpoint.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 15/100
Agent surface
Any MCP/CLI agent
Kind
Knowledge
Install
Stage only
Trust
Trust: New
Entrypoint
Asset
Universal CLI install command
npx tokrepo install 7b08a0b5-b5a1-4586-b32a-616f26d389ec
Intro

Universal-2 is AssemblyAI's latest production STT model — sub-500ms streaming latency, 99 languages, automatic speaker diarization, smart formatting (currency, dates, addresses, profanity filter), and an OpenAI-compatible audio.transcriptions endpoint for drop-in migration. Best for: voice agents on calls, meeting transcription, accessibility captions, multilingual support flows. Works with: Python, Node, Go SDKs; REST; streaming WebSocket; OpenAI-compatible API. Setup time: 5 minutes.


Batch transcription (file)

import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
    "meeting.mp3",
    config=aai.TranscriptionConfig(
        speaker_labels=True,
        language_detection=True,
        punctuate=True,
        format_text=True,
        speech_model=aai.SpeechModel.universal,   # Universal-2
    ),
)

for u in transcript.utterances:
    print(f"Speaker {u.speaker}: {u.text}")
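The utterance loop above can be turned into a timestamped transcript. A minimal sketch, assuming each utterance exposes `speaker`, `text`, and a millisecond `start` offset (plain dicts are used here so the snippet runs without the SDK):

```python
def format_utterances(utterances):
    """Render diarized utterances as timestamped transcript lines.

    Each utterance is assumed to carry `speaker`, `text`, and a
    millisecond `start` offset; plain dicts stand in for the SDK's
    utterance objects so this sketch is self-contained.
    """
    lines = []
    for u in utterances:
        minutes, seconds = divmod(u["start"] // 1000, 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] Speaker {u['speaker']}: {u['text']}")
    return "\n".join(lines)


print(format_utterances([
    {"speaker": "A", "start": 0, "text": "Hi, thanks for calling."},
    {"speaker": "B", "start": 72_500, "text": "Hi, I have a billing question."},
]))
# [00:00] Speaker A: Hi, thanks for calling.
# [01:12] Speaker B: Hi, I have a billing question.
```

With real SDK utterances, swap the dict lookups for attribute access (`u.speaker`, `u.start`, `u.text`).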

Real-time streaming (WebSocket)

import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

def on_data(transcript: aai.RealtimeTranscript):
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"FINAL: {transcript.text}")
    else:
        print(f"partial: {transcript.text}")

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda e: print(f"err: {e}"),
)
transcriber.connect()
transcriber.stream(mic_audio_iterator())   # your iterator of raw PCM16 byte chunks (hypothetical helper)
transcriber.close()
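The `on_data` callback above interleaves partials and finals; a voice agent typically keeps only the latest partial and appends each final. A minimal, SDK-free sketch of that bookkeeping, with `(is_final, text)` tuples standing in for the transcript objects:

```python
class TranscriptBuffer:
    """Accumulate finalized text while tracking the latest partial."""

    def __init__(self):
        self.finals = []
        self.partial = ""

    def on_event(self, is_final, text):
        if is_final:
            self.finals.append(text)
            self.partial = ""          # the final supersedes any partial
        else:
            self.partial = text        # each partial replaces the previous one

    @property
    def current_text(self):
        # Finalized sentences plus whatever is still being spoken.
        return " ".join(self.finals + ([self.partial] if self.partial else []))


buf = TranscriptBuffer()
buf.on_event(False, "hello")
buf.on_event(False, "hello world")
buf.on_event(True, "Hello world.")
buf.on_event(False, "how are")
print(buf.current_text)  # Hello world. how are
```

Inside the real callback, feed it `isinstance(transcript, aai.RealtimeFinalTranscript)` and `transcript.text`.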

OpenAI-compatible (zero-code migration)

import os

from openai import OpenAI
client = OpenAI(
    base_url="https://api.assemblyai.com/v1",
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
)
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="universal-2",
        file=audio_file,
        response_format="verbose_json",
    )
print(transcript.text)

Feature flags worth knowing

Flag                 What it does
speaker_labels       Diarize 2-10 speakers automatically
auto_chapters        Generate chapter summaries every ~5 min
entity_detection     Tag PII entities (person, org, location, card, phone)
pii_redaction        Replace detected PII with [REDACTED]
sentiment_analysis   Per-sentence sentiment scores
summarization        Auto-generate a transcript summary
language_detection   Detect the spoken language; no need to pre-specify
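The flags above map onto the JSON body of a batch transcription request. A sketch of such a payload as a plain Python dict — note that PII redaction is spelled `redact_pii` in the API, the audio URL is a placeholder, and the policy names here are illustrative, so verify exact field names against the API reference:

```python
# Illustrative request body for a batch transcription job enabling
# several add-ons. Field names follow the flags in the table above;
# PII redaction is exposed as `redact_pii` plus an explicit policy
# list (policy names shown are examples -- check the docs).
payload = {
    "audio_url": "https://example.com/meeting.mp3",  # placeholder
    "speaker_labels": True,
    "entity_detection": True,
    "redact_pii": True,
    "redact_pii_policies": ["person_name", "credit_card_number", "phone_number"],
    "sentiment_analysis": True,
    "language_detection": True,
}

print(sorted(payload))
```

The same flags are exposed as keyword arguments on `aai.TranscriptionConfig` in the Python SDK.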

FAQ

Q: Universal-2 vs Whisper-large-v3? A: Universal-2 has better diarization, smart formatting, and per-language tuning — best for production English/Spanish calls. Whisper-large-v3 has broader low-resource language coverage and is open-weight. For voice agents and call centers, Universal-2 typically wins on word error rate and formatting.

Q: How accurate is the speaker diarization? A: On clean two-speaker call audio, ~95% accuracy. It drops to ~85-90% with 4+ speakers, overlapping speech, or heavy background noise. For high-stakes diarization (e.g., legal transcripts), add human-in-the-loop review of speaker cluster boundaries.

Q: Pricing? A: Streaming: $0.47/hr. Batch async: $0.37/hr (Universal-2 default). Add-ons are billed per feature (speaker labels +$0.13/hr, summarization +$0.13/hr, etc.). Free $50 trial credit. See assemblyai.com/pricing.
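Those per-hour rates make cost planning simple arithmetic. A rough estimator using the prices quoted above (verify current rates at assemblyai.com/pricing before budgeting):

```python
BATCH_RATE = 0.37    # $/audio-hour, Universal-2 batch async
STREAM_RATE = 0.47   # $/audio-hour, streaming
ADDON_RATE = 0.13    # $/audio-hour per add-on (e.g. speaker labels)


def monthly_cost(audio_hours, streaming=False, addons=0):
    """Estimated monthly spend in dollars for a given audio volume."""
    base = STREAM_RATE if streaming else BATCH_RATE
    return round(audio_hours * (base + addons * ADDON_RATE), 2)


# 200 hours of batch calls with speaker labels enabled:
print(monthly_cost(200, addons=1))   # 100.0
```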


Quick Use

  1. pip install assemblyai
  2. aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
  3. aai.Transcriber().transcribe(file_or_url) for batch; aai.RealtimeTranscriber for streaming


Source & Thanks

Built by AssemblyAI. API docs at assemblyai.com/docs.

AssemblyAI/assemblyai-python-sdk — official SDK

🙏
