Knowledge · May 11, 2026 · 4 min read

AssemblyAI Universal-2 — Streaming STT for Voice Agents

AssemblyAI Universal-2 is a production STT model with sub-500ms streaming latency, 99 languages, speaker diarization, smart formatting, and an OpenAI-compatible audio endpoint.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Installation: Stage only · 15/100
Agent surface: Any MCP/CLI agent
Type: Knowledge
Trust: New
Entry point: Asset

Universal CLI command
npx tokrepo install 7b08a0b5-b5a1-4586-b32a-616f26d389ec
Introduction

Universal-2 is AssemblyAI's latest production STT model — sub-500ms streaming latency, 99 languages, automatic speaker diarization, smart formatting (currency, dates, addresses, profanity filter), and an OpenAI-compatible audio.transcriptions endpoint for drop-in migration. Best for: voice agents on calls, meeting transcription, accessibility captions, multilingual support flows. Works with: Python, Node, Go SDKs; REST; streaming WebSocket; OpenAI-compatible API. Setup time: 5 minutes.


Batch transcription (file)

import os
import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
    "meeting.mp3",
    config=aai.TranscriptionConfig(
        speaker_labels=True,
        language_detection=True,
        punctuate=True,
        format_text=True,
        speech_model=aai.SpeechModel.universal,   # Universal-2
    ),
)

for u in transcript.utterances:
    print(f"Speaker {u.speaker}: {u.text}")

Real-time streaming (WebSocket)

import os
import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

def on_data(transcript: aai.RealtimeTranscript):
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"FINAL: {transcript.text}")
    else:
        print(f"partial: {transcript.text}")

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda e: print(f"err: {e}"),
)
transcriber.connect()
transcriber.stream(mic_audio_iterator())   # bytes iterator
transcriber.close()
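
The mic_audio_iterator() above is a placeholder for any iterator that yields raw PCM audio bytes at the configured sample rate. One option is the SDK's optional extras package (pip install "assemblyai[extras]", which pulls in PyAudio); a minimal sketch under that assumption:

import assemblyai as aai

# MicrophoneStream reads 16 kHz PCM chunks from the default input device
# and can be passed straight to RealtimeTranscriber.stream().
def mic_audio_iterator():
    return aai.extras.MicrophoneStream(sample_rate=16_000)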

OpenAI-compatible (zero-code migration)

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.assemblyai.com/v1",
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
)
transcript = client.audio.transcriptions.create(
    model="universal-2",
    file=open("audio.mp3", "rb"),
    response_format="verbose_json",
)
print(transcript.text)

Feature flags worth knowing

Flag | What it does
speaker_labels | Diarize 2-10 speakers automatically
auto_chapters | Generate chapter summaries every ~5 min
entity_detection | Tag PII (person, org, location, card, phone)
pii_redaction | Replace detected PII with [REDACTED]
sentiment_analysis | Per-sentence sentiment scores
summarization | Auto-generate a transcript summary
language_detection | Detect the spoken language, no need to pre-specify
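
Most of these map to boolean fields on aai.TranscriptionConfig and can be combined in a single request. A minimal sketch with four of them (PII redaction takes additional policy parameters in the Python SDK, so it is left out here; check the SDK reference for the exact names):

import os
import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

# Several flags combined on one TranscriptionConfig.
config = aai.TranscriptionConfig(
    speaker_labels=True,      # who said what
    entity_detection=True,    # tag PII entities in the text
    sentiment_analysis=True,  # per-sentence sentiment
    auto_chapters=True,       # ~5-minute chapter summaries
)
transcript = aai.Transcriber().transcribe("meeting.mp3", config=config)

for chapter in transcript.chapters:
    print(chapter.headline)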

FAQ

Q: Universal-2 vs Whisper-large-v3? A: Universal-2 has better diarization, smart formatting, and per-language tuning — best for production English/Spanish calls. Whisper-large-v3 has broader low-resource language coverage and is open-weight. For voice agents and call centers, Universal-2 typically wins on word error rate and formatting.

Q: How accurate is the speaker diarization? A: On clean two-speaker call audio, ~95% accuracy. Drops to ~85-90% with 4+ speakers, overlapping speech, or heavy background noise. For high-stakes diarization (e.g., legal transcripts), add human-in-the-loop review of speaker-cluster boundaries.
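
One lightweight way to do that review is to flag utterances whose confidence score falls below a threshold; a minimal sketch, assuming a transcript produced with speaker_labels=True as in the batch example above (the 0.80 cutoff is an arbitrary assumption to tune per use case):

# Flag low-confidence utterances for manual review of speaker assignment.
REVIEW_THRESHOLD = 0.80  # assumed cutoff, not an official recommendation

for u in transcript.utterances:
    if u.confidence < REVIEW_THRESHOLD:
        print(f"[REVIEW] Speaker {u.speaker} ({u.confidence:.2f}): {u.text}")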

Q: Pricing? A: Streaming: $0.47/hr. Batch async: $0.37/hr (Universal-2 default). Plus add-ons per feature (speaker labels +$0.13/hr, summarization +$0.13/hr, etc). Free $50 trial credit. See assemblyai.com/pricing.
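
For example, at those rates a 10-hour batch of call recordings with speaker labels would come to roughly 10 × ($0.37 + $0.13) = $5.00, before any other add-ons.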


Quick Use

  1. pip install assemblyai
  2. aai.settings.api_key = ASSEMBLYAI_KEY
  3. aai.Transcriber().transcribe(file_or_url) for batch, RealtimeTranscriber for streaming

Source & Thanks

Built by AssemblyAI. API docs at assemblyai.com/docs.

AssemblyAI/assemblyai-python-sdk — official SDK

🙏
