Knowledge · May 11, 2026 · 4 min read

AssemblyAI Universal-2 — Streaming STT for Voice Agents

AssemblyAI Universal-2 is a production STT model with sub-500ms streaming latency, 99 languages, speaker diarization, smart formatting, and an OpenAI-compatible audio endpoint.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Installation: Stage only · 15/100
Agent surface: Any MCP/CLI agent
Type: Knowledge
Trust: New
Entry point: Asset

Universal CLI command
npx tokrepo install 7b08a0b5-b5a1-4586-b32a-616f26d389ec
Introduction

Universal-2 is AssemblyAI's latest production STT model — sub-500ms streaming latency, 99 languages, automatic speaker diarization, smart formatting (currency, dates, addresses, profanity filter), and an OpenAI-compatible audio.transcriptions endpoint for drop-in migration. Best for: voice agents on calls, meeting transcription, accessibility captions, multilingual support flows. Works with: Python, Node, Go SDKs; REST; streaming WebSocket; OpenAI-compatible API. Setup time: 5 minutes.


Batch transcription (file)

import os
import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
    "meeting.mp3",
    config=aai.TranscriptionConfig(
        speaker_labels=True,
        language_detection=True,
        punctuate=True,
        format_text=True,
        speech_model=aai.SpeechModel.universal,   # Universal-2
    ),
)

for u in transcript.utterances:
    print(f"Speaker {u.speaker}: {u.text}")

Real-time streaming (WebSocket)

import os
import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

def on_data(transcript: aai.RealtimeTranscript):
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"FINAL: {transcript.text}")
    else:
        print(f"partial: {transcript.text}")

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda e: print(f"err: {e}"),
)
transcriber.connect()
transcriber.stream(mic_audio_iterator())   # bytes iterator
transcriber.close()
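
The mic_audio_iterator() above is a placeholder for any iterator that yields raw PCM audio bytes at the configured sample rate. One option is the SDK's optional extras package (pip install "assemblyai[extras]", which pulls in PyAudio); a minimal sketch under that assumption:

import assemblyai as aai

# MicrophoneStream reads 16 kHz PCM chunks from the default input device
# and can be passed straight to RealtimeTranscriber.stream().
def mic_audio_iterator():
    return aai.extras.MicrophoneStream(sample_rate=16_000)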

OpenAI-compatible (zero-code migration)

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.assemblyai.com/v1",
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
)
transcript = client.audio.transcriptions.create(
    model="universal-2",
    file=open("audio.mp3", "rb"),
    response_format="verbose_json",
)
print(transcript.text)

Feature flags worth knowing

Flag | What it does
speaker_labels | Diarize 2-10 speakers automatically
auto_chapters | Generate chapter summaries every ~5 min
entity_detection | Tag PII (person, org, location, card, phone)
pii_redaction | Replace detected PII with [REDACTED]
sentiment_analysis | Per-sentence sentiment scores
summarization | Auto-generate a transcript summary
language_detection | Detect the spoken language, no need to pre-specify
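
Most of these map to boolean fields on aai.TranscriptionConfig and can be combined in a single request. A minimal sketch with four of them (PII redaction takes additional policy parameters in the Python SDK, so it is left out here; check the SDK reference for the exact names):

import os
import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

# Several flags combined on one TranscriptionConfig.
config = aai.TranscriptionConfig(
    speaker_labels=True,      # who said what
    entity_detection=True,    # tag PII entities in the text
    sentiment_analysis=True,  # per-sentence sentiment
    auto_chapters=True,       # ~5-minute chapter summaries
)
transcript = aai.Transcriber().transcribe("meeting.mp3", config=config)

for chapter in transcript.chapters:
    print(chapter.headline)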

FAQ

Q: Universal-2 vs Whisper-large-v3? A: Universal-2 has better diarization, smart formatting, and per-language tuning — best for production English/Spanish calls. Whisper-large-v3 has broader low-resource language coverage and is open-weight. For voice agents and call centers, Universal-2 typically wins on word error rate and formatting.

Q: How accurate is the speaker diarization? A: On clean two-speaker call audio, ~95% accuracy. Drops to ~85-90% with 4+ speakers, overlapping speech, or heavy background noise. For high-stakes diarization (e.g., legal transcripts), add human-in-the-loop review of speaker-cluster boundaries.
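
One lightweight way to do that review is to flag utterances whose confidence score falls below a threshold; a minimal sketch, assuming a transcript produced with speaker_labels=True as in the batch example above (the 0.80 cutoff is an arbitrary assumption to tune per use case):

# Flag low-confidence utterances for manual review of speaker assignment.
REVIEW_THRESHOLD = 0.80  # assumed cutoff, not an official recommendation

for u in transcript.utterances:
    if u.confidence < REVIEW_THRESHOLD:
        print(f"[REVIEW] Speaker {u.speaker} ({u.confidence:.2f}): {u.text}")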

Q: Pricing? A: Streaming: $0.47/hr. Batch async: $0.37/hr (Universal-2 default). Plus add-ons per feature (speaker labels +$0.13/hr, summarization +$0.13/hr, etc). Free $50 trial credit. See assemblyai.com/pricing.
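
For example, at those rates a 10-hour batch of call recordings with speaker labels would come to roughly 10 × ($0.37 + $0.13) = $5.00, before any other add-ons.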


Quick Use

  1. pip install assemblyai
  2. aai.settings.api_key = ASSEMBLYAI_KEY
  3. aai.Transcriber().transcribe(file_or_url) for batch, RealtimeTranscriber for streaming

Source & Thanks

Built by AssemblyAI. API docs at assemblyai.com/docs.

AssemblyAI/assemblyai-python-sdk — official SDK

🙏
