Quick Use
pip install assemblyai
aai.settings.api_key = ASSEMBLYAI_KEY
aai.Transcriber().transcribe(file_or_url) for batch; RealtimeTranscriber for streaming
Intro
Universal-2 is AssemblyAI's latest production STT model — sub-500ms streaming latency, 99 languages, automatic speaker diarization, smart formatting (currency, dates, addresses, profanity filter), and an OpenAI-compatible audio.transcriptions endpoint for drop-in migration. Best for: voice agents on calls, meeting transcription, accessibility captions, multilingual support flows. Works with: Python, Node, Go SDKs; REST; streaming WebSocket; OpenAI-compatible API. Setup time: 5 minutes.
Batch transcription (file)
import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
    "meeting.mp3",
    config=aai.TranscriptionConfig(
        speaker_labels=True,
        language_detection=True,
        punctuate=True,
        format_text=True,
        speech_model=aai.SpeechModel.universal,  # Universal-2
    ),
)
for u in transcript.utterances:
    print(f"Speaker {u.speaker}: {u.text}")

Real-time streaming (WebSocket)
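The streaming example below passes a `mic_audio_iterator()` helper that is not defined in this snippet. A minimal stand-in sketch (the name, chunk size, and silence buffer are placeholders, assuming 16 kHz 16-bit mono PCM fed in ~100 ms chunks — a real app would read frames from a microphone capture library instead):

```python
def mic_audio_iterator(pcm: bytes = b"\x00" * 32_000, chunk_bytes: int = 3_200):
    """Yield successive ~100 ms chunks of 16 kHz, 16-bit mono PCM.

    `pcm` here is 1 second of silence standing in for live mic frames;
    3_200 bytes = 1_600 samples = 100 ms at 16 kHz, 2 bytes/sample.
    """
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

# 32_000 bytes of audio in 3_200-byte chunks -> 10 chunks
chunks = list(mic_audio_iterator())
```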
import assemblyai as aai
def on_data(transcript: aai.RealtimeTranscript):
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"FINAL: {transcript.text}")
    else:
        print(f"partial: {transcript.text}")
transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda e: print(f"err: {e}"),
)
transcriber.connect()
transcriber.stream(mic_audio_iterator()) # bytes iterator
transcriber.close()

OpenAI-compatible (zero-code migration)
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.assemblyai.com/v1",
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
)
transcript = client.audio.transcriptions.create(
    model="universal-2",
    file=open("audio.mp3", "rb"),
    response_format="verbose_json",
)
print(transcript.text)

Feature flags worth knowing
| Flag | What it does |
|---|---|
| `speaker_labels` | Diarize 2-10 speakers automatically |
| `auto_chapters` | Generate chapter summaries every ~5 min |
| `entity_detection` | Tag PII (person, org, location, card, phone) |
| `pii_redaction` | Replace detected PII with `[REDACTED]` |
| `sentiment_analysis` | Per-sentence sentiment scores |
| `summarization` | Auto-generate transcript summary |
| `language_detection` | Detect spoken language, no need to pre-specify |
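Several of these flags can be combined in a single request. A hedged sketch of a JSON body for a batch request using field names taken from the table above (the exact parameter names and shapes vary by SDK/API version, and `audio_url` is a placeholder — verify against the current API reference before relying on this):

```python
import json

# Combine several add-ons from the table above in one request body.
# Field names mirror the table; confirm against the live API docs.
body = {
    "audio_url": "https://example.com/meeting.mp3",  # placeholder URL
    "speaker_labels": True,        # diarization
    "entity_detection": True,      # tag PII entities
    "sentiment_analysis": True,    # per-sentence sentiment
    "summarization": True,         # auto-generated summary
    "language_detection": True,    # no need to pre-specify language
}
payload = json.dumps(body)
```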
FAQ
Q: Universal-2 vs Whisper-large-v3? A: Universal-2 has better diarization, smart formatting, and per-language tuning — best for production English/Spanish calls. Whisper-large-v3 has broader low-resource language coverage and is open-weight. For voice agents and call centers, Universal-2 typically wins on word error rate and formatting.
Q: How accurate is the speaker diarization? A: On clean two-speaker call audio, ~95% accuracy. Drops to ~85-90% with 4+ speakers, overlapping speech, or heavy background noise. For high-stakes diarization (e.g. legal transcripts), add a human-in-the-loop review of speaker cluster boundaries.
Q: Pricing? A: Streaming: $0.47/hr. Batch async: $0.37/hr (Universal-2 default). Plus add-ons per feature (speaker labels +$0.13/hr, summarization +$0.13/hr, etc). Free $50 trial credit. See assemblyai.com/pricing.
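At those rates, cost is straightforward to estimate; a back-of-envelope helper using the per-hour prices quoted above (the function is purely illustrative — check assemblyai.com/pricing for current rates):

```python
def batch_cost(hours: float, rate_per_hr: float = 0.37,
               addons_per_hr: float = 0.0) -> float:
    """Estimated batch transcription cost in USD.

    rate_per_hr defaults to the Universal-2 batch rate quoted above;
    addons_per_hr is the sum of per-feature add-on rates.
    """
    return round(hours * (rate_per_hr + addons_per_hr), 2)

# 100 hours of batch audio with speaker labels (+$0.13/hr):
cost = batch_cost(100, addons_per_hr=0.13)
# 100 * (0.37 + 0.13) = $50.00
```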
Source & Thanks
Built by AssemblyAI. API docs at assemblyai.com/docs.
AssemblyAI/assemblyai-python-sdk — official SDK