Scripts · May 11, 2026 · 5 min read

AssemblyAI Diarization — Auto-Identify 2-10 Speakers

AssemblyAI speaker_labels separates 2-10 speakers without enrollment. Per-utterance speaker tags. For meetings, interviews, multi-party calls.

Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Needs Confirmation · 52/100 · Policy: confirm
Agent surface
Any MCP/CLI agent
Type
Skill
Installation
Single
Trust
Trust: New
Entry point
Asset
Universal CLI command
npx tokrepo install 647a6e2e-a111-41c1-bfa4-229dc2be497d
Introduction

AssemblyAI's speaker_labels=True flag adds automatic speaker diarization — the transcript splits into utterances, each tagged Speaker A / Speaker B / Speaker C, with no enrollment or known-voice library required. Works in mono or stereo audio, 2-10 speakers reliably. Best for: meeting transcripts, podcast diarization, multi-party call analysis, witness interview indexing. Works with: any audio AssemblyAI can transcribe — file URL, upload, real-time WebSocket. Setup time: 1 minute (just add the flag).


Basic diarization

import assemblyai as aai
aai.settings.api_key = ASSEMBLYAI_KEY

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=4,    # optional hint; helps when there's silence between speakers
)
transcript = aai.Transcriber(config=config).transcribe("meeting.mp3")

for u in transcript.utterances:
    print(f"{u.start//1000:>5}s  Speaker {u.speaker}: {u.text}")

Output structure

   0s  Speaker A: Welcome to the May product review.
   8s  Speaker B: Thanks. Let me share my screen.
  14s  Speaker A: Sure, go ahead.
  16s  Speaker C: Before we start, can we agree on the agenda?
  22s  Speaker B: Yeah, I want to cover Q2 launches, then open issues.
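A small helper can produce the timeline above plus per-speaker talk time. This is a sketch: `utterances` here is simplified to dicts with `start`/`end` in milliseconds and a `speaker` letter, mirroring the SDK's utterance fields.

```python
from collections import defaultdict

def format_timeline(utterances):
    """Render '   0s  Speaker A: ...' lines and total seconds spoken per speaker."""
    lines = []
    talk_time_ms = defaultdict(int)
    for u in utterances:
        lines.append(f"{u['start'] // 1000:>5}s  Speaker {u['speaker']}: {u['text']}")
        talk_time_ms[u["speaker"]] += u["end"] - u["start"]
    totals = {spk: ms // 1000 for spk, ms in talk_time_ms.items()}
    return lines, totals
```

The talk-time totals are a cheap by-product of diarization that meeting-analytics dashboards often surface ("who dominated the call").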

Map Speaker letters to real names

After the first pass, the speaker labels are anonymous A/B/C. Map them to people by:

  1. Manual labeling — show a UI with 30-second clips per speaker, ask the user "Who is this?"
  2. Voice enrollment — for known recurring callers, compute embeddings once, match new transcripts. Use a separate library (pyannote, NVIDIA NeMo) since AssemblyAI doesn't expose embeddings.
  3. Context-based — feed first 60 seconds to Claude with attendee list: "Who is each speaker likely to be?"
def map_speakers(transcript, attendees: list[str]) -> dict[str, str]:
    """Ask an LLM to guess which attendee each anonymous speaker is."""
    sample = "\n".join(f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances[:8])
    prompt = (
        f"Attendees: {', '.join(attendees)}.\n"
        f"Conversation start:\n{sample}\n"
        'Return JSON mapping letters to names, e.g. {"A": "Jane", "B": "Bob"}.'
    )
    # ... call Claude with prompt and parse the JSON reply ...
    return {"A": "Jane", "B": "Bob", "C": "Carlos"}  # placeholder result
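Whichever of the three approaches produces the mapping, applying it is a one-liner. A sketch, again using simplified dict utterances; unmapped letters fall back to the anonymous label:

```python
def relabel(utterances, mapping):
    """Swap anonymous speaker letters for real names where known."""
    return [
        {**u, "speaker": mapping.get(u["speaker"], f"Speaker {u['speaker']}")}
        for u in utterances
    ]
```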

Tips for accuracy

  • Higher SNR — clean, close mics improve diarization accuracy by 5-10 percentage points
  • Avoid heavy overlap — overlapping speech is the hardest case; AssemblyAI handles 1-2s overlaps but >3s degrades
  • speakers_expected — if you know the count, pass it; the model uses it as a prior
  • Stereo with per-channel speakers — set dual_channel=True instead; channel becomes the speaker label and accuracy jumps to ~99%
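The dual-channel swap from the last tip is a one-line config change. A sketch, assuming a valid API key and a stereo file `call.mp3` with one speaker per channel:

```python
import assemblyai as aai

aai.settings.api_key = ASSEMBLYAI_KEY  # as in the basic example above

# No speaker_labels needed: each stereo channel is treated as its own speaker.
config = aai.TranscriptionConfig(dual_channel=True)
transcript = aai.Transcriber(config=config).transcribe("call.mp3")

for u in transcript.utterances:
    print(f"{u.start//1000:>5}s  Speaker {u.speaker}: {u.text}")
```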

Real-time diarization?

Real-time WebSocket streaming does NOT include speaker labels in 2026 — only batch transcription does. For real-time speaker ID, use stereo channels (one mic per speaker) and dual_channel=True.


FAQ

Q: Does diarization work on phone calls? A: Yes — 8kHz audio is supported. Quality drops slightly vs studio. For Twilio-recorded calls, set dual_channel=True if both legs are separate channels (caller on left, callee on right) — accuracy jumps to ~99%.

Q: How accurate with non-English audio? A: Diarization is language-agnostic — it uses acoustic features, not words. Works equally well on French, Mandarin, Arabic. WER for the underlying transcript varies by language but speaker boundaries don't.

Q: Can I enroll specific known speakers? A: Not directly via AssemblyAI. Workaround: run AssemblyAI to get anonymous labels, then use pyannote.audio (open-source) to compute embeddings and match against your enrolled voice library. Combining both is what production call-analytics products typically do.
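The matching half of that workaround is plain vector math. A sketch: the embedding extraction itself (e.g. via pyannote.audio's pretrained embedding model) is assumed and not shown; `library` maps enrolled names to reference vectors, and the 0.6 threshold is an illustrative default you should tune on your own data.

```python
import numpy as np

def best_match(embedding, library, threshold=0.6):
    """Return the enrolled name most cosine-similar to `embedding`, or None.

    `embedding` and the values in `library` are 1-D speaker-embedding vectors
    (assumed to come from the same model, e.g. pyannote.audio's).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    name, score = max(
        ((n, cosine(embedding, ref)) for n, ref in library.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else None
```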


Quick Use

  1. aai.TranscriptionConfig(speaker_labels=True, speakers_expected=N)
  2. transcript.utterances returns per-utterance speaker tag
  3. For stereo per-speaker, use dual_channel=True instead for ~99% accuracy


Source & Thanks

Built by AssemblyAI. Diarization docs at assemblyai.com/docs/speech-to-text/speaker-diarization.

AssemblyAI/assemblyai-python-sdk
