Quick Use
- aai.TranscriptionConfig(speaker_labels=True, speakers_expected=N) enables diarization
- transcript.utterances returns per-utterance speaker tags
- For stereo with one speaker per channel, use dual_channel=True instead for ~99% accuracy
Intro
AssemblyAI's speaker_labels=True flag adds automatic speaker diarization: the transcript is split into utterances, each tagged Speaker A / Speaker B / Speaker C, with no enrollment or known-voice library required. It works on mono or stereo audio and reliably handles 2-10 speakers. Best for: meeting transcripts, podcast diarization, multi-party call analysis, witness interview indexing. Works with: any audio AssemblyAI can transcribe in batch (file URL or local upload); real-time streaming is the exception (see below). Setup time: 1 minute (just add the flag).
Basic diarization
import assemblyai as aai
aai.settings.api_key = ASSEMBLYAI_KEY
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=4,  # optional hint; helps when there's silence between speakers
)
transcript = aai.Transcriber(config=config).transcribe("meeting.mp3")

for u in transcript.utterances:
    print(f"{u.start//1000:>5}s Speaker {u.speaker}: {u.text}")
Output structure
    0s Speaker A: Welcome to the May product review.
    8s Speaker B: Thanks. Let me share my screen.
   14s Speaker A: Sure, go ahead.
   16s Speaker C: Before we start, can we agree on the agenda?
   22s Speaker B: Yeah, I want to cover Q2 launches, then open issues.
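Utterances also carry an end timestamp (milliseconds, like start), so per-speaker stats fall out of the same loop. A small sketch totaling talk time per speaker:

from collections import defaultdict

talk_ms: dict[str, int] = defaultdict(int)
for u in transcript.utterances:
    talk_ms[u.speaker] += u.end - u.start  # start/end are milliseconds

for spk, ms in sorted(talk_ms.items()):
    print(f"Speaker {spk}: {ms / 1000:.0f}s of talk time")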
Map Speaker letters to real names
After the first pass, the speaker labels are anonymous A/B/C. Map them to people by:
- Manual labeling — show a UI with 30-second clips per speaker, ask the user "Who is this?"
- Voice enrollment — for known recurring callers, compute embeddings once, match new transcripts. Use a separate library (pyannote, NVIDIA NeMo) since AssemblyAI doesn't expose embeddings.
- Context-based — feed first 60 seconds to Claude with attendee list: "Who is each speaker likely to be?"
def map_speakers(transcript, attendees: list[str]) -> dict[str, str]:
    sample = "\n".join(f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances[:8])
    prompt = f"Attendees: {', '.join(attendees)}.\nConversation start:\n{sample}\nReturn JSON: {{'A': name, 'B': name, ...}}"
    # ... call Claude with prompt ...
    return {"A": "Jane", "B": "Bob", "C": "Carlos"}
Tips for accuracy
- Higher SNR: clean mics improve diarization accuracy by 5-10 percentage points
- Avoid heavy overlap: overlapping speech is the hardest case; AssemblyAI handles 1-2s overlaps, but >3s degrades
- speakers_expected: if you know the count, pass it; the model uses it as a prior
- Stereo with per-channel speakers: set dual_channel=True instead; the channel becomes the speaker label and accuracy jumps to ~99% (see the sketch after this list)
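A minimal dual-channel sketch, assuming a stereo recording with one speaker per channel (the filename is illustrative):

config = aai.TranscriptionConfig(dual_channel=True)  # speaker_labels not needed here
transcript = aai.Transcriber(config=config).transcribe("stereo_call.wav")

for u in transcript.utterances:
    # with dual_channel, the channel stands in for the speaker label
    print(f"Speaker {u.speaker}: {u.text}")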
Real-time diarization?
Real-time WebSocket streaming does NOT include speaker labels as of 2026; only batch transcription does. If you need per-speaker attribution, record one mic per speaker onto separate stereo channels and batch-transcribe with dual_channel=True.
FAQ
Q: Does diarization work on phone calls?
A: Yes — 8kHz audio is supported. Quality drops slightly vs studio. For Twilio-recorded calls, set dual_channel=True if both legs are separate channels (caller on left, callee on right) — accuracy jumps to ~99%.
Q: How accurate is it with non-English audio?
A: Diarization is language-agnostic; it uses acoustic features, not words. It works equally well on French, Mandarin, or Arabic. WER for the underlying transcript varies by language, but speaker boundaries don't.
Q: Can I enroll specific known speakers?
A: Not directly via AssemblyAI. Workaround: run AssemblyAI to get anonymous labels, then use pyannote.audio (open source) to compute embeddings and match them against your enrolled voice library; see the sketch below. Combining both is what production call-analytics products typically do.
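A rough sketch of that workaround, assuming pyannote.audio with the pyannote/embedding model and a dict of pre-computed enrollment embeddings; all names here are illustrative:

import numpy as np
from pyannote.audio import Model, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine

model = Model.from_pretrained("pyannote/embedding")  # gated model; may need a Hugging Face token
embed = Inference(model, window="whole")

def identify_speakers(audio_path: str, transcript,
                      enrolled: dict[str, np.ndarray]) -> dict[str, str]:
    """Match each anonymous AssemblyAI label to the closest enrolled voice."""
    mapping = {}
    for letter in {u.speaker for u in transcript.utterances}:
        # use the first utterance by this speaker as the voice sample
        u = next(u for u in transcript.utterances if u.speaker == letter)
        vec = embed.crop(audio_path, Segment(u.start / 1000, u.end / 1000))
        mapping[letter] = min(enrolled, key=lambda name: cosine(vec, enrolled[name]))
    return mapping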
Source & Thanks
Built by AssemblyAI. Diarization docs at assemblyai.com/docs/speech-to-text/speaker-diarization.