Scripts · May 11, 2026 · 5 min read

AssemblyAI Diarization — Auto-Identify 2-10 Speakers

AssemblyAI speaker_labels separates 2-10 speakers without enrollment. Per-utterance speaker tags. For meetings, interviews, multi-party calls.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Needs Confirmation · 52/100 · Policy: confirm

Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: New
Entrypoint: Asset

Universal CLI install command:
npx tokrepo install 647a6e2e-a111-41c1-bfa4-229dc2be497d
Intro

AssemblyAI's speaker_labels=True flag adds automatic speaker diarization — the transcript splits into utterances, each tagged Speaker A / Speaker B / Speaker C, with no enrollment or known-voice library required. Works on mono or stereo audio, handling 2-10 speakers reliably. Best for: meeting transcripts, podcast diarization, multi-party call analysis, witness interview indexing. Works with: any audio AssemblyAI can transcribe in batch — file URL or upload (real-time WebSocket streaming does not support diarization; see below). Setup time: 1 minute (just add the flag).


Basic diarization

import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=4,    # optional hint; helps when there's silence between speakers
)
transcript = aai.Transcriber(config=config).transcribe("meeting.mp3")
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)

for u in transcript.utterances:
    print(f"{u.start // 1000:>5}s  Speaker {u.speaker}: {u.text}")

Output structure

   0s  Speaker A: Welcome to the May product review.
   8s  Speaker B: Thanks. Let me share my screen.
  14s  Speaker A: Sure, go ahead.
  16s  Speaker C: Before we start, can we agree on the agenda?
  22s  Speaker B: Yeah, I want to cover Q2 launches, then open issues.
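
Because each utterance carries start and end timestamps in milliseconds, per-speaker talk time falls out of a one-pass aggregation. A minimal sketch — the helper name is ours, not part of the SDK:

from collections import defaultdict

def talk_time_seconds(transcript) -> dict[str, float]:
    """Total speaking time per speaker, computed from utterance timestamps (ms)."""
    totals: dict[str, int] = defaultdict(int)
    for u in transcript.utterances:
        totals[u.speaker] += u.end - u.start
    return {speaker: ms / 1000 for speaker, ms in sorted(totals.items())}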

Map Speaker letters to real names

After the first pass, the speaker labels are anonymous A/B/C. Map them to people by:

  1. Manual labeling — show a UI with a 30-second clip per speaker and ask the user "Who is this?" (see the clip-extraction sketch after the code below)
  2. Voice enrollment — for known recurring callers, compute embeddings once and match new transcripts. Use a separate library (pyannote, NVIDIA NeMo), since AssemblyAI doesn't expose embeddings; a sketch appears in the FAQ below.
  3. Context-based — feed the first 60 seconds to Claude with the attendee list: "Who is each speaker likely to be?" For example:

import json, anthropic

def map_speakers(transcript, attendees: list[str]) -> dict[str, str]:
    sample = "\n".join(f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances[:8])
    prompt = (f"Attendees: {', '.join(attendees)}.\nConversation start:\n{sample}\n"
              'Return only a JSON object mapping letters to names, e.g. {"A": "Jane"}.')
    msg = anthropic.Anthropic().messages.create(  # reads ANTHROPIC_API_KEY from the environment
        model="claude-sonnet-4-5", max_tokens=200,  # model alias; substitute as needed
        messages=[{"role": "user", "content": prompt}])
    return json.loads(msg.content[0].text)
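
For option 1, the utterance timestamps make clip extraction straightforward. A minimal sketch, assuming pydub (and ffmpeg) are installed; file names are illustrative:

from pydub import AudioSegment  # pip install pydub; mp3 export needs ffmpeg

def export_speaker_clips(transcript, audio_path: str, clip_ms: int = 30_000) -> None:
    """Export the first (up to 30s) clip of each speaker for manual labeling."""
    audio = AudioSegment.from_file(audio_path)
    seen = set()
    for u in transcript.utterances:  # utterance start/end are in milliseconds
        if u.speaker not in seen:
            seen.add(u.speaker)
            audio[u.start:u.start + clip_ms].export(f"speaker_{u.speaker}.mp3", format="mp3")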

Tips for accuracy

  • Higher SNR — clean mics improve diarization 5-10 percentage points
  • Avoid heavy overlap — overlapping speech is the hardest case; AssemblyAI handles 1-2s overlaps but >3s degrades
  • speakers_expected — if you know the count, pass it; the model uses it as a prior
  • Stereo with per-channel speakers — set dual_channel=True instead; the channel becomes the speaker label and accuracy jumps to ~99% (see the sketch below)
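
A minimal sketch of the per-channel variant, assuming a stereo recording with one speaker per channel; the filename is a placeholder:

import assemblyai as aai

config = aai.TranscriptionConfig(dual_channel=True)  # channel, not voice, drives the label
transcript = aai.Transcriber(config=config).transcribe("stereo_call.wav")

for u in transcript.utterances:
    print(f"Channel {u.speaker}: {u.text}")  # speaker tag reflects the audio channel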

Real-time diarization?

As of 2026, real-time WebSocket streaming does not include speaker labels — only batch transcription does. For real-time speaker ID, use stereo channels (one mic per speaker) with dual_channel=True.


FAQ

Q: Does diarization work on phone calls? A: Yes — 8kHz telephone audio is supported, though quality drops slightly compared to studio recordings. For Twilio-recorded calls where the two legs land on separate channels (caller on left, callee on right), set dual_channel=True — accuracy jumps to ~99%.

Q: How accurate with non-English audio? A: Diarization is language-agnostic — it uses acoustic features, not words. Works equally well on French, Mandarin, Arabic. WER for the underlying transcript varies by language but speaker boundaries don't.

Q: Can I enroll specific known speakers? A: Not directly via AssemblyAI. Workaround: run AssemblyAI to get anonymous labels, then use pyannote.audio (open-source) to compute embeddings and match against your enrolled voice library. Combining both is what production call-analytics products typically do.
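
A minimal sketch of the pyannote half of that workaround, assuming access to the gated pyannote/embedding model on Hugging Face; names and the token are placeholders:

import numpy as np
from pyannote.audio import Inference, Model

model = Model.from_pretrained("pyannote/embedding", use_auth_token="hf_...")  # gated model
embed = Inference(model, window="whole")  # one embedding per whole file

def best_match(clip_wav: str, enrolled: dict[str, np.ndarray]) -> str:
    """Match a speaker clip against enrolled voice embeddings by cosine similarity."""
    e = embed(clip_wav)
    scores = {name: float(np.dot(e, ref) / (np.linalg.norm(e) * np.linalg.norm(ref)))
              for name, ref in enrolled.items()}
    return max(scores, key=scores.get)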


Quick Use

  1. aai.TranscriptionConfig(speaker_labels=True, speakers_expected=N)
  2. transcript.utterances carries a per-utterance speaker tag
  3. For stereo per-speaker, use dual_channel=True instead for ~99% accuracy

Source & Thanks

Built by AssemblyAI. Diarization docs at assemblyai.com/docs/speech-to-text/speaker-diarization.

AssemblyAI/assemblyai-python-sdk

🙏
