Scripts · May 11, 2026 · 5 min read

AssemblyAI Diarization — Auto-Identify 2-10 Speakers

AssemblyAI speaker_labels separates 2-10 speakers without enrollment. Per-utterance speaker tags. For meetings, interviews, multi-party calls.

Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Needs Confirmation · 52/100 · Policy: confirm
Agent surface
Any MCP/CLI agent
Type
Skill
Installation
Single
Trust
Trust: New
Entry point
Asset
Universal CLI command
npx tokrepo install 647a6e2e-a111-41c1-bfa4-229dc2be497d
Introduction

AssemblyAI's speaker_labels=True flag adds automatic speaker diarization — the transcript splits into utterances, each tagged Speaker A / Speaker B / Speaker C, with no enrollment or known-voice library required. Works in mono or stereo audio, 2-10 speakers reliably. Best for: meeting transcripts, podcast diarization, multi-party call analysis, witness interview indexing. Works with: any audio AssemblyAI can transcribe — file URL, upload, real-time WebSocket. Setup time: 1 minute (just add the flag).


Basic diarization

import assemblyai as aai
aai.settings.api_key = ASSEMBLYAI_KEY

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=4,    # optional hint; helps when there's silence between speakers
)
transcript = aai.Transcriber(config=config).transcribe("meeting.mp3")

for u in transcript.utterances:
    print(f"{u.start//1000:>5}s  Speaker {u.speaker}: {u.text}")

Output structure

   0s  Speaker A: Welcome to the May product review.
   8s  Speaker B: Thanks. Let me share my screen.
  14s  Speaker A: Sure, go ahead.
  16s  Speaker C: Before we start, can we agree on the agenda?
  22s  Speaker B: Yeah, I want to cover Q2 launches, then open issues.
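A small helper can produce the timeline above plus per-speaker talk time. This is a sketch: `utterances` here is simplified to dicts with `start`/`end` in milliseconds and a `speaker` letter, mirroring the SDK's utterance fields.

```python
from collections import defaultdict

def format_timeline(utterances):
    """Render '   0s  Speaker A: ...' lines and total seconds spoken per speaker."""
    lines = []
    talk_time_ms = defaultdict(int)
    for u in utterances:
        lines.append(f"{u['start'] // 1000:>5}s  Speaker {u['speaker']}: {u['text']}")
        talk_time_ms[u["speaker"]] += u["end"] - u["start"]
    totals = {spk: ms // 1000 for spk, ms in talk_time_ms.items()}
    return lines, totals
```

The talk-time totals are a cheap by-product of diarization that meeting-analytics dashboards often surface ("who dominated the call").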

Map Speaker letters to real names

After the first pass, the speaker labels are anonymous A/B/C. Map them to people by:

  1. Manual labeling — show a UI with 30-second clips per speaker, ask the user "Who is this?"
  2. Voice enrollment — for known recurring callers, compute embeddings once, match new transcripts. Use a separate library (pyannote, NVIDIA NeMo) since AssemblyAI doesn't expose embeddings.
  3. Context-based — feed first 60 seconds to Claude with attendee list: "Who is each speaker likely to be?"
def map_speakers(transcript, attendees: list[str]) -> dict[str, str]:
    """Ask an LLM to guess which attendee each anonymous speaker is."""
    sample = "\n".join(f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances[:8])
    prompt = (
        f"Attendees: {', '.join(attendees)}.\n"
        f"Conversation start:\n{sample}\n"
        'Return JSON mapping letters to names, e.g. {"A": "Jane", "B": "Bob"}.'
    )
    # ... call Claude with prompt and parse the JSON reply ...
    return {"A": "Jane", "B": "Bob", "C": "Carlos"}  # placeholder result
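Whichever of the three approaches produces the mapping, applying it is a one-liner. A sketch, again using simplified dict utterances; unmapped letters fall back to the anonymous label:

```python
def relabel(utterances, mapping):
    """Swap anonymous speaker letters for real names where known."""
    return [
        {**u, "speaker": mapping.get(u["speaker"], f"Speaker {u['speaker']}")}
        for u in utterances
    ]
```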

Tips for accuracy

  • Higher SNR — clean, close mics improve diarization accuracy by 5-10 percentage points
  • Avoid heavy overlap — overlapping speech is the hardest case; AssemblyAI handles 1-2s overlaps but >3s degrades
  • speakers_expected — if you know the count, pass it; the model uses it as a prior
  • Stereo with per-channel speakers — set dual_channel=True instead; channel becomes the speaker label and accuracy jumps to ~99%
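The dual-channel swap from the last tip is a one-line config change. A sketch, assuming a valid API key and a stereo file `call.mp3` with one speaker per channel:

```python
import assemblyai as aai

aai.settings.api_key = ASSEMBLYAI_KEY  # as in the basic example above

# No speaker_labels needed: each stereo channel is treated as its own speaker.
config = aai.TranscriptionConfig(dual_channel=True)
transcript = aai.Transcriber(config=config).transcribe("call.mp3")

for u in transcript.utterances:
    print(f"{u.start//1000:>5}s  Speaker {u.speaker}: {u.text}")
```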

Real-time diarization?

Real-time WebSocket streaming does NOT include speaker labels in 2026 — only batch transcription does. For real-time speaker ID, use stereo channels (one mic per speaker) and dual_channel=True.


FAQ

Q: Does diarization work on phone calls? A: Yes — 8kHz audio is supported. Quality drops slightly vs studio. For Twilio-recorded calls, set dual_channel=True if both legs are separate channels (caller on left, callee on right) — accuracy jumps to ~99%.

Q: How accurate with non-English audio? A: Diarization is language-agnostic — it uses acoustic features, not words. Works equally well on French, Mandarin, Arabic. WER for the underlying transcript varies by language but speaker boundaries don't.

Q: Can I enroll specific known speakers? A: Not directly via AssemblyAI. Workaround: run AssemblyAI to get anonymous labels, then use pyannote.audio (open-source) to compute embeddings and match against your enrolled voice library. Combining both is what production call-analytics products typically do.
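The matching half of that workaround is plain vector math. A sketch: the embedding extraction itself (e.g. via pyannote.audio's pretrained embedding model) is assumed and not shown; `library` maps enrolled names to reference vectors, and the 0.6 threshold is an illustrative default you should tune on your own data.

```python
import numpy as np

def best_match(embedding, library, threshold=0.6):
    """Return the enrolled name most cosine-similar to `embedding`, or None.

    `embedding` and the values in `library` are 1-D speaker-embedding vectors
    (assumed to come from the same model, e.g. pyannote.audio's).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    name, score = max(
        ((n, cosine(embedding, ref)) for n, ref in library.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else None
```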


Quick Use

  1. aai.TranscriptionConfig(speaker_labels=True, speakers_expected=N)
  2. transcript.utterances returns per-utterance speaker tag
  3. For stereo per-speaker, use dual_channel=True instead for ~99% accuracy


Source & Thanks

Built by AssemblyAI. Diarization docs at assemblyai.com/docs/speech-to-text/speaker-diarization.

AssemblyAI/assemblyai-python-sdk
