Quick Use
- aai.TranscriptionConfig(speaker_labels=True, speakers_expected=N) enables diarization
- transcript.utterances returns per-utterance speaker tags
- For stereo with one speaker per channel, use dual_channel=True instead for ~99% accuracy
Intro
AssemblyAI's speaker_labels=True flag adds automatic speaker diarization: the transcript is split into utterances, each tagged Speaker A / Speaker B / Speaker C, with no enrollment or known-voice library required. It works on mono or stereo audio and reliably handles 2-10 speakers. Best for: meeting transcripts, podcast diarization, multi-party call analysis, witness interview indexing. Works with: any audio AssemblyAI can transcribe in batch (file URL or local upload); real-time streaming is the exception (see below). Setup time: 1 minute (just add the flag).
Basic diarization
import assemblyai as aai
aai.settings.api_key = ASSEMBLYAI_KEY
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=4,  # optional hint; helps when there's silence between speakers
)
transcript = aai.Transcriber(config=config).transcribe("meeting.mp3")

for u in transcript.utterances:
    print(f"{u.start//1000:>5}s Speaker {u.speaker}: {u.text}")
Output structure
    0s Speaker A: Welcome to the May product review.
    8s Speaker B: Thanks. Let me share my screen.
   14s Speaker A: Sure, go ahead.
   16s Speaker C: Before we start, can we agree on the agenda?
   22s Speaker B: Yeah, I want to cover Q2 launches, then open issues.
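Utterances also carry an end timestamp (milliseconds, like start), so per-speaker stats fall out of the same loop. A small sketch totaling talk time per speaker:

from collections import defaultdict

talk_ms: dict[str, int] = defaultdict(int)
for u in transcript.utterances:
    talk_ms[u.speaker] += u.end - u.start  # start/end are milliseconds

for spk, ms in sorted(talk_ms.items()):
    print(f"Speaker {spk}: {ms / 1000:.0f}s of talk time")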
Map Speaker letters to real names
After the first pass, the speaker labels are anonymous A/B/C. Map them to people by:
- Manual labeling — show a UI with 30-second clips per speaker, ask the user "Who is this?"
- Voice enrollment — for known recurring callers, compute embeddings once, match new transcripts. Use a separate library (pyannote, NVIDIA NeMo) since AssemblyAI doesn't expose embeddings.
- Context-based — feed first 60 seconds to Claude with attendee list: "Who is each speaker likely to be?"
def map_speakers(transcript, attendees: list[str]) -> dict[str, str]:
    sample = "\n".join(f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances[:8])
    prompt = f"Attendees: {', '.join(attendees)}.\nConversation start:\n{sample}\nReturn JSON: {{'A': name, 'B': name, ...}}"
    # ... call Claude with prompt ...
    return {"A": "Jane", "B": "Bob", "C": "Carlos"}
Tips for accuracy
- Higher SNR: clean mics improve diarization accuracy by 5-10 percentage points
- Avoid heavy overlap: overlapping speech is the hardest case; AssemblyAI handles 1-2s overlaps, but >3s degrades
- speakers_expected: if you know the count, pass it; the model uses it as a prior
- Stereo with per-channel speakers: set dual_channel=True instead; the channel becomes the speaker label and accuracy jumps to ~99% (see the sketch after this list)
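A minimal dual-channel sketch, assuming a stereo recording with one speaker per channel (the filename is illustrative):

config = aai.TranscriptionConfig(dual_channel=True)  # speaker_labels not needed here
transcript = aai.Transcriber(config=config).transcribe("stereo_call.wav")

for u in transcript.utterances:
    # with dual_channel, the channel stands in for the speaker label
    print(f"Speaker {u.speaker}: {u.text}")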
Real-time diarization?
Real-time WebSocket streaming does NOT include speaker labels as of 2026; only batch transcription does. If you need per-speaker attribution, record one mic per speaker onto separate stereo channels and batch-transcribe with dual_channel=True.
FAQ
Q: Does diarization work on phone calls?
A: Yes — 8kHz audio is supported. Quality drops slightly vs studio. For Twilio-recorded calls, set dual_channel=True if both legs are separate channels (caller on left, callee on right) — accuracy jumps to ~99%.
Q: How accurate is it with non-English audio?
A: Diarization is language-agnostic; it uses acoustic features, not words. It works equally well on French, Mandarin, or Arabic. WER for the underlying transcript varies by language, but speaker boundaries don't.
Q: Can I enroll specific known speakers?
A: Not directly via AssemblyAI. Workaround: run AssemblyAI to get anonymous labels, then use pyannote.audio (open source) to compute embeddings and match them against your enrolled voice library; see the sketch below. Combining both is what production call-analytics products typically do.
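A rough sketch of that workaround, assuming pyannote.audio with the pyannote/embedding model and a dict of pre-computed enrollment embeddings; all names here are illustrative:

import numpy as np
from pyannote.audio import Model, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine

model = Model.from_pretrained("pyannote/embedding")  # gated model; may need a Hugging Face token
embed = Inference(model, window="whole")

def identify_speakers(audio_path: str, transcript,
                      enrolled: dict[str, np.ndarray]) -> dict[str, str]:
    """Match each anonymous AssemblyAI label to the closest enrolled voice."""
    mapping = {}
    for letter in {u.speaker for u in transcript.utterances}:
        # use the first utterance by this speaker as the voice sample
        u = next(u for u in transcript.utterances if u.speaker == letter)
        vec = embed.crop(audio_path, Segment(u.start / 1000, u.end / 1000))
        mapping[letter] = min(enrolled, key=lambda name: cosine(vec, enrolled[name]))
    return mapping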
Source & Thanks
Built by AssemblyAI. Diarization docs at assemblyai.com/docs/speech-to-text/speaker-diarization.