# AssemblyAI Diarization: Auto-Identify 2-10 Speakers

> AssemblyAI's `speaker_labels` option separates 2-10 speakers without enrollment and returns per-utterance speaker tags. Useful for meetings, interviews, and multi-party calls.

## Install

Run `pip install assemblyai`, then save the script below as a file and run it.

## Quick Use

1. `aai.TranscriptionConfig(speaker_labels=True, speakers_expected=N)`
2. `transcript.utterances` returns a per-utterance speaker tag
3. For stereo audio with one speaker per channel, use `dual_channel=True` instead (~99% accuracy)

---

## Intro

AssemblyAI's `speaker_labels=True` flag adds automatic speaker diarization: the transcript is split into utterances, each tagged Speaker A / Speaker B / Speaker C, with no enrollment or known-voice library required. It works on mono or stereo audio and handles 2-10 speakers reliably.

Best for: meeting transcripts, podcast diarization, multi-party call analysis, witness interview indexing.

Works with: any audio AssemblyAI can transcribe (file URL, upload, real-time WebSocket).

Setup time: about 1 minute (just add the flag).

---

### Basic diarization

```python
import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_KEY"]

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=4,  # optional hint; helps when there is silence between speakers
)

transcript = aai.Transcriber(config=config).transcribe("meeting.mp3")

for u in transcript.utterances:
    # u.start is in milliseconds
    print(f"{u.start // 1000:>5}s Speaker {u.speaker}: {u.text}")
```

### Output structure

```
    0s Speaker A: Welcome to the May product review.
    8s Speaker B: Thanks. Let me share my screen.
   14s Speaker A: Sure, go ahead.
   16s Speaker C: Before we start, can we agree on the agenda?
   22s Speaker B: Yeah, I want to cover Q2 launches, then open issues.
```

### Map speaker letters to real names

After the first pass, the speaker labels are anonymous (A/B/C). Map them to people by:

1. **Manual labeling**: show a UI with a 30-second clip per speaker and ask the user "Who is this?"
2. **Voice enrollment**: for known recurring callers, compute embeddings once, then match new transcripts against them.
   Use a separate library (pyannote.audio, NVIDIA NeMo) for this, since AssemblyAI does not expose embeddings.
3. **Context-based**: feed the first 60 seconds to Claude along with the attendee list and ask "Who is each speaker likely to be?"

```python
def map_speakers(transcript, attendees: list[str]) -> dict[str, str]:
    # Sample the opening of the conversation for context
    sample = "\n".join(
        f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances[:8]
    )
    prompt = (
        f"Attendees: {', '.join(attendees)}.\n"
        f"Conversation start:\n{sample}\n"
        "Return JSON: {'A': name, 'B': name, ...}"
    )
    # ... call Claude with the prompt and parse its JSON reply ...
    return {"A": "Jane", "B": "Bob", "C": "Carlos"}  # placeholder result
```

### Tips for accuracy

- **Higher SNR**: clean microphones improve diarization accuracy by 5-10 percentage points
- **Avoid heavy overlap**: overlapping speech is the hardest case; AssemblyAI handles 1-2 s overlaps, but beyond 3 s accuracy degrades
- **`speakers_expected`**: if you know the speaker count, pass it; the model uses it as a prior
- **Stereo with one speaker per channel**: set `dual_channel=True` instead; the channel becomes the speaker label and accuracy jumps to ~99%

### Real-time diarization?

Real-time WebSocket streaming does NOT include speaker labels as of 2026; only batch transcription does. For real-time speaker ID, use stereo channels (one mic per speaker) and `dual_channel=True`.

---

### FAQ

**Q: Does diarization work on phone calls?**
A: Yes, 8 kHz audio is supported, though quality drops slightly versus studio recordings. For Twilio-recorded calls, set `dual_channel=True` if the two legs are on separate channels (caller on left, callee on right); accuracy jumps to ~99%.

**Q: How accurate is it with non-English audio?**
A: Diarization is language-agnostic; it uses acoustic features, not words. It works equally well on French, Mandarin, and Arabic. WER for the underlying transcript varies by language, but speaker boundaries do not.

**Q: Can I enroll specific known speakers?**
A: Not directly via AssemblyAI. Workaround: run AssemblyAI to get anonymous labels, then use pyannote.audio (open source) to compute embeddings and match them against your enrolled voice library.
Combining both is what production call-analytics products typically do.

---

## Source & Thanks

> Built by [AssemblyAI](https://github.com/AssemblyAI). Diarization docs at [assemblyai.com/docs/speech-to-text/speaker-diarization](https://assemblyai.com/docs).
>
> [AssemblyAI/assemblyai-python-sdk](https://github.com/AssemblyAI/assemblyai-python-sdk)

---

Source: https://tokrepo.com/en/workflows/assemblyai-diarization-auto-identify-2-10-speakers
Author: AssemblyAI