# AssemblyAI Diarization: Auto-Identify 2-10 Speakers

> AssemblyAI's `speaker_labels` option separates 2-10 speakers without enrollment and returns per-utterance speaker tags. Useful for meetings, interviews, and multi-party calls.

## Install

Run `pip install assemblyai`, then save the script below as a file and run it.

## Quick Use

1. `aai.TranscriptionConfig(speaker_labels=True, speakers_expected=N)`
2. `transcript.utterances` returns a per-utterance speaker tag
3. For stereo audio with one speaker per channel, use `dual_channel=True` instead (~99% accuracy)

---

## Intro

AssemblyAI's `speaker_labels=True` flag adds automatic speaker diarization: the transcript is split into utterances, each tagged Speaker A / Speaker B / Speaker C, with no enrollment or known-voice library required. It works on mono or stereo audio and handles 2-10 speakers reliably.

Best for: meeting transcripts, podcast diarization, multi-party call analysis, witness interview indexing.

Works with: any audio AssemblyAI can transcribe (file URL, upload, real-time WebSocket).

Setup time: about 1 minute (just add the flag).

---

### Basic diarization

```python
import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_KEY"]

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=4,  # optional hint; helps when there is silence between speakers
)

transcript = aai.Transcriber(config=config).transcribe("meeting.mp3")

for u in transcript.utterances:
    # u.start is in milliseconds
    print(f"{u.start // 1000:>5}s Speaker {u.speaker}: {u.text}")
```

### Output structure

```
    0s Speaker A: Welcome to the May product review.
    8s Speaker B: Thanks. Let me share my screen.
   14s Speaker A: Sure, go ahead.
   16s Speaker C: Before we start, can we agree on the agenda?
   22s Speaker B: Yeah, I want to cover Q2 launches, then open issues.
```

### Map speaker letters to real names

After the first pass, the speaker labels are anonymous (A/B/C). Map them to people by:

1. **Manual labeling**: show a UI with a 30-second clip per speaker and ask the user "Who is this?"
2. **Voice enrollment**: for known recurring callers, compute embeddings once, then match new transcripts against them.
   Use a separate library (pyannote.audio, NVIDIA NeMo) for this, since AssemblyAI does not expose embeddings.
3. **Context-based**: feed the first 60 seconds to Claude along with the attendee list and ask "Who is each speaker likely to be?"

```python
def map_speakers(transcript, attendees: list[str]) -> dict[str, str]:
    # Sample the opening of the conversation for context
    sample = "\n".join(
        f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances[:8]
    )
    prompt = (
        f"Attendees: {', '.join(attendees)}.\n"
        f"Conversation start:\n{sample}\n"
        "Return JSON: {'A': name, 'B': name, ...}"
    )
    # ... call Claude with the prompt and parse its JSON reply ...
    return {"A": "Jane", "B": "Bob", "C": "Carlos"}  # placeholder result
```

### Tips for accuracy

- **Higher SNR**: clean microphones improve diarization accuracy by 5-10 percentage points
- **Avoid heavy overlap**: overlapping speech is the hardest case; AssemblyAI handles 1-2 s overlaps, but beyond 3 s accuracy degrades
- **`speakers_expected`**: if you know the speaker count, pass it; the model uses it as a prior
- **Stereo with one speaker per channel**: set `dual_channel=True` instead; the channel becomes the speaker label and accuracy jumps to ~99%

### Real-time diarization?

Real-time WebSocket streaming does NOT include speaker labels as of 2026; only batch transcription does. For real-time speaker ID, use stereo channels (one mic per speaker) and `dual_channel=True`.

---

### FAQ

**Q: Does diarization work on phone calls?**
A: Yes, 8 kHz audio is supported, though quality drops slightly versus studio recordings. For Twilio-recorded calls, set `dual_channel=True` if the two legs are on separate channels (caller on left, callee on right); accuracy jumps to ~99%.

**Q: How accurate is it with non-English audio?**
A: Diarization is language-agnostic; it uses acoustic features, not words. It works equally well on French, Mandarin, and Arabic. WER for the underlying transcript varies by language, but speaker boundaries do not.

**Q: Can I enroll specific known speakers?**
A: Not directly via AssemblyAI. Workaround: run AssemblyAI to get anonymous labels, then use pyannote.audio (open source) to compute embeddings and match them against your enrolled voice library.
Combining both is what production call-analytics products typically do.

---

## Source & Thanks

> Built by [AssemblyAI](https://github.com/AssemblyAI). Diarization docs at [assemblyai.com/docs/speech-to-text/speaker-diarization](https://assemblyai.com/docs).
>
> [AssemblyAI/assemblyai-python-sdk](https://github.com/AssemblyAI/assemblyai-python-sdk)

---

Source: https://tokrepo.com/en/workflows/assemblyai-diarization-auto-identify-2-10-speakers
Author: AssemblyAI