# Presidio — Detect and Anonymize PII

> Detect and anonymize PII in text with Microsoft Presidio, then pass the sanitized text to LLMs to reduce leakage risk. Install it with pip or run it as Docker services.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

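1. Install the packages and, if it is not already present, the default English spaCy model:
   ```bash
   pip install presidio_analyzer presidio_anonymizer
   python -m spacy download en_core_web_lg
   ```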
2. Check that both packages import:
   ```bash
   python -c "import presidio_analyzer, presidio_anonymizer; print('presidio ok')"
   ```
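3. Verify:
   - Run the analyzer on a sample string containing an email address and a phone number, pass the results to the anonymizer, and confirm each detection is replaced (by default with a placeholder such as `<EMAIL_ADDRESS>`).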
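
The verification step can be sketched in Python as follows. `AnalyzerEngine.analyze` and `AnonymizerEngine.anonymize` are Presidio's documented entry points; the helper names `build_engines` and `sanitize` and the sample string are illustrative assumptions, not part of the original instructions.

```python
def build_engines():
    """Create default Presidio engines.

    Imports are deferred so this module still loads on machines
    where Presidio is not installed yet.
    """
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    return AnalyzerEngine(), AnonymizerEngine()


def sanitize(text, analyzer, anonymizer, language="en"):
    """Detect PII in `text` and return the anonymized string."""
    results = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=results).text


# Usage (needs Presidio and a spaCy model installed):
#   analyzer, anonymizer = build_engines()
#   print(sanitize("Email jane@example.com or call 212-555-0199",
#                  analyzer, anonymizer))
```

With the default operators, detected spans should come back as entity-type placeholders rather than the original values. Passing the engines in as arguments keeps the helper easy to test with stand-ins.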


---

## Intro

- **Best for:** LLM apps handling customer data that need PII de-identification before prompts, logs, or embeddings
- **Works with:** Python, text pipelines, pre-processing for prompts/logging/indexing; optional Docker services
- **Setup time:** about 18 minutes
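
For the optional Docker route, the analyzer and anonymizer run as separate HTTP services that expose `/analyze` and `/anonymize`. The sketch below chains the two with the standard library only. The `localhost` ports are assumptions that depend on how the containers were started, and the helper names are hypothetical.

```python
import json
from urllib import request

# Assumed local port mappings; adjust to match your `docker run -p` flags.
ANALYZER_URL = "http://localhost:5002/analyze"
ANONYMIZER_URL = "http://localhost:5001/anonymize"
# Fields forwarded from each analyzer finding to the anonymizer.
FINDING_KEYS = ("entity_type", "start", "end", "score")


def build_json_request(url, payload):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return request.Request(url, data=body, headers=headers, method="POST")


def post_json(url, payload, timeout=10):
    """Send the request and decode the JSON response."""
    with request.urlopen(build_json_request(url, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def sanitize_via_services(text, language="en"):
    """Analyze, then anonymize, using the two HTTP services."""
    findings = post_json(ANALYZER_URL, {"text": text, "language": language})
    slim = [{k: f[k] for k in FINDING_KEYS} for f in findings]
    result = post_json(ANONYMIZER_URL, {"text": text, "analyzer_results": slim})
    return result["text"]
```

Running the engines as services keeps the NLP model out of the application process, at the cost of two network hops per text.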


### Quantitative Notes

- Setup takes about 18 minutes: pip install plus a one-time download of an NLP model if it is not already present (the default English model is spaCy's `en_core_web_lg`)
- GitHub stars + forks (verified): see Source & Thanks
- Common pattern: sanitize at three enforcement points (inputs, outputs, and logs)
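
The three enforcement points can be sketched as a thin wrapper around an LLM call. Everything here is an illustrative assumption: `call_llm` stands for whatever client function you use, and `sanitize` is any callable that maps text to anonymized text (for example, one backed by Presidio).

```python
import logging

logger = logging.getLogger("llm_gateway")


def guarded_completion(prompt, call_llm, sanitize):
    """Call an LLM with sanitization applied to input, output, and logs."""
    safe_prompt = sanitize(prompt)      # 1. input: the provider never sees raw PII
    raw_reply = call_llm(safe_prompt)
    safe_reply = sanitize(raw_reply)    # 2. output: catch PII the model produced
    # 3. logs: only sanitized text is ever written
    logger.info("prompt=%s reply=%s", safe_prompt, safe_reply)
    return safe_reply
```

Routing every call through one function makes it harder for a new code path to skip one of the three points.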


---

## Practical Notes

For production, treat PII sanitization as a policy: define what counts as PII for your domain, add allowlists for non-sensitive identifiers, and write regression tests with realistic examples. Use Presidio as a pre-processor before prompts and embeddings, and consider sanitizing outputs as well, since a model can echo secrets that users paste in.
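
One way to make such a policy concrete is to keep it as data and check it with regression cases. The sketch below assumes a recent Presidio release in which `AnalyzerEngine.analyze` accepts `entities`, `allow_list`, and `score_threshold`. The policy values, sample cases, and helper names are hypothetical.

```python
# Hypothetical policy; the entity names are Presidio built-ins.
POLICY = {
    "entities": ["EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD"],
    "allow_list": ["support@example.com"],  # non-sensitive identifiers
    "score_threshold": 0.4,
}

# (text, entity types the policy is expected to flag)
REGRESSION_CASES = [
    ("Reach me at jane.doe@example.org", {"EMAIL_ADDRESS"}),
    ("Write to support@example.com", set()),
]


def presidio_detector(policy):
    """Return a detect(text) function backed by Presidio (deferred import)."""
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()

    def detect(text):
        results = analyzer.analyze(
            text=text,
            language="en",
            entities=policy["entities"],
            allow_list=policy["allow_list"],
            score_threshold=policy["score_threshold"],
        )
        return {r.entity_type for r in results}

    return detect


def failing_cases(detect, cases):
    """Return (text, expected, actual) for every case that does not match."""
    failures = []
    for text, expected in cases:
        actual = detect(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures
```

In a test suite, `failing_cases(presidio_detector(POLICY), REGRESSION_CASES)` would be asserted to be empty, so a policy or library change that alters detections shows up as a failing test.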

**Safety note:** PII detection is probabilistic, so expect both missed entities and false positives. Combine rules, tests, and human review for high-stakes data flows.
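
Because each detection carries a confidence score, one possible way to combine automation with human review is to triage findings by score. The `Finding` class below is a stand-in that mirrors the `entity_type`, `start`, `end`, and `score` attributes of Presidio's analyzer results; the two thresholds are hypothetical and would need tuning on your own data.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Stand-in for an analyzer result (same attribute names)."""
    entity_type: str
    start: int
    end: int
    score: float


def triage(findings, auto_threshold=0.85, review_threshold=0.4):
    """Split findings into auto-redact and human-review buckets.

    Findings below `review_threshold` are dropped as likely noise.
    """
    auto, review = [], []
    for f in findings:
        if f.score >= auto_threshold:
            auto.append(f)
        elif f.score >= review_threshold:
            review.append(f)
    return auto, review
```

The function only reads attributes, so real analyzer results can be passed in directly instead of `Finding` objects.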

### FAQ

**Q: Why use it with LLMs?**
A: It reduces the chance of leaking personal data to model providers, logs, or downstream tools.

**Q: Is it only for text?**
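A: No. Text is the core use case, but the repository also includes an image redaction package (`presidio-image-redactor`). Check the official docs for the currently supported modalities and deployment options.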

**Q: Where should I integrate it?**
A: Integrate it in your request middleware, and also sanitize transcripts before storing or embedding them.
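
As a minimal sketch of the transcript step, the helper below sanitizes every message before it is stored or embedded. The message shape (dicts with a `content` key) and the function name are assumptions; `sanitize` is any text-to-text callable, such as one backed by Presidio.

```python
def sanitize_transcript(messages, sanitize):
    """Return a sanitized copy of a chat transcript.

    The input list and its dicts are left unmodified, so the raw
    transcript can be discarded explicitly by the caller.
    """
    return [{**m, "content": sanitize(m["content"])} for m in messages]
```

Only the returned copy should be written to storage or sent to an embedding model.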

---

## Source & Thanks

> GitHub: https://github.com/microsoft/presidio
> Owner avatar: https://avatars.githubusercontent.com/u/6154722?v=4
> License (SPDX): MIT
> GitHub stars (verified via `api.github.com/repos/microsoft/presidio`): 8,019
> GitHub forks (verified via `api.github.com/repos/microsoft/presidio`): 1,041


---

Source: https://tokrepo.com/en/workflows/presidio-detect-and-anonymize-pii
Author: Script Depot