Practical Notes
For production, treat PII sanitization as a policy: define what counts as PII in your domain, add allowlists for non-sensitive identifiers, and write regression tests with realistic examples. Use Presidio as a pre-processor before text reaches prompts and embeddings, and consider sanitizing model outputs as well, since users sometimes paste secrets that the model can echo back.
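The policy above can be sketched in Python as a sanitizer plus a regression check. This is a minimal sketch, not a definitive setup: it assumes the presidio-analyzer and presidio-anonymizer packages, a Presidio version recent enough for analyze() to accept allow_list, and an installed spaCy English model. ALLOW_LIST, REGRESSION_CASES, build_sanitizer, and leaked_values are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical allowlist: identifiers this domain treats as non-sensitive.
ALLOW_LIST = ["support@example.com"]

# Made-up regression cases: (input text, substrings that must not survive).
REGRESSION_CASES = [
    ("Call Jane Doe at 212-555-0134.", ["Jane Doe", "212-555-0134"]),
    ("Send the invoice to jane.doe@example.org.", ["jane.doe@example.org"]),
]


def build_sanitizer(allow_list: Sequence[str] = ALLOW_LIST) -> Callable[[str], str]:
    """Return a text -> sanitized-text function backed by Presidio."""
    # Imported lazily so the regression helper below works without Presidio.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()  # default configuration; needs a spaCy model
    anonymizer = AnonymizerEngine()

    def sanitize(text: str) -> str:
        findings = analyzer.analyze(
            text=text, language="en", allow_list=list(allow_list)
        )
        # The default operator replaces each finding with a placeholder
        # naming its entity type.
        return anonymizer.anonymize(text=text, analyzer_results=findings).text

    return sanitize


def leaked_values(
    sanitize: Callable[[str], str],
    cases: Iterable[Tuple[str, Sequence[str]]],
) -> List[Tuple[str, str]]:
    """Return (input, secret) pairs where the secret survived sanitization."""
    leaks = []
    for text, secrets in cases:
        cleaned = sanitize(text)
        leaks.extend((text, secret) for secret in secrets if secret in cleaned)
    return leaks
```

In a test suite, a check such as `assert leaked_values(build_sanitizer(), REGRESSION_CASES) == []` would fail whenever a recognizer or configuration change lets a known example through. Whether the default recognizers catch every case listed here is not guaranteed, which is the reason to keep such tests.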
Safety note: PII detection is probabilistic, so combine rules, tests, and human review for high-stakes data flows.
FAQ
Q: Why use it with LLMs? A: It reduces the chance of leaking personal data to model providers, logs, or downstream tools.
Q: Is it only for text? A: No. Text is the main focus, but Presidio also provides an image redaction module (presidio-image-redactor); follow the docs for the currently supported modalities and deployment options.
Q: Where should I integrate it? A: Integrate it in your request middleware, and also sanitize transcripts before storing or embedding them.
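As a rough illustration of the last answer, the sketch below wraps a model call so that only sanitized text reaches the provider and the transcript store. It is framework-agnostic and makes assumptions: make_guarded_handler, call_model, and store are hypothetical names, and sanitize stands for any text -> text function, for example one built on Presidio.

```python
from typing import Callable

Sanitizer = Callable[[str], str]


def make_guarded_handler(
    sanitize: Sanitizer,
    call_model: Callable[[str], str],
    store: Callable[[str], None],
) -> Callable[[str], str]:
    """Wrap a model call so raw user text never reaches the model or storage."""

    def handle(user_message: str) -> str:
        # Sanitize before the prompt leaves the service.
        safe_prompt = sanitize(user_message)
        reply = call_model(safe_prompt)
        # Sanitize the output too, in case it echoes sensitive text.
        safe_reply = sanitize(reply)
        # Only sanitized text is persisted (or later embedded).
        store(f"user: {safe_prompt}\nassistant: {safe_reply}")
        return safe_reply

    return handle
```

Passing the sanitizer, model client, and store in as plain callables keeps the wrapper testable with stubs and independent of any particular web framework or model SDK.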