did - PII Removal
did pseudonymizes personally identifiable information (PII) in text documents so they can be used safely with LLMs. It replaces names with placeholders or fake names and links variants of the same person (e.g., John Doe = John D.).
This produces parametric documents: anonymized files with consistent entity placeholders (e.g., [PER1], [ORG1]). The parameters can be swapped after the fact for bias testing, multilingual prompts, or jurisdiction-specific fake names, which enables reproducible LLM analysis without retraining.
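The idea of swapping parameters after the fact can be sketched in a few lines of Python. This is a minimal illustration, not did's own API: the `fill` helper, the `PARAMS_*` dictionaries, and the names in them are all hypothetical, and the placeholder syntax is assumed to be `[TAG<number>]` as in the examples above.

```python
import re

# Hypothetical parameter sets; the names are illustrative, not produced by did.
PARAMS_EN = {"PER1": "Alex Lee", "PER2": "Pat Kim"}
PARAMS_DE = {"PER1": "Lena Vogel", "PER2": "Jonas Brandt"}

# Assumed placeholder syntax: uppercase tag plus index in square brackets.
PLACEHOLDER = re.compile(r"\[([A-Z]+\d+)\]")


def fill(template: str, params: dict[str, str]) -> str:
    """Replace each [TAG] placeholder with its value; leave unknown tags as-is."""
    return PLACEHOLDER.sub(lambda m: params.get(m.group(1), m.group(0)), template)


template = "[PER1] ([PER1]) sued [PER2]."
print(fill(template, PARAMS_EN))
print(fill(template, PARAMS_DE))
```

Because the same template is filled from different dictionaries, two runs differ only in the chosen names, which is what makes comparisons across parameter sets reproducible.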
Why did?
- Privacy: Local processing, minimal leakage.
- Context: Tracks entities across variants.
- Bias control: Gender/ethnicity swaps for fairness.
- Legal compliance: De-ID for sharing/analysis.
Features
- NER-based PII detection.
- Configurable replacement (fake names, placeholders).
- Person clustering for consistency.
- CLI for batch/files.
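To give a feel for person clustering, here is a rough sketch of how name variants might be grouped. It is an assumption-laden heuristic (token-wise comparison with initials), not the algorithm did actually uses; `normalize`, `same_person`, and `cluster` are hypothetical helpers.

```python
def normalize(name: str) -> list[str]:
    """Lowercase tokens with trailing dots removed, e.g. 'John D.' -> ['john', 'd']."""
    tokens = (tok.strip(".").lower() for tok in name.split())
    return [t for t in tokens if t]


def same_person(a: str, b: str) -> bool:
    """Heuristic match: equal tokens, single-letter initials, or an initials-only form."""
    ta, tb = normalize(a), normalize(b)
    # Initials-only mention such as "JD" against a multi-token name.
    for full, short in ((ta, tb), (tb, ta)):
        if len(short) == 1 and len(full) > 1 and short[0] == "".join(t[0] for t in full):
            return True
    if len(ta) != len(tb):
        return False
    # Token by token: identical, or one side is a one-letter initial of the other.
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(ta, tb)
    )


def cluster(mentions: list[str]) -> list[list[str]]:
    """Greedy clustering: attach each mention to the first cluster it matches."""
    clusters: list[list[str]] = []
    for mention in mentions:
        for group in clusters:
            if any(same_person(mention, other) for other in group):
                group.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


print(cluster(["John Doe", "Jane Smith", "John D.", "JD"]))
```

A heuristic like this is ambiguous by design: "JD" would match "Jane Doe" just as well, so a real implementation needs more context than the surface form alone.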
Install
uv pip install git+https://github.com/evidlabel/did.git
did -h
Usage
did input.txt --output output.txt --mode fake --swap-gender
Options:
- --mode placeholder|fake: Replace strategy.
- --swap-gender|ethnicity: Neutralize bias.
- --cluster: Merge person variants.
Example
Input: “John Doe (JD) sued Jane Smith.”
Output: “[PER1] ([PER1]) sued [PER2].” (or fakes: “Alex Lee sued Pat Kim”)
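The placeholder output above can be reproduced with a small sketch that takes already-detected person clusters as input. This is only an illustration under stated assumptions: the `pseudonymize` function is hypothetical, the clusters are written by hand instead of coming from NER, and did's internals may work differently.

```python
import re


def pseudonymize(text: str, clusters: list[list[str]]) -> str:
    """Replace every surface form of cluster i with the placeholder [PERi]."""
    mapping = {}
    for index, forms in enumerate(clusters, start=1):
        for form in forms:
            mapping[form] = f"[PER{index}]"
    if not mapping:
        return text
    # Longest forms first so a full name wins over a shorter overlapping form.
    alternatives = "|".join(
        re.escape(form) for form in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b(?:{alternatives})\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


text = "John Doe (JD) sued Jane Smith."
clusters = [["John Doe", "JD"], ["Jane Smith"]]  # hand-written, stands in for NER output
print(pseudonymize(text, clusters))
```

Both "John Doe" and "JD" map to the same placeholder because they sit in the same cluster, which is what keeps the entity consistent across the document.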