ML-Powered PDF Redaction for Python — Remove PII from Any PDF
Permanent PII removal with audit trails. ML-powered detection across 20+ entity types with confidence scoring. HIPAA, GDPR, CCPA compliant.
The Problem
Why PDF Redaction Is Harder Than It Looks
Finding PII is hard. Regex catches patterns like SSNs, but misses context-dependent data like names and addresses. You need ML to close that gap.
Removing it is harder. PDFs weren't built for editing — what looks like "John Smith" on screen might be scattered across multiple internal objects. Most tools just draw black boxes over text, but the original content stays in the file.
Get either side wrong and you have a compliance gap.
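To make the detection gap concrete, here is a minimal sketch using only Python's standard `re` module. The sample sentence and the SSN pattern are illustrative assumptions, not PDFDancer's detection logic.

```python
import re

# Illustrative pattern for US Social Security numbers in the common
# NNN-NN-NNNN layout. An assumption for this demo, not a full validator.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

sample = "Patient John Smith (SSN 123-45-6789) lives at 42 Elm Street, Springfield."

# The fixed-format identifier is easy to find.
ssn_hits = SSN_PATTERN.findall(sample)
print(ssn_hits)

# The name and address have no fixed shape, so the pattern leaves them in place.
leftover = SSN_PATTERN.sub("[REDACTED]", sample)
print(leftover)
```

The SSN is replaced, but "John Smith" and "42 Elm Street" survive untouched. Recognizing those requires context about the surrounding words, which is the part pattern matching cannot supply.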
The Limitations
- Pattern matching alone misses context-dependent PII like names and addresses
- Overlay-based redaction hides text visually but doesn't remove it from the file
- No confidence scoring — you can't tell good detections from false positives
- No audit trail — you can't prove what was removed or when
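The overlay problem can be illustrated with a hand-written, uncompressed PDF content stream. The byte strings below are made up for the demonstration (real files usually compress their streams), but the operators are standard PDF: `Tj` and `TJ` show text, and `re` followed by `f` draws a filled rectangle.

```python
# A tiny text object: select font F1 at 12pt, move to (72, 720), show a string.
visible_text = b"BT /F1 12 Tf 72 720 Td (John Smith) Tj ET\n"

# A black filled rectangle positioned over the same area.
black_box = b"0 0 0 rg 70 716 80 14 re f\n"

# Overlay-style "redaction": the box is drawn on top, the text is left alone.
overlay_redacted = visible_text + black_box

# Hidden on screen, still present in the file's bytes.
print(b"John Smith" in overlay_redacted)

# The same name split across a TJ array with kerning offsets.
# A naive byte search for the full name finds nothing to remove.
split_text = b"BT /F1 12 Tf 72 720 Td [(Jo) -20 (hn Sm) 15 (ith)] TJ ET\n"
print(b"John Smith" in split_text)
```

The first check shows why a black box is not removal: the string is still recoverable from the file. The second shows why removal needs real knowledge of the PDF structure: the name is on the page, but not as one contiguous string.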
What PDFDancer Changes
- ML-powered detection — context-aware entity recognition across 20+ PII types
- True binary-level removal — content permanently deleted, not covered up
- Confidence scores — filter detections by threshold to control precision vs. recall
- Audit trails — verifiable proof of what was redacted and when
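As a rough idea of what an audit record could contain, the sketch below builds one entry per redaction. The schema and the `audit_entry` helper are hypothetical and are not PDFDancer's audit format.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_entry(entity_type, page, original_text, confidence):
    """Build one audit record (hypothetical schema) for a single redaction.

    Only a SHA-256 digest of the removed text is stored, so the log
    does not repeat the plain text it documents.
    """
    return {
        "entity_type": entity_type,
        "page": page,
        "confidence": confidence,
        "text_sha256": hashlib.sha256(original_text.encode("utf-8")).hexdigest(),
        "redacted_at": datetime.now(timezone.utc).isoformat(),
    }


entry = audit_entry("PERSON", 1, "John Smith", 0.97)
print(json.dumps(entry, indent=2))
```

A plain digest of a short, guessable value such as an SSN can be brute-forced, so a production log would more likely use a keyed hash (for example `hmac` with a secret key) or omit the digest entirely.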
See It in Action
PII Redaction in Python
ML-powered entity detection across 20+ PII categories with confidence scoring. Filter by threshold to control precision vs. recall.
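Threshold filtering itself is simple to sketch. The `Detection` class and `filter_by_confidence` function below are hypothetical stand-ins written for this example; they are not the PDFDancer SDK's types or method names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record; not the PDFDancer SDK type."""
    entity_type: str
    text: str
    confidence: float


def filter_by_confidence(detections, threshold):
    """Keep detections whose confidence is at or above the threshold."""
    return [d for d in detections if d.confidence >= threshold]


detections = [
    Detection("PERSON", "John Smith", 0.98),
    Detection("ADDRESS", "42 Elm Street", 0.81),
    Detection("PERSON", "Elm", 0.35),  # likely a false positive
]

strict = filter_by_confidence(detections, 0.90)   # favors precision
lenient = filter_by_confidence(detections, 0.50)  # favors recall

print([d.text for d in strict])   # ['John Smith']
print([d.text for d in lenient])  # ['John Smith', '42 Elm Street']
```

A high threshold redacts only what the model is sure about and risks leaving PII behind. A low threshold catches more at the cost of redacting text that was never sensitive.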
Accuracy
Published Accuracy Numbers
These are real benchmark results from our automated redaction engine on common PII categories. The PDFDancer SDK gives you full control — you set the confidence threshold and decide what to redact.
| Entity Type | Precision | Recall | F1 Score |
|---|---|---|---|
| Person | 97.43% | 96.28% | 0.969 |
| Dates of Birth | 100% | 92.57% | 0.961 |
| Account Number / SSN | 85.27% | 93.93% | 0.894 |
| Addresses | 99.43% | 91.22% | 0.951 |
| Phone / Fax Numbers | 94.12% | 96.3% | 0.952 |
| Email Addresses | 99.58% | 99.98% | 0.998 |
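The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The short check below does that with the figures copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both as fractions in [0, 1]."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1) copied from the table above.
reported = {
    "Person": (97.43, 96.28, 0.969),
    "Dates of Birth": (100.0, 92.57, 0.961),
    "Account Number / SSN": (85.27, 93.93, 0.894),
    "Addresses": (99.43, 91.22, 0.951),
    "Phone / Fax Numbers": (94.12, 96.3, 0.952),
    "Email Addresses": (99.58, 99.98, 0.998),
}

computed = {}
for entity, (p, r, f1_reported) in reported.items():
    computed[entity] = f1_score(p / 100, r / 100)
    print(f"{entity}: computed {computed[entity]:.3f}, reported {f1_reported:.3f}")
```

Each computed value rounds to the reported one. The Account Number / SSN row shows the trade-off most clearly: recall is high, but lower precision pulls the F1 score down.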
Let’s Talk About Your Use Case
15-minute call — we’ll walk through your document pipeline and show how PDFDancer fits.