ML-Powered PDF Redaction for Node.js — Remove PII from Any PDF
Permanent PII removal with audit trails. ML-powered detection across 20+ entity types with confidence scoring. HIPAA, GDPR, CCPA compliant.
The Problem
Why PDF Redaction Is Harder Than It Looks
Finding PII is hard. Regex catches patterns like SSNs, but misses context-dependent data like names and addresses. You need ML to close that gap.
Removing it is harder. PDFs weren't built for editing — what looks like "John Smith" on screen might be scattered across multiple internal objects. Most tools just draw black boxes over text, but the original content stays in the file.
Get either side wrong and you have a compliance gap.
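To make the detection gap concrete, here is a minimal sketch of pattern matching on its own. The sample sentence, the regular expression, and the function name `findSsnLikeStrings` are illustrative assumptions and are not part of PDFDancer.

```typescript
// Illustrative only: a regex detector for SSN-shaped strings.
// The pattern and the sample text are assumptions made for this sketch.
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function findSsnLikeStrings(text: string): string[] {
  // With a global regex, match() returns every match, or null when none exist.
  const matches = text.match(SSN_PATTERN);
  return matches === null ? [] : Array.from(matches);
}

const sample = "Patient John Smith (SSN 123-45-6789) lives at 42 Elm Street.";
const hits = findSsnLikeStrings(sample);

// The SSN is found; the name and the street address are not,
// because nothing about them follows a fixed pattern.
console.log(hits);
```

The name and the address in the sample pass through untouched, which is the kind of context-dependent data a pattern-only approach cannot see.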
The Limitations
- Pattern matching alone misses context-dependent PII like names and addresses
- Overlay-based redaction hides text visually but doesn't remove it from the file
- No confidence scoring — you can't tell good detections from false positives
- No audit trail — you can't prove what was removed or when
What PDFDancer Changes
- ML-powered detection — context-aware entity recognition across 20+ PII types
- True binary-level removal — content permanently deleted, not covered up
- Confidence scores — filter detections by threshold to control precision vs. recall
- Audit trails — verifiable proof of what was redacted and when
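The difference between covering text and removing it can be shown with a toy model. This is a deliberately simplified sketch: it treats a page as a list of drawing operations, which is an assumption for illustration and not how PDFDancer or the PDF format represents content. All type and function names here are hypothetical.

```typescript
// Toy model of a page: an ordered list of drawing operations.
type TextOp = { kind: "text"; value: string };
type RectOp = { kind: "rect"; color: string };
type Op = TextOp | RectOp;

// Overlay-style redaction: paint a box on top and leave the text in place.
function overlayRedact(ops: Op[]): Op[] {
  const box: RectOp = { kind: "rect", color: "black" };
  return [...ops, box];
}

// Removal-style redaction: drop every text operation that contains the target.
function removeRedact(ops: Op[], target: string): Op[] {
  return ops.filter((op) => !(op.kind === "text" && op.value.indexOf(target) !== -1));
}

// What a text extractor would still recover from the page.
function extractText(ops: Op[]): string {
  const parts: string[] = [];
  ops.forEach((op) => {
    if (op.kind === "text") parts.push(op.value);
  });
  return parts.join(" ");
}

const page: Op[] = [
  { kind: "text", value: "Patient:" },
  { kind: "text", value: "John Smith" },
];

const afterOverlay = extractText(overlayRedact(page));
const afterRemoval = extractText(removeRedact(page, "John Smith"));
console.log({ afterOverlay, afterRemoval });
```

In this model the overlay leaves the name recoverable by anything that reads the text operations, while removal does not. Real PDFs are harder because a single name may be split across several internal objects, which the toy model ignores.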
See It in Action
PII Redaction in Node.js
ML-powered entity detection across 20+ PII categories with confidence scoring. Filter by threshold to control precision vs. recall.
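Threshold filtering can be sketched as follows. The `Detection` shape, the field names, and the sample values are assumptions made for this example; they are not PDFDancer's actual types or output.

```typescript
// Hypothetical detection shape for this sketch; not PDFDancer's real API.
interface Detection {
  entityType: string;
  text: string;
  confidence: number; // fraction between 0 and 1
}

// Keep only detections whose confidence is at or above the threshold.
function filterByConfidence(detections: Detection[], threshold: number): Detection[] {
  if (threshold < 0 || threshold > 1) {
    throw new RangeError("threshold must be between 0 and 1");
  }
  return detections.filter((d) => d.confidence >= threshold);
}

// Made-up sample detections.
const detections: Detection[] = [
  { entityType: "PERSON", text: "John Smith", confidence: 0.97 },
  { entityType: "ADDRESS", text: "42 Elm Street", confidence: 0.81 },
  { entityType: "PERSON", text: "Elm", confidence: 0.35 },
];

const strict = filterByConfidence(detections, 0.9);
const lenient = filterByConfidence(detections, 0.3);
console.log(strict.length, lenient.length);
```

A higher threshold keeps fewer, more certain detections (favoring precision); a lower one keeps more candidates at the cost of more false positives (favoring recall).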
Comparison
PDFDancer vs. Apryse, Adobe, pdf-lib
| Feature | PDFDancer | Apryse | Adobe PDF Services | pdf-lib |
|---|---|---|---|---|
| ML-Powered PII Detection | ✓ Entity detection with confidence scores | Limited patterns | Cloud-only API | ✗ Text-only, no redaction |
| Permanent Removal | ✓ Binary-level deletion | Annotation-based | Requires separate sanitize | ✗ No redaction support |
| Audit Trail | ✓ Full logging with timestamps | Limited metadata | Per-API-call logging | ✗ No tracking |
| Express/Lambda Support | ✓ Async/await, stateless | Requires heap allocation | HTTP client required | ✓ Browser-focused |
| Self-Hosted | ✓ Yes, on-prem available | ✓ Yes (expensive) | ✗ Cloud-only | ✓ Yes (no redaction) |
| Pricing | Free tier + usage-based | $10K+/year per dev | $$$ per API call | Open source (MIT) |
Accuracy
ML-Powered Detection Benchmarks
PDFDancer's semantic redaction engine is powered by purpose-built ML, not generic text search. Per-category F1 scores range from 0.894 (Account Number / SSN) to 0.998 (Email Addresses).
| Category | Precision | Recall | F1 Score |
|---|---|---|---|
| Person | 97.43% | 96.28% | 0.969 |
| Dates of Birth | 100.00% | 92.57% | 0.961 |
| Account Number / SSN | 85.27% | 93.93% | 0.894 |
| Addresses | 99.43% | 91.22% | 0.951 |
| Phone / Fax Numbers | 94.12% | 96.30% | 0.952 |
| Email Addresses | 99.58% | 99.98% | 0.998 |
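For readers who want to check the F1 column, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the table above; the function and variable names are only illustrative.

```typescript
// F1 is the harmonic mean of precision and recall (both as fractions in [0, 1]).
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

// Precision and recall copied from the table, converted from percent to fractions.
const benchmarkRows: Array<[string, number, number]> = [
  ["Person", 0.9743, 0.9628],
  ["Dates of Birth", 1.0, 0.9257],
  ["Account Number / SSN", 0.8527, 0.9393],
  ["Addresses", 0.9943, 0.9122],
  ["Phone / Fax Numbers", 0.9412, 0.963],
  ["Email Addresses", 0.9958, 0.9998],
];

// Round to three decimals, matching the precision used in the table.
const f1ByCategory = benchmarkRows.map((row) => ({
  category: row[0],
  f1: f1Score(row[1], row[2]).toFixed(3),
}));

f1ByCategory.forEach((r) => console.log(r.category, r.f1));
```

Rounded to three decimals, the recomputed values agree with the F1 column.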
Getting Started
Three Steps to Your First Redaction
Install the Package
npm install pdfdancer-client-typescript
Works with Node.js 14+. TypeScript and JavaScript both supported.
Get Your API Key
Sign up at pdfdancer.com to get a free tier API key. Set it as an environment variable or pass it to PDFDancer.
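One way to read the key from the environment is sketched below. The variable name `PDFDANCER_API_KEY` and the helper `readApiKey` are assumptions for illustration; check the PDFDancer documentation for the name the client actually expects.

```typescript
// Hypothetical variable name; the real one may differ.
const API_KEY_VAR = "PDFDANCER_API_KEY";

// Read the key from an environment-like object and fail fast if it is missing.
function readApiKey(env: Record<string, string | undefined>): string {
  const raw = env[API_KEY_VAR];
  const key = raw === undefined ? "" : raw.trim();
  if (key === "") {
    throw new Error(API_KEY_VAR + " is not set");
  }
  return key;
}

// Demo with a literal object so the sketch runs anywhere.
const demoKey = readApiKey({ PDFDANCER_API_KEY: "  demo-key-123  " });
console.log(demoKey.length);
```

In a Node.js service you would call `readApiKey(process.env)` once at startup, so a missing key surfaces immediately rather than on the first redaction request.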
Run Your First Redaction
Import PDFDancer, open a PDF, configure entity detection, redact, and save. See the code examples above for ready-to-use templates.
Let’s Talk About Your Use Case
15-minute call — we’ll walk through your document pipeline and show how PDFDancer fits.