Sanitize
Standalone content sanitization API — scan text and files for PII, PHI, credentials, and healthcare entities. Redact, tokenize, block, or flag. Works with any model.
Aira Sanitize is a standalone content-cleaning API. Send text or upload a file, pick a mode, get back clean output with a full findings report. No policy engine, no action workflow — just scan and clean.
Sanitize vs Content Scan Policies — these are different features that share the same detection engine. See which one to use below.
How it works
Every sanitize call runs through a multi-layer detection pipeline:
Input (text or file)
|
[1. Regex patterns] — SSN, credit card, API keys, MRN, NPI
|
[2. Presidio NER] — person names, addresses, dates, organizations
|
[3. Healthcare detectors] — ICD-10, drug names, DEA (HIPAA policy only)
|
[4. AI review (optional)] — LLM catches what deterministic scanners miss
|
Merged + deduplicated spans
|
[Mode: redact / tokenize / block / flag]
|
Output + findings report
All layers run in a single pass. Results are merged by position: if regex and NER both find the same SSN, the higher-confidence match wins.
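The merge rule can be illustrated with a minimal sketch. The `Span` type, its field names, and the greedy strategy below are assumptions made for illustration, not Aira's actual implementation; the only behavior taken from the text is that overlapping detections are resolved in favor of the higher-confidence match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical finding: character offsets plus detector metadata."""
    start: int
    end: int
    entity: str
    confidence: float
    source: str  # e.g. "regex" or "ner"


def merge_spans(spans):
    """Keep the highest-confidence span among any set of overlapping spans."""
    kept = []
    # Visit the strongest candidates first so weaker overlapping ones are dropped.
    for span in sorted(spans, key=lambda s: (-s.confidence, s.start)):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s.start)


# Offsets refer to "Patient John Smith, SSN 123-45-6789".
findings = [
    Span(24, 35, "US_SSN", 0.85, "ner"),
    Span(24, 35, "US_SSN", 0.95, "regex"),
    Span(8, 18, "PERSON", 0.85, "ner"),
]
merged = merge_spans(findings)
```

Here the regex and NER detections of the SSN cover the same offsets, so only the regex span (the higher confidence in this made-up data) survives alongside the name.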
For files (images, PDFs, DICOM)
Image redaction uses a unified Presidio pipeline with all Aira recognizers registered as custom PatternRecognizer objects. Presidio handles OCR, entity detection, span-to-pixel mapping, and redaction in one pass.
Small images are automatically upscaled (3x with sharpening) before OCR to improve detection of symbols like @ and . that Tesseract misses at low resolution.
When AI-assisted review is enabled, any additional entities the AI finds are mapped back to pixel bounding boxes and patched onto the already-redacted image. This closes the gap between text-level AI detection and pixel-level redaction.
Four modes
| Mode | Text | Files | Reversible? |
|---|---|---|---|
| Redact | [REDACTED] replaces PII | Pixels blacked out / text redacted in-place | No |
| Tokenize | <PERSON_001> replaces PII | File redacted + token mapping returned | Yes (via detokenize) |
| Block | Rejected if PII found | Rejected if PII found | N/A |
| Flag | Findings only, content unchanged | Findings only, file unchanged | N/A |
Redact
{"content": "Patient John Smith, SSN 123-45-6789", "mode": "redact"}
// → "Patient [REDACTED], SSN [REDACTED]"
Irreversible. Best for production pipelines where PII must never reach downstream systems.
Tokenize
{"content": "Patient John Smith, SSN 123-45-6789", "mode": "tokenize"}
// → "Patient <NER_PERSON_001>, SSN <US_SSN_001>"
// → token_mapping: {"<NER_PERSON_001>": "John Smith", "<US_SSN_001>": "123-45-6789"}
Reversible via the /sanitize/detokenize endpoint. Best for LLM pipelines where a human reviewer needs to see originals later.
For files, tokenize produces a redacted file for download (pixels blacked out, same as redact) plus the token mapping in the response. You can't put <PERSON_001> into pixels — so the file is visually redacted, and the mapping provides reversibility.
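To make the text-mode round trip concrete, here is a minimal local sketch of tokenizing and detokenizing. The function names and the per-entity counter scheme are assumptions chosen to reproduce the token format shown in the example above; the real service performs this server-side and may number tokens differently.

```python
from collections import defaultdict


def tokenize(text, spans):
    """Replace each (start, end, entity) span with a numbered token.

    Assumes spans do not overlap. Returns the tokenized text and the mapping.
    """
    counters = defaultdict(int)
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, entity in sorted(spans):
        counters[entity] += 1
        token = f"<{entity}_{counters[entity]:03d}>"
        mapping[token] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(token)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def detokenize(text, mapping):
    """Restore the original values from a token mapping."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


original = "Patient John Smith, SSN 123-45-6789"
tokenized, token_mapping = tokenize(
    original, [(8, 18, "NER_PERSON"), (24, 35, "US_SSN")]
)
restored = detokenize(tokenized, token_mapping)
```

Because the mapping contains the original values, it needs the same protection as the unredacted content.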
Block
Returns blocked: true with empty output if any PII is found. Best for hard compliance boundaries.
Flag
Scan and report only. Content and files are returned unchanged. Best for monitoring and visibility.
Policy packs
Policy packs control which detection libraries run. Every pack includes pii and credentials. Healthcare is opt-in.
| Pack | Libraries | Best for |
|---|---|---|
| default | pii, credentials, prompt_injection | General SaaS, fintech, any industry |
| hipaa | pii, credentials, prompt_injection, healthcare | Healthcare, pharma, insurance |
| pci | pii, credentials | Payment processing, e-commerce |
| legal | pii, credentials | Legal tech, document review |
What default catches
Names, emails, phone numbers, SSNs, passports, IBANs, credit cards (Luhn-validated), IPv4/6 addresses, AWS keys, GitHub/GitLab PATs, Slack tokens, Stripe keys, JWTs, PEM private keys, prompt injection attempts, and more.
This is what every customer gets. No healthcare context needed.
What hipaa adds
HIPAA is an optional add-on for healthcare customers. If you're not handling Protected Health Information, use default — it already catches all standard PII and credentials.
The hipaa pack adds the healthcare detection library on top of everything in default:
| Entity | Detection method | Severity |
|---|---|---|
| Medical Record Number (MRN) | Regex: MRN-, MR#, PAT-, Patient ID: | critical |
| NPI (National Provider Identifier) | Regex + Luhn checksum validation | critical |
| DEA number | Regex + DEA checksum validation | critical |
| ICD-10 diagnosis code | Regex: A00.0-Z99.9 format | warning |
| Drug / medication name | Dictionary: 200+ common prescriptions | warning |
| VIN (Vehicle ID Number) | Regex | warning |
| Account number patterns | Regex | warning |
These patterns implement HIPAA Safe Harbor de-identification — the method specified in 45 CFR 164.514(b)(2) for removing all 18 identifier categories.
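The checksum validations in the table use publicly documented algorithms. The sketch below shows one way to implement them; the function names and the exact regexes are illustrative assumptions, not Aira's detectors. An NPI is validated by running the Luhn algorithm over the 10-digit number prefixed with 80840, and a DEA number's seventh digit must equal the last digit of (d1 + d3 + d5) + 2 × (d2 + d4 + d6).

```python
import re


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """NPI check digit is computed over the number prefixed with 80840."""
    return bool(re.fullmatch(r"\d{10}", npi)) and luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """Two leading characters followed by seven digits, the last a check digit."""
    match = re.fullmatch(r"[A-Za-z][A-Za-z9](\d{7})", dea)
    if not match:
        return False
    d = [int(c) for c in match.group(1)]
    total = d[0] + d[2] + d[4] + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]
```

Checksum validation is what keeps a regex for "any 10 digits" from flagging every phone number or order ID as an NPI.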
File sanitization
Upload images (JPEG, PNG, TIFF, BMP, GIF), PDFs, or DICOM medical images. Max file size: 50 MB.
File types are detected by magic bytes (not Content-Type headers) to prevent spoofing.
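As an illustration of magic-byte detection, the sketch below checks the leading bytes of an upload against the well-known signatures of the supported formats. The function name and return labels are assumptions; only the signatures themselves are standard.

```python
def sniff_file_type(data: bytes):
    """Return a file-type label based on leading bytes, or None if unknown."""
    # DICOM files carry a 128-byte preamble followed by the ASCII marker "DICM".
    # Check it first because the preamble content is arbitrary.
    if len(data) >= 132 and data[128:132] == b"DICM":
        return "dicom"
    signatures = [
        (b"\xff\xd8\xff", "jpeg"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"GIF87a", "gif"),
        (b"GIF89a", "gif"),
        (b"II*\x00", "tiff"),  # little-endian TIFF
        (b"MM\x00*", "tiff"),  # big-endian TIFF
        (b"BM", "bmp"),
        (b"%PDF-", "pdf"),
    ]
    for magic, kind in signatures:
        if data.startswith(magic):
            return kind
    return None
```

A client-supplied Content-Type of image/png on a file that starts with %PDF- would be classified as a PDF here, which is the spoofing case the check is meant to catch.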
Images
Detection pipeline:
- Small images are upscaled (3x + sharpen) for better OCR accuracy
- Unified Presidio ImageRedactorEngine with all Aira recognizers performs OCR, entity detection, and pixel-level redaction in one pass
- If AI review is enabled, additional AI-discovered entities are mapped back to OCR word bounding boxes and patched onto the image
Why upscaling matters: At 600x300 pixels, email text is ~8px tall. Tesseract reads maria.schmidt@hospital.de as mariaschmidt@hosptalde (18% confidence — the @ and . are noise). At 1800x900 (3x), confidence jumps to 67% and the email is detected correctly.
PDFs
PyMuPDF performs in-place redaction — search_for() locates entity text on each page, add_redact_annot() + apply_redactions() covers it with a solid black fill. Original document structure (fonts, layout, images, tables, signatures) is preserved.
DICOM medical images
Full PS3.15 Annex E de-identification profile:
- 40+ DICOM tags removed or replaced (PatientName, PatientID, DOB, referring physician, institution, accession numbers, UIDs)
- Optional pixel redaction for burned-in text using Presidio's DicomImageRedactorEngine
- Tag-level audit trail in the response (which tags were removed, with hashed original values)
Downloads
Redacted files are stored temporarily via a one-time download token. Tokens expire after 1 hour, are single-use, and are backed by Redis in production.
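The token semantics (single use, one-hour expiry) can be sketched with an in-memory store. The class and method names are hypothetical, and the dictionary stands in for Redis; a Redis-backed version would presumably rely on key expiry and an atomic read-and-delete, but that is an assumption about the production setup.

```python
import secrets
import time


class DownloadTokenStore:
    """In-memory stand-in for a one-time download token store (illustrative)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def issue(self, file_bytes: bytes) -> str:
        """Store the file and return an unguessable token."""
        token = secrets.token_urlsafe(32)
        self._entries[token] = (self._clock() + self._ttl, file_bytes)
        return token

    def redeem(self, token: str):
        """Return the file once; later or expired attempts return None."""
        entry = self._entries.pop(token, None)  # pop makes the token single-use
        if entry is None:
            return None
        expires_at, file_bytes = entry
        if self._clock() >= expires_at:
            return None
        return file_bytes
```

Clients should download the redacted file promptly and store it themselves, since a second request with the same token fails.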
AI-assisted review
Pass any model registered in your Aira instance — Claude, GPT, Gemini, Bedrock, Azure, Vertex, or self-hosted via BYOM.
The AI receives the original text plus all deterministic findings, and looks for what pattern matching missed:
- OCR errors: "Emait mariaschmidt@hosptalde" → AI recognizes this as an email from context
- Implied conditions: "taking metformin daily" → implies diabetes
- Partial addresses: "lives near the Charite campus" → location identifier
- Context-dependent PII: "my mother's maiden name is..." → indirect identifier
- Non-standard formats: dates, IDs, and references in unusual notation
The AI is additive: it can only add findings, never remove deterministic results.
For files, AI-discovered entities are patched back into the pixel redaction — the system maps AI spans to OCR word bounding boxes and blacks them out.
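A minimal sketch of the span-to-pixel step, assuming the OCR stage yields each word with its character offsets in the extracted text and a pixel bounding box. The `OcrWord` type and its fields are hypothetical; the idea is only that every OCR word overlapping an AI-reported character span contributes a box to black out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """Hypothetical OCR output: a word, its text offsets, and its pixel box."""
    text: str
    start: int  # character offset in the OCR text (inclusive)
    end: int    # character offset in the OCR text (exclusive)
    box: tuple  # (left, top, width, height) in pixels


def boxes_for_span(words, span_start, span_end):
    """Return the pixel boxes of every OCR word that overlaps the span."""
    return [w.box for w in words if w.start < span_end and w.end > span_start]


# OCR text: "Emait mariaschmidt@hosptalde"
ocr_words = [
    OcrWord("Emait", 0, 5, (10, 20, 60, 14)),
    OcrWord("mariaschmidt@hosptalde", 6, 28, (80, 20, 260, 14)),
]
# The AI reports the garbled email at characters 6..28.
to_redact = boxes_for_span(ocr_words, 6, 28)
```

Redacting whole words rather than exact character ranges errs on the side of covering slightly more than the reported span.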
Dry-run / test mode
Use POST /api/v1/sanitize/test to run sanitization without audit logging, receipts, or database writes. Same response format. Use this to test policy packs and AI review before production.
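A dry-run request could be assembled along these lines. Only the endpoint path and the `content` and `mode` fields come from this page; the base URL, the bearer-token header, and the `policy` field name are assumptions, so check them against your Aira instance before use. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_dry_run_request(base_url, api_key, content, mode="flag", policy="default"):
    """Build (but do not send) a POST to the dry-run endpoint."""
    body = json.dumps(
        {"content": content, "mode": mode, "policy": policy}  # "policy" is assumed
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/sanitize/test",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme is assumed
        },
    )


request = build_dry_run_request(
    "https://aira.example.com",  # placeholder host
    "YOUR_API_KEY",
    "Patient John Smith, SSN 123-45-6789",
    mode="redact",
    policy="hipaa",
)
# To send it: urllib.request.urlopen(request)
```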
Detection layers explained
Layer 1: Regex patterns
Deterministic, with negligible added latency (no model call). Catches anything with a known format:
- PII: SSN, email, phone (E.164), credit card (Luhn), IBAN, passport, IPv4/6
- Credentials: AWS keys (AKIA/ASIA), GitHub/GitLab PATs, Slack tokens, Stripe keys, Google API keys, JWTs, PEM private keys, basic-auth URLs, Azure storage keys
- Prompt injection: jailbreak attempts, role-switch markers, system-prompt exfiltration, encoded payload markers
- Healthcare (HIPAA only): MRN, NPI (Luhn), DEA (checksum), ICD-10, drug names, VIN
Layer 2: Presidio NER
Microsoft Presidio with the spaCy en_core_web_lg NER model. Catches unstructured data that regex can't: person names, addresses, dates of birth, organizations, locations.
| Entity | Severity | Action |
|---|---|---|
| US_SSN, CREDIT_CARD, IBAN_CODE, US_PASSPORT, MEDICAL_LICENSE | critical | Redacted |
| PERSON, EMAIL, PHONE, LOCATION, DATE_TIME, ORGANIZATION | warning | Redacted |
| IP_ADDRESS, URL, NRP (nationality/religion/political) | info | Flagged only |
Info-severity entities are never redacted — this prevents false positives like "HDL Cholesterol" being blacked out when detected as ORGANIZATION.
Layer 3: AI review (optional)
LLM second pass. Sees original text + all findings from layers 1-2. Returns additional entities with exact text spans and reasoning. Supports all Aira model providers.
Severity levels
| Severity | Meaning | Redact/Tokenize | Block | Flag |
|---|---|---|---|---|
| critical | Direct identifier (SSN, credit card, MRN) | Replaced | Rejected | Reported |
| warning | Quasi-identifier (name, phone, date) | Replaced | Rejected | Reported |
| info | Metadata (IP, URL, org name) | Not replaced | Rejected | Reported |
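
The table can be read as a small decision function. The sketch below is an illustration under assumptions: the finding dictionaries and the response keys (`output`, `blocked`, `findings`) are hypothetical shapes, and tokenize is omitted because it follows the same replacement rule as redact.

```python
REPLACED = {"critical", "warning"}  # per the table, info is reported but left in place


def apply_mode(text, findings, mode):
    """Apply redact, block, or flag to text given findings with start/end/severity."""
    if mode == "flag":
        return {"output": text, "blocked": False, "findings": findings}
    if mode == "block":
        blocked = len(findings) > 0  # any severity triggers a rejection
        return {"output": "" if blocked else text, "blocked": blocked, "findings": findings}
    if mode == "redact":
        output = text
        # Replace from the end of the text so earlier offsets stay valid.
        for item in sorted(findings, key=lambda f: f["start"], reverse=True):
            if item["severity"] in REPLACED:
                output = output[: item["start"]] + "[REDACTED]" + output[item["end"]:]
        return {"output": output, "blocked": False, "findings": findings}
    raise ValueError(f"unsupported mode: {mode}")


sample = "Patient John Smith at 10.0.0.1"
sample_findings = [
    {"start": 8, "end": 18, "entity": "PERSON", "severity": "warning"},
    {"start": 22, "end": 30, "entity": "IP_ADDRESS", "severity": "info"},
]
redacted = apply_mode(sample, sample_findings, "redact")
```

In this example the name is replaced while the IP address stays in the output but still appears in the findings.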
Sanitize vs Content Scan Policies
Aira has two features that use the same detection engine. Here's when to use which:
| | Sanitize API | Content Scan Policy |
|---|---|---|
| What it is | Standalone API — send content, get clean output | Policy mode in the authorization engine |
| When it runs | On demand, when you call /api/v1/sanitize | Automatically, during authorize() on an action |
| Input | Text string or uploaded file | Action's details field |
| Output | Cleaned content + findings + download token | Policy verdict (allow/deny/require_approval) |
| Modes | Redact, tokenize, block, flag | Deny (critical), require_approval (warning), allow (info) |
| File support | Images, PDFs, DICOM | Text only |
| AI review | Optional (any model) | No |
| Use case | "Clean this content before storing it" | "Block this action if it contains PII" |
| Example | De-identify a patient record before sending to analytics | Block an AI agent from leaking a credit card in its response |
Use Sanitize when you need to clean content — remove PII from documents, de-identify files, tokenize data for LLM processing.
Use Content Scan Policies when you need to gate actions — block an AI agent from executing if its input/output contains sensitive data.
They share the same regex patterns, NER models, and severity mapping. Content scan policies are just the scanner wired into the policy engine's authorize flow.
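The severity-to-verdict mapping from the comparison table can be sketched as follows. The function name is hypothetical, and the rule that the strictest verdict wins when findings have mixed severities is an assumption; the table only states the per-severity mapping.

```python
VERDICT_BY_SEVERITY = {
    "critical": "deny",
    "warning": "require_approval",
    "info": "allow",
}
STRICTNESS = {"allow": 0, "require_approval": 1, "deny": 2}


def policy_verdict(severities):
    """Map finding severities to one verdict; the strictest wins (assumed)."""
    verdicts = [VERDICT_BY_SEVERITY[s] for s in severities]
    return max(verdicts, key=STRICTNESS.__getitem__, default="allow")
```

For example, an action whose details contain both a URL (info) and a phone number (warning) would come back as require_approval under this rule.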
Audit logging
Every sanitize call produces structured audit events with no sensitive content logged:
| Event | When | Fields |
|---|---|---|
| sanitize_file.start | Request received | file_type, file_size, input_hash, mode, policy |
| sanitize_file.scan_complete | Detection finished | findings_count |
| sanitize_file.redaction_complete | File modified | file_type |
| sanitize_file.ai_pixel_patch | AI entities patched to pixels | extra_words_redacted |
| sanitize_file.complete | Response ready | input_hash, output_hash, findings_count, blocked |
| sanitize_file.download | File downloaded | token, file_type, size |
Filenames are hashed (SHA-256) in logs — they may contain PHI (e.g., john-smith-mrn-123.pdf).
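A sketch of how a start event might be assembled without logging raw identifiers. The `filename_hash` key and the choice of SHA-256 over the raw file bytes for `input_hash` are assumptions; the other field names come from the table above.

```python
import hashlib


def hash_filename(filename: str) -> str:
    """SHA-256 hex digest of the filename, so the raw name never reaches the logs."""
    return hashlib.sha256(filename.encode("utf-8")).hexdigest()


def audit_start_event(filename, file_bytes, file_type, mode, policy):
    """Build a sanitize_file.start event containing only hashes and metadata."""
    return {
        "event": "sanitize_file.start",
        "filename_hash": hash_filename(filename),  # assumed field name
        "file_type": file_type,
        "file_size": len(file_bytes),
        "input_hash": hashlib.sha256(file_bytes).hexdigest(),  # assumed algorithm
        "mode": mode,
        "policy": policy,
    }


event = audit_start_event(
    "john-smith-mrn-123.pdf", b"%PDF-1.7 example", "pdf", "redact", "hipaa"
)
```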