Sanitize

Standalone content sanitization API — scan text and files for PII, PHI, credentials, and healthcare entities. Redact, tokenize, block, or flag. Works with any model.

Aira Sanitize is a standalone content-cleaning API. Send text or upload a file, pick a mode, get back clean output with a full findings report. No policy engine, no action workflow — just scan and clean.

Sanitize vs Content Scan Policies — these are different features that share the same detection engine. See which one to use below.

How it works

Every sanitize call runs through a multi-layer detection pipeline:

Input (text or file)
        |
   [1. Regex patterns]      — SSN, credit card, API keys, MRN, NPI
        |
   [2. Presidio NER]        — person names, addresses, dates, organizations
        |
   [3. Healthcare detectors] — ICD-10, drug names, DEA (HIPAA policy only)
        |
   [4. AI review (optional)] — LLM catches what deterministic scanners miss
        |
   Merged + deduplicated spans
        |
   [Mode: redact / tokenize / block / flag]
        |
   Output + findings report

All layers run in a single pass. Results are merged by position — if regex and NER both find the same SSN, the higher-confidence match wins.
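The merge step can be sketched roughly as follows. This is an illustration under stated assumptions, not Aira's implementation: the `Span` type, its field names, and the rule that any overlap counts as "the same position" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int          # inclusive character offset
    end: int            # exclusive character offset
    entity: str         # e.g. "US_SSN"
    confidence: float   # 0.0 to 1.0
    source: str         # e.g. "regex" or "ner" (hypothetical labels)


def merge_spans(spans: list[Span]) -> list[Span]:
    """Keep the highest-confidence span among any group of overlapping spans."""
    kept: list[Span] = []
    # Visit the most confident spans first so they claim their positions.
    for span in sorted(spans, key=lambda s: (-s.confidence, s.start)):
        overlaps = any(span.start < k.end and k.start < span.end for k in kept)
        if not overlaps:
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


text = "Patient John Smith, SSN 123-45-6789"
found = [
    Span(24, 35, "US_SSN", 0.95, "regex"),
    Span(24, 35, "US_SSN", 0.85, "ner"),
    Span(8, 18, "PERSON", 0.80, "ner"),
]
merged = merge_spans(found)
```

Here the regex and NER matches for the SSN collapse into one span, and the name survives untouched because nothing overlaps it.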

For files (images, PDFs, DICOM)

Image redaction uses a unified Presidio pipeline with all Aira recognizers registered as custom PatternRecognizer objects. Presidio handles OCR, entity detection, span-to-pixel mapping, and redaction in one pass.

Small images are automatically upscaled (3x with sharpening) before OCR to improve detection of symbols like @ and . that Tesseract misses at low resolution.

When AI-assisted review is enabled, any additional entities the AI finds are mapped back to pixel bounding boxes and patched onto the already-redacted image. This closes the gap between text-level AI detection and pixel-level redaction.

Four modes

Mode	Text	Files	Reversible?
Redact	`[REDACTED]` replaces PII	Pixels blacked out / text redacted in-place	No
Tokenize	`<PERSON_001>` replaces PII	File redacted + token mapping returned	Yes (via detokenize)
Block	Rejected if PII found	Rejected if PII found	N/A
Flag	Findings only, content unchanged	Findings only, file unchanged	N/A

Redact

{"content": "Patient John Smith, SSN 123-45-6789", "mode": "redact"}
// → "Patient [REDACTED], SSN [REDACTED]"

Irreversible. Best for production pipelines where PII must never reach downstream systems.

Tokenize

{"content": "Patient John Smith, SSN 123-45-6789", "mode": "tokenize"}
// → "Patient <NER_PERSON_001>, SSN <US_SSN_001>"
// → token_mapping: {"<NER_PERSON_001>": "John Smith", "<US_SSN_001>": "123-45-6789"}

Reversible via the /sanitize/detokenize endpoint. Best for LLM pipelines where a human reviewer needs to see originals later.
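To make the reversibility concrete, here is a minimal local sketch of the idea. The function names and the `(start, end, entity)` span format are assumptions for illustration; the token format follows the example above, and in practice the reverse step goes through the /sanitize/detokenize endpoint rather than local code.

```python
def tokenize(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, dict[str, str]]:
    """Replace each non-overlapping (start, end, entity) span with a numbered token."""
    counters: dict[str, int] = {}
    mapping: dict[str, str] = {}
    pieces: list[str] = []
    cursor = 0
    for start, end, entity in sorted(spans):
        counters[entity] = counters.get(entity, 0) + 1
        token = f"<{entity}_{counters[entity]:03d}>"
        mapping[token] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(token)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def detokenize(text: str, mapping: dict[str, str]) -> str:
    """Restore original values from a token mapping."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


original = "Patient John Smith, SSN 123-45-6789"
spans = [(8, 18, "NER_PERSON"), (24, 35, "US_SSN")]
tokenized, token_mapping = tokenize(original, spans)
restored = detokenize(tokenized, token_mapping)
```

The token mapping is the only thing that links tokens back to originals, so it should be stored with the same care as the original content.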

For files, tokenize produces a redacted file for download (pixels blacked out, same as redact) plus the token mapping in the response. You can't put <PERSON_001> into pixels — so the file is visually redacted, and the mapping provides reversibility.

Block

Returns blocked: true with empty output if any PII is found. Best for hard compliance boundaries.

Flag

Scan and report only. Content and files are returned unchanged. Best for monitoring and visibility.

Policy packs

Policy packs control which detection libraries run. Every pack includes pii and credentials. Healthcare is opt-in.

Pack	Libraries	Best for
`default`	pii, credentials, prompt_injection	General SaaS, fintech, any industry
`hipaa`	pii, credentials, prompt_injection, healthcare	Healthcare, pharma, insurance
`pci`	pii, credentials	Payment processing, e-commerce
`legal`	pii, credentials	Legal tech, document review

What `default` catches

Names, emails, phone numbers, SSNs, passports, IBANs, credit cards (Luhn-validated), IPv4/6 addresses, AWS keys, GitHub/GitLab PATs, Slack tokens, Stripe keys, JWTs, PEM private keys, prompt injection attempts, and more.

This is what every customer gets. No healthcare context needed.

What `hipaa` adds

HIPAA is an optional add-on for healthcare customers. If you're not handling Protected Health Information, use default — it already catches all standard PII and credentials.

The hipaa pack adds the healthcare detection library on top of everything in default:

Entity	Detection method	Severity
Medical Record Number (MRN)	Regex: MRN-, MR#, PAT-, Patient ID:	critical
NPI (National Provider Identifier)	Regex + Luhn checksum validation	critical
DEA number	Regex + DEA checksum validation	critical
ICD-10 diagnosis code	Regex: A00.0-Z99.9 format	warning
Drug / medication name	Dictionary: 200+ common prescriptions	warning
VIN (Vehicle ID Number)	Regex	warning
Account number patterns	Regex	warning

Together with the detectors in default, these patterns target HIPAA Safe Harbor de-identification, the method specified in 45 CFR 164.514(b)(2), which requires removing 18 categories of identifiers.
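The checksum validation mentioned in the table is what keeps a regex from flagging arbitrary digit strings. The sketch below shows the publicly documented NPI and DEA check-digit rules; the function names are hypothetical and this is not Aira's code. The DEA check is simplified to a two-letter prefix.

```python
def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """A 10-digit NPI is Luhn-checked with the 80840 prefix prepended."""
    return len(npi) == 10 and npi.isdigit() and luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """Two letters plus seven digits; the seventh digit is the checksum.

    Simplification: real DEA numbers allow a few other prefix forms.
    """
    if len(dea) != 9 or not dea[:2].isalpha() or not dea[2:].isdigit():
        return False
    d = [int(c) for c in dea[2:]]
    total = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]
```

A candidate that matches the regex but fails the checksum can be discarded before it ever becomes a finding, which reduces false positives on order numbers and similar digit runs.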

File sanitization

Upload images (JPEG, PNG, TIFF, BMP, GIF), PDFs, or DICOM medical images. Max file size: 50 MB.

File types are detected by magic bytes (not Content-Type headers) to prevent spoofing.
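Magic-byte detection can be sketched as below. The signatures are the standard ones for each format; the function name and the returned labels are hypothetical, and a production detector would likely validate more than the first few bytes.

```python
from typing import Optional

# Leading-byte signatures for the supported formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
    (b"BM", "bmp"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
]


def detect_file_type(data: bytes) -> Optional[str]:
    """Identify a file from its leading bytes, ignoring any declared Content-Type."""
    # DICOM Part 10 files carry "DICM" right after a 128-byte preamble.
    if data[128:132] == b"DICM":
        return "dicom"
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return None
```

An upload labeled image/png whose bytes start with `%PDF-` would be handled as a PDF, or rejected, rather than trusted on its header.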

Images

Detection pipeline:

Small images are upscaled (3x + sharpen) for better OCR accuracy
Unified Presidio ImageRedactorEngine with all Aira recognizers performs OCR, entity detection, and pixel-level redaction in one pass
If AI review is enabled, additional AI-discovered entities are mapped back to OCR word bounding boxes and patched onto the image

Why upscaling matters: At 600x300 pixels, email text is ~8px tall. Tesseract reads maria.schmidt@hospital.de as mariaschmidt@hosptalde (18% confidence — the @ and . are noise). At 1800x900 (3x), confidence jumps to 67% and the email is detected correctly.

PDFs

PyMuPDF performs in-place redaction — search_for() locates entity text on each page, add_redact_annot() + apply_redactions() covers it with a solid black fill. Original document structure (fonts, layout, images, tables, signatures) is preserved.

DICOM medical images

Full PS3.15 Annex E de-identification profile:

40+ DICOM tags removed or replaced (PatientName, PatientID, DOB, referring physician, institution, accession numbers, UIDs)
Optional pixel redaction for burned-in text using Presidio's DicomImageRedactorEngine
Tag-level audit trail in the response (which tags were removed, with hashed original values)

Downloads

Redacted files are stored temporarily and retrieved with a one-time download token. Tokens are single-use, expire after 1 hour, and are backed by Redis in production.
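The token semantics (single-use, one-hour expiry) can be illustrated with an in-memory stand-in. The class and method names are hypothetical, and a plain dict replaces Redis here purely to keep the sketch self-contained.

```python
import secrets
import time
from typing import Callable, Optional


class DownloadTokenStore:
    """In-memory stand-in for a Redis-backed one-time token store."""

    def __init__(self, ttl_seconds: int = 3600, clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, bytes]] = {}

    def issue(self, payload: bytes) -> str:
        """Store a payload and return an unguessable token for it."""
        token = secrets.token_urlsafe(32)
        self._entries[token] = (self._clock() + self._ttl, payload)
        return token

    def redeem(self, token: str) -> Optional[bytes]:
        """Return the payload once; later or expired redemptions return None."""
        # pop() makes the token single-use: a second redeem finds nothing.
        entry = self._entries.pop(token, None)
        if entry is None:
            return None
        expires_at, payload = entry
        return payload if self._clock() < expires_at else None
```

With Redis, a key TTL plus an atomic get-and-delete would presumably play the same role as the `pop()` and expiry check above.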

AI-assisted review

Pass any model registered in your Aira instance — Claude, GPT, Gemini, Bedrock, Azure, Vertex, or self-hosted via BYOM.

The AI receives the original text plus all deterministic findings, and looks for what pattern matching missed:

OCR errors: "Emait mariaschmidt@hosptalde" → AI recognizes this as an email from context
Implied conditions: "taking metformin daily" → implies diabetes
Partial addresses: "lives near the Charite campus" → location identifier
Context-dependent PII: "my mother's maiden name is..." → indirect identifier
Non-standard formats: dates, IDs, and references in unusual notation

The AI is additive: it can only add findings, never remove deterministic results.
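The additive rule can be sketched as follows. The finding shape, the `MEDICATION` label, and the choice to skip AI findings that fall entirely inside an existing finding are assumptions for illustration.

```python
def add_ai_findings(deterministic: list[dict], ai_findings: list[dict]) -> list[dict]:
    """Return every deterministic finding plus AI findings that cover new text."""
    merged = list(deterministic)  # deterministic results are never dropped
    for finding in ai_findings:
        already_covered = any(
            kept["start"] <= finding["start"] and finding["end"] <= kept["end"]
            for kept in merged
        )
        if not already_covered:
            merged.append(finding)
    return sorted(merged, key=lambda f: f["start"])


def span_of(text: str, value: str, entity: str, source: str) -> dict:
    """Build a finding for the first occurrence of value in text."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "entity": entity, "source": source}


text = "SSN 123-45-6789, taking metformin daily"
deterministic = [span_of(text, "123-45-6789", "US_SSN", "regex")]
ai = [
    span_of(text, "123-45-6789", "US_SSN", "ai"),      # duplicate, ignored
    span_of(text, "metformin", "MEDICATION", "ai"),    # new, added
]
combined = add_ai_findings(deterministic, ai)
```

Because the function only ever appends, a misbehaving model can at worst over-report; it cannot cause a deterministic finding to be dropped.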

For files, AI-discovered entities are patched back into the pixel redaction — the system maps AI spans to OCR word bounding boxes and blacks them out.
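A rough sketch of the span-to-box mapping, assuming OCR output is available as words with pixel boxes and that the text shown to the AI is those words joined by single spaces. All names and the box layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    start: int                       # character offset in the joined OCR text
    box: tuple[int, int, int, int]   # (left, top, width, height) in pixels


def layout_words(raw: list[tuple[str, tuple[int, int, int, int]]]) -> tuple[str, list[OcrWord]]:
    """Join OCR words with single spaces and record each word's offset."""
    words, cursor = [], 0
    for text, box in raw:
        words.append(OcrWord(text, cursor, box))
        cursor += len(text) + 1
    return " ".join(w.text for w in words), words


def boxes_for_span(words: list[OcrWord], start: int, end: int) -> list[tuple[int, int, int, int]]:
    """Return the boxes of every OCR word that overlaps [start, end)."""
    return [w.box for w in words if w.start < end and start < w.start + len(w.text)]


ocr = [("Emait", (10, 10, 50, 12)), ("mariaschmidt@hosptalde", (70, 10, 220, 12))]
full_text, words = layout_words(ocr)
span_start = full_text.index("mariaschmidt")
hits = boxes_for_span(words, span_start, span_start + len("mariaschmidt@hosptalde"))
```

Each returned box would then be filled in on the already-redacted image.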

Dry-run / test mode

Use POST /api/v1/sanitize/test to run sanitization without audit logging, receipts, or database writes. Same response format. Use this to test policy packs and AI review before production.
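A dry-run request might be built like this. The host name, the bearer-token header, and the helper name are assumptions; only the path and the `content` and `mode` fields come from this page.

```python
import json
import urllib.request

BASE_URL = "https://aira.example.com"  # hypothetical host, replace with your instance


def build_test_request(content: str, mode: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the dry-run endpoint."""
    body = json.dumps({"content": content, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/api/v1/sanitize/test",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
        },
    )


request = build_test_request("Patient John Smith, SSN 123-45-6789", "flag", "YOUR_API_KEY")
# Send with urllib.request.urlopen(request) once BASE_URL points at a real instance.
```

Since the response format matches the production endpoint, switching later should only require changing the path.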

Detection layers explained

Layer 1: Regex patterns

Deterministic, with negligible added latency. Catches anything with a known format:

PII: SSN, email, phone (E.164), credit card (Luhn), IBAN, passport, IPv4/6
Credentials: AWS keys (AKIA/ASIA), GitHub/GitLab PATs, Slack tokens, Stripe keys, Google API keys, JWTs, PEM private keys, basic-auth URLs, Azure storage keys
Prompt injection: jailbreak attempts, role-switch markers, system-prompt exfiltration, encoded payload markers
Healthcare (HIPAA only): MRN, NPI (Luhn), DEA (checksum), ICD-10, drug names, VIN

Layer 2: Presidio NER

Microsoft Presidio with spaCy NER model (en_core_web_lg). Catches unstructured data that regex can't: person names, addresses, dates of birth, organizations, locations.

Entity	Severity	Action
US_SSN, CREDIT_CARD, IBAN_CODE, US_PASSPORT, MEDICAL_LICENSE	critical	Redacted
PERSON, EMAIL, PHONE, LOCATION, DATE_TIME	warning	Redacted
IP_ADDRESS, URL, ORGANIZATION, NRP (nationality/religion/political)	info	Flagged only

Info-severity entities are never redacted — this prevents false positives like "HDL Cholesterol" being blacked out when detected as ORGANIZATION.

Layer 3: AI review (optional)

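LLM second pass, shown as step 4 in the pipeline diagram (which draws the healthcare detectors as a separate step 3). It sees the original text plus all findings from layers 1-2 and returns additional entities with exact text spans and reasoning. Supports all Aira model providers.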

Severity levels

Severity	Meaning	Redact/Tokenize	Block	Flag
critical	Direct identifier (SSN, credit card, MRN)	Replaced	Rejected	Reported
warning	Quasi-identifier (name, phone, date)	Replaced	Rejected	Reported
info	Metadata (IP, URL, org name)	Not replaced	Rejected	Reported
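The table can be read as a small decision function. The sketch below covers redact, block, and flag (tokenize follows the same replacement rule as redact); the finding shape and function names are hypothetical.

```python
REPLACED_SEVERITIES = {"critical", "warning"}  # info findings are reported only


def apply_mode(text: str, findings: list[dict], mode: str) -> dict:
    """Apply a mode to findings shaped like {"start", "end", "severity"}."""
    if mode == "flag":
        return {"output": text, "blocked": False, "findings": findings}
    if mode == "block":
        blocked = len(findings) > 0  # any severity, including info, rejects
        return {"output": "" if blocked else text, "blocked": blocked, "findings": findings}
    if mode != "redact":
        raise ValueError(f"mode not covered by this sketch: {mode}")
    output = text
    # Replace from the end so earlier offsets stay valid.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        if f["severity"] in REPLACED_SEVERITIES:
            output = output[: f["start"]] + "[REDACTED]" + output[f["end"]:]
    return {"output": output, "blocked": False, "findings": findings}


text = "Patient John Smith, SSN 123-45-6789, IP 10.0.0.1"


def finding(value: str, severity: str) -> dict:
    start = text.index(value)
    return {"start": start, "end": start + len(value), "severity": severity}


findings = [
    finding("John Smith", "warning"),
    finding("123-45-6789", "critical"),
    finding("10.0.0.1", "info"),
]
redacted = apply_mode(text, findings, "redact")
blocked = apply_mode(text, findings, "block")
```

In the redact result the IP address stays in the output but still appears in the findings report.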

Sanitize vs Content Scan Policies

Aira has two features that use the same detection engine. Here's when to use which:

	Sanitize API	Content Scan Policy
What it is	Standalone API — send content, get clean output	Policy mode in the authorization engine
When it runs	On demand, when you call `/api/v1/sanitize`	Automatically, during `authorize()` on an action
Input	Text string or uploaded file	Action's `details` field
Output	Cleaned content + findings + download token	Policy verdict (allow/deny/require_approval)
Modes	Redact, tokenize, block, flag	Deny (critical), require_approval (warning), allow (info)
File support	Images, PDFs, DICOM	Text only
AI review	Optional (any model)	No
Use case	"Clean this content before storing it"	"Block this action if it contains PII"
Example	De-identify a patient record before sending to analytics	Block an AI agent from leaking a credit card in its response

Use Sanitize when you need to clean content — remove PII from documents, de-identify files, tokenize data for LLM processing.

Use Content Scan Policies when you need to gate actions — block an AI agent from executing if its input/output contains sensitive data.

They share the same regex patterns, NER models, and severity mapping. Content scan policies are just the scanner wired into the policy engine's authorize flow.
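For comparison with Sanitize modes, the severity-to-verdict mapping from the table can be sketched as below. The function name is hypothetical, and the assumption that the strictest verdict wins when findings of several severities are present is not stated on this page.

```python
VERDICT_BY_SEVERITY = {
    "critical": "deny",
    "warning": "require_approval",
    "info": "allow",
}
STRICTNESS = ["allow", "require_approval", "deny"]  # least to most strict


def verdict_for(severities: list[str]) -> str:
    """Return the strictest verdict implied by the findings' severities."""
    verdicts = [VERDICT_BY_SEVERITY[s] for s in severities]
    return max(verdicts, key=STRICTNESS.index, default="allow")
```

Unlike Sanitize, nothing is rewritten here: the action's details are scanned and the result only decides whether the action proceeds.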

Audit logging

Every sanitize call produces structured audit events with no sensitive content logged:

Event	When	Fields
`sanitize_file.start`	Request received	file_type, file_size, input_hash, mode, policy
`sanitize_file.scan_complete`	Detection finished	findings_count
`sanitize_file.redaction_complete`	File modified	file_type
`sanitize_file.ai_pixel_patch`	AI entities patched to pixels	extra_words_redacted
`sanitize_file.complete`	Response ready	input_hash, output_hash, findings_count, blocked
`sanitize_file.download`	File downloaded	token, file_type, size

Filenames are hashed (SHA-256) in logs — they may contain PHI (e.g., john-smith-mrn-123.pdf).
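Hashing a filename before logging might look like this. The helper names and the `filename_hash` field are hypothetical; only the use of SHA-256 comes from the text above.

```python
import hashlib


def hash_filename(filename: str) -> str:
    """Return the SHA-256 hex digest of a filename for use in audit logs."""
    return hashlib.sha256(filename.encode("utf-8")).hexdigest()


def audit_event(name: str, filename: str, **fields) -> dict:
    """Build an audit event that records only the digest of the filename."""
    # The raw filename never enters the event.
    return {"event": name, "filename_hash": hash_filename(filename), **fields}


event = audit_event(
    "sanitize_file.start",
    "john-smith-mrn-123.pdf",
    file_type="pdf",
    mode="redact",
)
```

The same filename always yields the same digest, so events for one upload can still be correlated without exposing the name itself.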
