Aira

Sanitize

Standalone content sanitization API — scan text and files for PII, PHI, credentials, and healthcare entities. Redact, tokenize, block, or flag. Works with any model.

Aira Sanitize is a standalone content-cleaning API. Send text or upload a file, pick a mode, get back clean output with a full findings report. No policy engine, no action workflow — just scan and clean.

Sanitize vs Content Scan Policies — these are different features that share the same detection engine. See which one to use below.


How it works

Every sanitize call runs through a multi-layer detection pipeline:

Input (text or file)
        |
   [1. Regex patterns]      — SSN, credit card, API keys, MRN, NPI
        |
   [2. Presidio NER]        — person names, addresses, dates, organizations
        |
   [3. Healthcare detectors] — ICD-10, drug names, DEA (HIPAA policy only)
        |
   [4. AI review (optional)] — LLM catches what deterministic scanners miss
        |
   Merged + deduplicated spans
        |
   [Mode: redact / tokenize / block / flag]
        |
   Output + findings report

All layers run in a single pass. Results are merged by position — if regex and NER both find the same SSN, the higher-confidence match wins.
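The merge step can be sketched as follows. This is a minimal illustration, not Aira's implementation: the `Finding` type, its field names, and the greedy "strongest first" strategy are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    entity: str        # e.g. "US_SSN"
    source: str        # e.g. "regex" or "ner" (hypothetical labels)
    confidence: float  # 0.0 to 1.0


def merge_findings(findings):
    """Keep the highest-confidence finding among overlapping spans."""
    kept = []
    # Visit the strongest findings first so weaker overlapping ones are dropped.
    for f in sorted(findings, key=lambda f: (-f.confidence, f.start)):
        if all(f.end <= k.start or f.start >= k.end for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f.start)


text = "Patient John Smith, SSN 123-45-6789"
findings = [
    Finding(24, 35, "US_SSN", "regex", 0.95),
    Finding(24, 35, "US_SSN", "ner", 0.85),   # same span, lower confidence
    Finding(8, 18, "PERSON", "ner", 0.80),
]
merged = merge_findings(findings)
```

Here the two SSN findings cover the same characters, so only the regex match with the higher confidence survives, alongside the non-overlapping name.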

For files (images, PDFs, DICOM)

Image redaction uses a unified Presidio pipeline with all Aira recognizers registered as custom PatternRecognizer objects. Presidio handles OCR, entity detection, span-to-pixel mapping, and redaction in one pass.

Small images are automatically upscaled (3x with sharpening) before OCR to improve detection of symbols like @ and . that Tesseract misses at low resolution.

When AI-assisted review is enabled, any additional entities the AI finds are mapped back to pixel bounding boxes and patched onto the already-redacted image. This closes the gap between text-level AI detection and pixel-level redaction.


Four modes

Mode | Text | Files | Reversible?
Redact | [REDACTED] replaces PII | Pixels blacked out / text redacted in-place | No
Tokenize | <PERSON_001> replaces PII | File redacted + token mapping returned | Yes (via detokenize)
Block | Rejected if PII found | Rejected if PII found | N/A
Flag | Findings only, content unchanged | Findings only, file unchanged | N/A

Redact

{"content": "Patient John Smith, SSN 123-45-6789", "mode": "redact"}
// → "Patient [REDACTED], SSN [REDACTED]"

Irreversible. Best for production pipelines where PII must never reach downstream systems.
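For intuition, text redaction over a set of detected spans might look like the sketch below. The `redact` helper and its `(start, end)` span format are hypothetical, and the spans are assumed not to overlap.

```python
def redact(text, spans, placeholder="[REDACTED]"):
    """Replace each (start, end) character span with the placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + placeholder + out[end:]
    return out


original = "Patient John Smith, SSN 123-45-6789"
clean = redact(original, [(8, 18), (24, 35)])
# clean == "Patient [REDACTED], SSN [REDACTED]"
```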

Tokenize

{"content": "Patient John Smith, SSN 123-45-6789", "mode": "tokenize"}
// → "Patient <NER_PERSON_001>, SSN <US_SSN_001>"
// → token_mapping: {"<NER_PERSON_001>": "John Smith", "<US_SSN_001>": "123-45-6789"}

Reversible via the /sanitize/detokenize endpoint. Best for LLM pipelines where a human reviewer needs to see originals later.

For files, tokenize produces a redacted file for download (pixels blacked out, same as redact) plus the token mapping in the response. You can't put <PERSON_001> into pixels — so the file is visually redacted, and the mapping provides reversibility.
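The round trip for text can be sketched like this. It is an illustration under assumptions: the `tokenize` and `detokenize` helpers are hypothetical local functions (not the API endpoints), findings are non-overlapping `(start, end, entity)` tuples, and reusing one token for a repeated value is a design choice of the sketch.

```python
from collections import defaultdict


def tokenize(text, findings):
    """Replace each finding with a numbered token and return the mapping."""
    counters = defaultdict(int)
    value_to_token = {}
    replacements = []
    for start, end, entity in sorted(findings):
        value = text[start:end]
        key = (entity, value)
        if key not in value_to_token:
            counters[entity] += 1
            value_to_token[key] = f"<{entity}_{counters[entity]:03d}>"
        replacements.append((start, end, value_to_token[key]))
    out = text
    # Apply right to left so earlier offsets remain valid.
    for start, end, token in reversed(replacements):
        out = out[:start] + token + out[end:]
    mapping = {token: value for (_, value), token in value_to_token.items()}
    return out, mapping


def detokenize(text, mapping):
    """Restore original values from a token mapping."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


original = "Patient John Smith, SSN 123-45-6789"
tokenized, token_mapping = tokenize(
    original, [(8, 18, "NER_PERSON"), (24, 35, "US_SSN")]
)
restored = detokenize(tokenized, token_mapping)
```

Because the mapping is the only way back to the originals, whoever holds it effectively holds the PII, so it should be stored with the same care as the source text.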

Block

Returns blocked: true with empty output if any PII is found. Best for hard compliance boundaries.

Flag

Scan and report only. Content and files are returned unchanged. Best for monitoring and visibility.


Policy packs

Policy packs control which detection libraries run. Every pack includes pii and credentials. Healthcare is opt-in.

Pack | Libraries | Best for
default | pii, credentials, prompt_injection | General SaaS, fintech, any industry
hipaa | pii, credentials, prompt_injection, healthcare | Healthcare, pharma, insurance
pci | pii, credentials | Payment processing, e-commerce
legal | pii, credentials | Legal tech, document review

What default catches

Names, emails, phone numbers, SSNs, passports, IBANs, credit cards (Luhn-validated), IPv4/6 addresses, AWS keys, GitHub/GitLab PATs, Slack tokens, Stripe keys, JWTs, PEM private keys, prompt injection attempts, and more.

This is what every customer gets. No healthcare context needed.

What hipaa adds

HIPAA is an optional add-on for healthcare customers. If you're not handling Protected Health Information, use default — it already catches all standard PII and credentials.

The hipaa pack adds the healthcare detection library on top of everything in default:

Entity | Detection method | Severity
Medical Record Number (MRN) | Regex: MRN-, MR#, PAT-, Patient ID: | critical
NPI (National Provider Identifier) | Regex + Luhn checksum validation | critical
DEA number | Regex + DEA checksum validation | critical
ICD-10 diagnosis code | Regex: A00.0-Z99.9 format | warning
Drug / medication name | Dictionary: 200+ common prescriptions | warning
VIN (Vehicle ID Number) | Regex | warning
Account number patterns | Regex | warning

These patterns implement HIPAA Safe Harbor de-identification — the method specified in 45 CFR 164.514(b)(2) for removing all 18 identifier categories.
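The two checksum validations named in the table follow publicly documented rules, sketched below. The function names and the simplified DEA prefix pattern are assumptions for illustration and may differ from the patterns Aira actually ships.

```python
import re


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """An NPI is 10 digits and passes Luhn when prefixed with 80840."""
    return bool(re.fullmatch(r"\d{10}", npi)) and luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """Two leading characters, six digits, then one check digit."""
    if not re.fullmatch(r"[A-Z][A-Z9]\d{7}", dea):  # simplified prefix rule
        return False
    d = [int(c) for c in dea[2:]]
    total = d[0] + d[2] + d[4] + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]
```

Checksum validation is what keeps an arbitrary 10-digit number from being reported as an NPI: only about one in ten random candidates passes.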


File sanitization

Upload images (JPEG, PNG, TIFF, BMP, GIF), PDFs, or DICOM medical images. Max file size: 50 MB.

File types are detected by magic bytes (not Content-Type headers) to prevent spoofing.
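A minimal sketch of magic-byte detection for the supported formats is shown here. The `detect_file_type` function and its return labels are hypothetical; the byte signatures themselves are the standard ones for each format.

```python
def detect_file_type(data: bytes):
    """Return a short type name based on leading bytes, or None if unknown."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data.startswith((b"II*\x00", b"MM\x00*")):
        return "tiff"
    if data.startswith(b"%PDF-"):
        return "pdf"
    # DICOM Part 10 files carry a 128-byte preamble followed by "DICM".
    if data[128:132] == b"DICM":
        return "dicom"
    if data.startswith(b"BM"):
        return "bmp"
    return None
```

Because the decision uses only the file's own bytes, a PDF uploaded with a `Content-Type: image/png` header is still classified as a PDF.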

Images

Detection pipeline:

  1. Small images are upscaled (3x + sharpen) for better OCR accuracy
  2. Unified Presidio ImageRedactorEngine with all Aira recognizers performs OCR, entity detection, and pixel-level redaction in one pass
  3. If AI review is enabled, additional AI-discovered entities are mapped back to OCR word bounding boxes and patched onto the image

Why upscaling matters: At 600x300 pixels, email text is ~8px tall. Tesseract reads maria.schmidt@hospital.de as mariaschmidt@hosptalde (18% confidence — the @ and . are noise). At 1800x900 (3x), confidence jumps to 67% and the email is detected correctly.

PDFs

PyMuPDF performs in-place redaction: search_for() locates entity text on each page, then add_redact_annot() and apply_redactions() cover it with a solid black fill. Original document structure (fonts, layout, images, tables, signatures) is preserved.

DICOM medical images

Full PS3.15 Annex E de-identification profile:

  • 40+ DICOM tags removed or replaced (PatientName, PatientID, DOB, referring physician, institution, accession numbers, UIDs)
  • Optional pixel redaction for burned-in text using Presidio's DicomImageRedactorEngine
  • Tag-level audit trail in the response (which tags were removed, with hashed original values)
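The tag-level audit trail can be pictured with the sketch below, which uses a plain dictionary as a stand-in for a DICOM dataset. The tag subset, the audit record fields, and the choice of SHA-256 for hashing values are all assumptions for illustration.

```python
import hashlib

# Hypothetical subset; the real profile covers 40+ tags.
TAGS_TO_REMOVE = {
    "PatientName",
    "PatientID",
    "PatientBirthDate",
    "ReferringPhysicianName",
    "InstitutionName",
    "AccessionNumber",
}


def deidentify_tags(tags: dict):
    """Drop identifying tags and record a hash of each removed value."""
    cleaned, audit = {}, []
    for name, value in tags.items():
        if name in TAGS_TO_REMOVE:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            audit.append(
                {"tag": name, "action": "removed", "original_sha256": digest}
            )
        else:
            cleaned[name] = value
    return cleaned, audit


cleaned, audit = deidentify_tags(
    {"PatientName": "SMITH^JOHN", "PatientID": "MRN-123", "Modality": "CT"}
)
```

Hashing lets an auditor confirm which value was removed without the value itself appearing in the response.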

Downloads

Redacted files are stored temporarily and retrieved with a one-time download token. Tokens are single-use, expire after 1 hour, and are backed by Redis in production.
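The token lifecycle can be sketched with an in-memory store. This is a stand-in for the Redis-backed store, and the class name, method names, and injectable clock are hypothetical.

```python
import secrets
import time

TOKEN_TTL_SECONDS = 3600  # 1 hour


class DownloadTokenStore:
    """In-memory illustration of single-use, expiring download tokens."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}

    def issue(self, file_bytes: bytes) -> str:
        token = secrets.token_urlsafe(32)
        self._entries[token] = (self._clock() + TOKEN_TTL_SECONDS, file_bytes)
        return token

    def redeem(self, token: str):
        # pop() removes the entry, so a second redeem of the same token fails.
        entry = self._entries.pop(token, None)
        if entry is None:
            return None
        expires_at, file_bytes = entry
        if self._clock() >= expires_at:
            return None
        return file_bytes
```

A caller would issue a token when the redacted file is ready and redeem it exactly once on download; a second attempt or an attempt after expiry returns nothing.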


AI-assisted review

Pass any model registered in your Aira instance — Claude, GPT, Gemini, Bedrock, Azure, Vertex, or self-hosted via BYOM.

The AI receives the original text plus all deterministic findings, and looks for what pattern matching missed:

  • OCR errors: "Emait mariaschmidt@hosptalde" → AI recognizes this as an email from context
  • Implied conditions: "taking metformin daily" → implies diabetes
  • Partial addresses: "lives near the Charite campus" → location identifier
  • Context-dependent PII: "my mother's maiden name is..." → indirect identifier
  • Non-standard formats: dates, IDs, and references in unusual notation

The AI is additive: it can only add findings, never remove deterministic results.
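One way to express the additive rule in code is sketched here. The `add_ai_findings` helper is hypothetical, and skipping AI spans that overlap an existing finding is an assumption of the sketch.

```python
def add_ai_findings(deterministic, ai_findings):
    """Return deterministic findings plus AI spans that cover new text.

    Each finding is a (start, end, entity) tuple. Deterministic findings are
    always kept unchanged; AI findings are only ever appended.
    """
    merged = list(deterministic)
    for start, end, entity in ai_findings:
        overlaps = any(start < e and s < end for s, e, _ in merged)
        if not overlaps:
            merged.append((start, end, entity))
    return sorted(merged)


combined = add_ai_findings(
    [(24, 35, "US_SSN")],
    [(24, 35, "ID_NUMBER"), (8, 18, "PERSON")],
)
```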

For files, AI-discovered entities are patched back into the pixel redaction — the system maps AI spans to OCR word bounding boxes and blacks them out.
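Mapping a text span back to pixels reduces to an overlap test between the span and each OCR word's character range. The sketch below assumes a hypothetical `OcrWord` record with offsets into the OCR text and a `(left, top, width, height)` box; the coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    start: int   # character offset of the word in the OCR text
    end: int     # exclusive end offset
    box: tuple   # (left, top, width, height) in pixels


def boxes_for_span(words, span_start, span_end):
    """Return the pixel boxes of every OCR word that overlaps the text span."""
    return [w.box for w in words if w.start < span_end and span_start < w.end]


words = [
    OcrWord("Emait", 0, 5, (10, 12, 48, 14)),
    OcrWord("mariaschmidt@hosptalde", 6, 28, (64, 12, 210, 14)),
]
# An AI finding covering characters 6..28 maps to the second word's box.
to_black_out = boxes_for_span(words, 6, 28)
```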

Dry-run / test mode

Use POST /api/v1/sanitize/test to run sanitization without audit logging, receipts, or database writes. Same response format. Use this to test policy packs and AI review before production.
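A request to either endpoint could be assembled as in this sketch. The base URL and the `build_sanitize_request` helper are hypothetical, only the `content` and `mode` fields shown earlier are used, and authentication headers are omitted because they are not described here.

```python
import json
import urllib.request


def build_sanitize_request(base_url, content, mode, dry_run=False):
    """Build (but do not send) a POST request for the sanitize API."""
    path = "/api/v1/sanitize/test" if dry_run else "/api/v1/sanitize"
    body = json.dumps({"content": content, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_sanitize_request(
    "https://aira.example.com",  # placeholder host
    "Patient John Smith, SSN 123-45-6789",
    "flag",
    dry_run=True,
)
# urllib.request.urlopen(req) would send it once credentials are added.
```

Switching `dry_run` off changes only the path, which makes it easy to promote a tested call to production.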


Detection layers explained

Layer 1: Regex patterns

Deterministic, zero-latency. Catches anything with a known format:

  • PII: SSN, email, phone (E.164), credit card (Luhn), IBAN, passport, IPv4/6
  • Credentials: AWS keys (AKIA/ASIA), GitHub/GitLab PATs, Slack tokens, Stripe keys, Google API keys, JWTs, PEM private keys, basic-auth URLs, Azure storage keys
  • Prompt injection: jailbreak attempts, role-switch markers, system-prompt exfiltration, encoded payload markers
  • Healthcare (HIPAA only): MRN, NPI (Luhn), DEA (checksum), ICD-10, drug names, VIN
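A regex layer of this kind can be sketched in a few lines. The two patterns below are illustrative only and are not Aira's actual expressions; the `scan` function and entity labels are likewise hypothetical.

```python
import re

# Illustrative patterns; production patterns are typically stricter.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "AWS_ACCESS_KEY": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
}


def scan(text):
    """Return sorted (start, end, entity) tuples for every pattern match."""
    findings = []
    for entity, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append((m.start(), m.end(), entity))
    return sorted(findings)


sample = "SSN 123-45-6789, key AKIAIOSFODNN7EXAMPLE"
hits = scan(sample)
```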

Layer 2: Presidio NER

Microsoft Presidio with spaCy NER model (en_core_web_lg). Catches unstructured data that regex can't: person names, addresses, dates of birth, organizations, locations.

Entity | Severity | Action
US_SSN, CREDIT_CARD, IBAN_CODE, US_PASSPORT, MEDICAL_LICENSE | critical | Redacted
PERSON, EMAIL, PHONE, LOCATION, DATE_TIME, ORGANIZATION | warning | Redacted
IP_ADDRESS, URL, NRP (nationality/religion/political) | info | Flagged only

Info-severity entities are never redacted — this prevents false positives like "HDL Cholesterol" being blacked out when detected as ORGANIZATION.

Layer 3: AI review (optional)

LLM second pass. Sees original text + all findings from layers 1-2. Returns additional entities with exact text spans and reasoning. Supports all Aira model providers.


Severity levels

Severity | Meaning | Redact/Tokenize | Block | Flag
critical | Direct identifier (SSN, credit card, MRN) | Replaced | Rejected | Reported
warning | Quasi-identifier (name, phone, date) | Replaced | Rejected | Reported
info | Metadata (IP, URL, org name) | Not replaced | Rejected | Reported
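The severity table can be read as a small decision function, sketched below. The `action_for` helper and its return strings are hypothetical; the logic mirrors the table rows.

```python
def action_for(severity: str, mode: str) -> str:
    """Return what happens to a finding of this severity in the given mode."""
    if severity not in ("critical", "warning", "info"):
        raise ValueError(f"unknown severity: {severity}")
    if mode == "flag":
        return "reported"
    if mode == "block":
        return "rejected"
    if mode in ("redact", "tokenize"):
        # Info-severity findings are reported but left in the content.
        return "not replaced" if severity == "info" else "replaced"
    raise ValueError(f"unknown mode: {mode}")
```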

Sanitize vs Content Scan Policies

Aira has two features that use the same detection engine. Here's when to use which:

Aspect | Sanitize API | Content Scan Policy
What it is | Standalone API: send content, get clean output | Policy mode in the authorization engine
When it runs | On demand, when you call /api/v1/sanitize | Automatically, during authorize() on an action
Input | Text string or uploaded file | Action's details field
Output | Cleaned content + findings + download token | Policy verdict (allow/deny/require_approval)
Modes | Redact, tokenize, block, flag | Deny (critical), require_approval (warning), allow (info)
File support | Images, PDFs, DICOM | Text only
AI review | Optional (any model) | No
Use case | "Clean this content before storing it" | "Block this action if it contains PII"
Example | De-identify a patient record before sending to analytics | Block an AI agent from leaking a credit card in its response

Use Sanitize when you need to clean content — remove PII from documents, de-identify files, tokenize data for LLM processing.

Use Content Scan Policies when you need to gate actions — block an AI agent from executing if its input/output contains sensitive data.

They share the same regex patterns, NER models, and severity mapping. Content scan policies are just the scanner wired into the policy engine's authorize flow.
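On the policy side, the severity-to-verdict mapping could be sketched as follows. The `verdict_for` function is hypothetical, and letting the strictest verdict win when several findings are present is an assumption of the sketch.

```python
SEVERITY_TO_VERDICT = {
    "critical": "deny",
    "warning": "require_approval",
    "info": "allow",
}
# Higher rank means stricter verdict.
VERDICT_RANK = {"allow": 0, "require_approval": 1, "deny": 2}


def verdict_for(severities):
    """Return the strictest verdict implied by the findings' severities."""
    verdicts = [SEVERITY_TO_VERDICT[s] for s in severities]
    return max(verdicts, key=VERDICT_RANK.__getitem__, default="allow")
```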


Audit logging

Every sanitize call produces structured audit events with no sensitive content logged:

Event | When | Fields
sanitize_file.start | Request received | file_type, file_size, input_hash, mode, policy
sanitize_file.scan_complete | Detection finished | findings_count
sanitize_file.redaction_complete | File modified | file_type
sanitize_file.ai_pixel_patch | AI entities patched to pixels | extra_words_redacted
sanitize_file.complete | Response ready | input_hash, output_hash, findings_count, blocked
sanitize_file.download | File downloaded | token, file_type, size

Filenames are hashed (SHA-256) in logs — they may contain PHI (e.g., john-smith-mrn-123.pdf).
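An audit event that carries hashes instead of raw values might be built like this. The `start_event` helper and the `filename_hash` field name are hypothetical; the other field names come from the table above.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def start_event(filename, file_bytes, file_type, mode, policy):
    """Build a sanitize_file.start event with no raw filename or content."""
    return {
        "event": "sanitize_file.start",
        "filename_hash": sha256_hex(filename.encode("utf-8")),
        "file_type": file_type,
        "file_size": len(file_bytes),
        "input_hash": sha256_hex(file_bytes),
        "mode": mode,
        "policy": policy,
    }


event = start_event("john-smith-mrn-123.pdf", b"%PDF-1.7", "pdf", "redact", "hipaa")
log_line = json.dumps(event)
```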
