Sanitize API

Scan, redact, tokenize, or block sensitive content in text and files. Supports PII, PHI, credentials, and healthcare-specific entities.

The Sanitize API is Aira's standalone content-cleaning pipeline. Pass in text or upload a file (image, PDF, DICOM), choose a policy and mode, and get back clean output with a full findings report.

For a conceptual overview, see the Sanitize guide. For the difference between this and content scan policies, see Sanitize vs Content Scan Policies.

Base URL

https://api.airaproof.com/api/v1/sanitize

All sanitize endpoints require a valid API key or JWT, except the file download endpoint, which is authorized by its one-time token.

Sanitize text

Scan and process a text string.

POST /api/v1/sanitize

Request body

Field	Type	Required	Default	Description
`content`	string	Yes		Text to sanitize (max 500,000 chars)
`policy`	string	No	`"default"`	Policy pack: `default`, `hipaa`, `pci`, `legal`
`mode`	string	No	`"redact"`	One of `redact`, `tokenize`, `block`, `flag`
`ai_model`	string	No	`null`	Model ID for AI-assisted second-pass review

Modes

Mode	Behavior
redact	Replace detected entities with `[REDACTED]`
tokenize	Replace detected entities with reversible tokens like `<NER_PERSON_001>` and return a token mapping
block	If sensitive content is found, return `blocked: true` with empty output
flag	Scan only — return findings without modifying the content

Policy packs

Pack	Libraries	Use case
`default`	pii, credentials, prompt_injection	General-purpose scanning
`hipaa`	pii, credentials, prompt_injection, healthcare	PHI: MRNs, diagnoses, dates, providers
`pci`	pii, credentials	Card numbers, account data
`legal`	pii, credentials	Names, emails, case-adjacent PII
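
The request fields above can be checked client-side before a call is made. The following is a minimal sketch using only the Python standard library; `build_sanitize_request` is a hypothetical helper, not part of the Aira SDK, and the host is the one used in the examples in this reference. It assumes the 500,000-character limit counts Python string characters.

```python
import json
import urllib.request

# Values copied from the request body, modes, and policy pack tables.
MAX_CONTENT_CHARS = 500_000
POLICIES = {"default", "hipaa", "pci", "legal"}
MODES = {"redact", "tokenize", "block", "flag"}

# Host taken from the examples in this reference.
SANITIZE_URL = "https://api.airaproof.com/api/v1/sanitize"


def build_sanitize_request(content, api_key, policy="default", mode="redact", ai_model=None):
    """Validate fields locally and return an unsent request (hypothetical helper)."""
    if len(content) > MAX_CONTENT_CHARS:
        raise ValueError(f"content exceeds {MAX_CONTENT_CHARS} characters")
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    body = {"content": content, "policy": policy, "mode": mode}
    if ai_model is not None:
        # Only include the optional field when the caller sets it.
        body["ai_model"] = ai_model

    return urllib.request.Request(
        SANITIZE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately keeps validation testable without network access.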

Example

curl -X POST https://api.airaproof.com/api/v1/sanitize \
  -H "Authorization: Bearer $AIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Patient John Smith, SSN 123-45-6789, was admitted on 2024-03-15.",
    "policy": "hipaa",
    "mode": "redact"
  }'

from aira import Aira

client = Aira(api_key="aira_live_...")

result = client.sanitize(
    content="Patient John Smith, SSN 123-45-6789, was admitted on 2024-03-15.",
    policy="hipaa",
    mode="redact",
)

print(result.clean)
# Patient [REDACTED], SSN [REDACTED], was admitted on [REDACTED].

print(result.findings)
# [Finding(entity_type='ner_person', severity='warning', count=1), ...]

import { Aira } from "aira-sdk";

const aira = new Aira({ apiKey: "aira_live_..." });

const result = await aira.sanitize({
  content: "Patient John Smith, SSN 123-45-6789, was admitted on 2024-03-15.",
  policy: "hipaa",
  mode: "redact",
});

console.log(result.clean);
// Patient [REDACTED], SSN [REDACTED], was admitted on [REDACTED].

Response

{
  "clean": "Patient [REDACTED], SSN [REDACTED], was admitted on [REDACTED].",
  "blocked": false,
  "mode": "redact",
  "policy": "hipaa",
  "input_hash": "sha256:abc123...",
  "output_hash": "sha256:def456...",
  "findings": [
    {
      "entity_type": "ner_person",
      "severity": "warning",
      "action_taken": "redacted",
      "library": "pii_ner",
      "description": "NER: PERSON (score 0.95)",
      "count": 1
    },
    {
      "entity_type": "us_ssn",
      "severity": "critical",
      "action_taken": "redacted",
      "library": "pii",
      "description": "US Social Security Number",
      "count": 1
    },
    {
      "entity_type": "ner_date_time",
      "severity": "warning",
      "action_taken": "redacted",
      "library": "pii_ner",
      "description": "NER: DATE_TIME (score 0.85)",
      "count": 1
    }
  ],
  "token_mapping": null,
  "request_id": "req_abc123"
}
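
For callers working with the raw JSON rather than an SDK, the response can be loaded into typed objects. This is a sketch under the assumption that the field names match the example response above; the `Finding` and `SanitizeResult` classes here are illustrative and are not the SDK's own types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Finding:
    entity_type: str
    severity: str
    action_taken: str
    library: str
    description: str
    count: int


@dataclass
class SanitizeResult:
    clean: str
    blocked: Optional[bool]
    mode: Optional[str]
    policy: Optional[str]
    request_id: Optional[str]
    token_mapping: Optional[Dict[str, str]]
    findings: List[Finding] = field(default_factory=list)

    def total_entities(self) -> int:
        """Sum the per-type counts across all findings."""
        return sum(f.count for f in self.findings)


def parse_sanitize_response(payload: dict) -> SanitizeResult:
    """Convert a decoded JSON response into a SanitizeResult (illustrative)."""
    findings = [
        Finding(
            entity_type=f["entity_type"],
            severity=f["severity"],
            action_taken=f["action_taken"],
            library=f["library"],
            description=f["description"],
            count=f["count"],
        )
        for f in payload.get("findings") or []
    ]
    return SanitizeResult(
        clean=payload["clean"],
        blocked=payload.get("blocked"),
        mode=payload.get("mode"),
        policy=payload.get("policy"),
        request_id=payload.get("request_id"),
        token_mapping=payload.get("token_mapping"),
        findings=findings,
    )
```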

Tokenize mode response

When mode is "tokenize", the response includes reversible tokens:

{
  "clean": "Patient <NER_PERSON_001>, SSN <US_SSN_001>, was admitted on <NER_DATE_TIME_001>.",
  "token_mapping": {
    "<NER_PERSON_001>": "John Smith",
    "<US_SSN_001>": "123-45-6789",
    "<NER_DATE_TIME_001>": "2024-03-15"
  },
  "findings": [...]
}

Sanitize text (test mode)

Dry-run sanitization with no audit logging, no receipts, no database writes. Use this for testing policies before production.

POST /api/v1/sanitize/test

The request and response shapes match POST /api/v1/sanitize, except that the request has no ai_model field.

Detokenize

Reverse tokenization — restore original entities from a token mapping returned by a previous tokenize call.

POST /api/v1/sanitize/detokenize

Request body

Field	Type	Required	Description
`content`	string	Yes	Tokenized text to reverse
`token_mapping`	object	Yes	Token-to-original mapping from the sanitize response

Example

curl -X POST https://api.airaproof.com/api/v1/sanitize/detokenize \
  -H "Authorization: Bearer $AIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Patient <NER_PERSON_001>, SSN <US_SSN_001>.",
    "token_mapping": {
      "<NER_PERSON_001>": "John Smith",
      "<US_SSN_001>": "123-45-6789"
    }
  }'

Response

{
  "content": "Patient John Smith, SSN 123-45-6789.",
  "request_id": "req_def456"
}
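
When the token mapping is already held client-side, the same restoration can be approximated locally. This is a sketch that assumes detokenization is plain string substitution of each token with its original value; `detokenize_locally` is a hypothetical helper and may differ from what the endpoint does internally.

```python
import re
from typing import Dict


def detokenize_locally(content: str, token_mapping: Dict[str, str]) -> str:
    """Replace every token in content with its original value in one pass."""
    if not token_mapping:
        return content
    # Longest tokens first, so a token that is a prefix of another cannot shadow it.
    tokens = sorted(token_mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    # A single regex pass means restored values are never re-scanned for tokens.
    return pattern.sub(lambda m: token_mapping[m.group(0)], content)
```

A single pass is used so that an original value which happens to look like a token is left as-is after restoration.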

Sanitize file

Upload an image, PDF, or DICOM file for scanning and optional redaction.

POST /api/v1/sanitize/file
Content-Type: multipart/form-data

Form fields

Field	Type	Required	Default	Description
`file`	file	Yes		Image (JPEG, PNG, GIF, BMP, TIFF), PDF, or DICOM
`policy`	string	No	`"default"`	Policy pack
`mode`	string	No	`"redact"`	`redact`, `tokenize`, `block`, `flag`
`ai_model`	string	No	`null`	Model ID for AI-assisted review
`include_pixel_redaction`	bool	No	`false`	Enable pixel-level redaction for images and DICOM

Max file size: 50 MB

File type detection

Files are identified by magic bytes, not the Content-Type header or extension. This prevents file-type spoofing.

Type	Magic signature
JPEG	`\xff\xd8\xff`
PNG	`\x89PNG\r\n\x1a\n`
PDF	`%PDF`
DICOM	`DICM` at byte offset 128
GIF	`GIF87a` / `GIF89a`
BMP	`BM`
TIFF	`II*\0` / `MM\0*`
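
Magic-byte detection of this kind can be sketched in a few lines, which is useful for a client-side pre-flight check before uploading. This is an illustration, not the service's implementation: `detect_file_type` is a hypothetical helper, the returned labels other than `pdf` and `dicom` are assumed names, and the TIFF entries use the standard TIFF header bytes (`II*\0` and `MM\0*`).

```python
from typing import Optional

# (byte offset, signature, label) checked in order.
SIGNATURES = [
    (0, b"\xff\xd8\xff", "jpeg"),
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"%PDF", "pdf"),
    (128, b"DICM", "dicom"),  # DICOM has a 128-byte preamble before the marker
    (0, b"GIF87a", "gif"),
    (0, b"GIF89a", "gif"),
    (0, b"BM", "bmp"),
    (0, b"II*\x00", "tiff"),  # little-endian TIFF
    (0, b"MM\x00*", "tiff"),  # big-endian TIFF
]


def detect_file_type(data: bytes) -> Optional[str]:
    """Return a type label based on leading bytes, or None if nothing matches."""
    for offset, signature, label in SIGNATURES:
        if data[offset:offset + len(signature)] == signature:
            return label
    return None
```

Because only the bytes are inspected, a PNG renamed to `report.pdf` is still reported as `png`.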

Mode behavior per file type

Mode	Image	PDF	DICOM
redact	Presidio pixel-level redaction	PyMuPDF in-place redaction	PS3.15 Annex E metadata + optional pixel redaction
tokenize	Pixel redaction + token mapping	In-place redaction + token mapping	PS3.15 de-identification + token mapping
block	Reject if PII found	Reject if PII found	Reject if PII found
flag	Scan only, return findings	Scan only, return findings	Scan only, return findings

Example

curl -X POST https://api.airaproof.com/api/v1/sanitize/file \
  -H "Authorization: Bearer $AIRA_API_KEY" \
  -F "file=@patient-record.pdf" \
  -F "policy=hipaa" \
  -F "mode=redact"

Response

{
  "file_type": "pdf",
  "original_filename": "patient-record.pdf",
  "findings": [
    {
      "entity_type": "ner_person",
      "severity": "warning",
      "action_taken": "redacted",
      "library": "pii_ner",
      "description": "NER: PERSON (score 0.92)",
      "count": 3
    },
    {
      "entity_type": "us_ssn",
      "severity": "critical",
      "action_taken": "redacted",
      "library": "pii",
      "description": "US Social Security Number",
      "count": 1
    }
  ],
  "blocked": false,
  "mode": "redact",
  "policy": "hipaa",
  "input_hash": "sha256:abc...",
  "output_hash": "sha256:def...",
  "download_token": "a1b2c3d4...",
  "download_url": "https://api.airaproof.com/api/v1/sanitize/file/a1b2c3d4.../download",
  "tokenized_text": "Patient [REDACTED]\\nSSN: [REDACTED]\\n...",
  "token_mapping": null,
  "dicom_tag_actions": null,
  "pixel_redactions": null,
  "request_id": "req_xyz789"
}

DICOM response (redact mode)

Responses for DICOM files include additional metadata about de-identification:

{
  "file_type": "dicom",
  "dicom_tag_actions": [
    { "tag_name": "PatientName", "tag_number": "(0010,0010)", "action": "removed", "original_hash": "sha256:..." },
    { "tag_name": "PatientID", "tag_number": "(0010,0020)", "action": "removed", "original_hash": "sha256:..." },
    { "tag_name": "PatientBirthDate", "tag_number": "(0010,0030)", "action": "removed", "original_hash": "sha256:..." }
  ],
  "pixel_redactions": [
    { "text_found": "John Smith", "bounding_box": [100, 50, 250, 80], "confidence": 0.91 }
  ]
}
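
For audit logs it can help to condense these two arrays into a short summary. The sketch below assumes the field names shown in the example above and treats `null` arrays (as returned for non-DICOM files) as empty; `summarize_dicom` is a hypothetical helper.

```python
from collections import Counter


def summarize_dicom(payload: dict) -> dict:
    """Count DICOM tag actions and pixel redaction regions in a file response."""
    # Both fields are null for non-DICOM files, so fall back to empty lists.
    tag_actions = payload.get("dicom_tag_actions") or []
    pixel_redactions = payload.get("pixel_redactions") or []
    return {
        "tags_by_action": dict(Counter(a["action"] for a in tag_actions)),
        "tag_names": [a["tag_name"] for a in tag_actions],
        "pixel_regions": len(pixel_redactions),
    }
```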

Download sanitized file

Download the cleaned file using the one-time token from the sanitize response.

GET /api/v1/sanitize/file/{token}/download

No authentication required — the token itself is the authorization.

Tokens expire after 1 hour
Tokens are single-use — the file is deleted after the first download
The response filename is sanitized_<original_name>.<ext>
The Content-Type header matches the original file type
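
Since a token can be used only once and only within an hour, a client may want to track that state itself instead of discovering it through a failed download. This is a minimal sketch; `DownloadToken` is a hypothetical class, and it assumes the one-hour window starts when the sanitize response is received, which may be slightly later than the server's own clock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

TOKEN_TTL = timedelta(hours=1)  # "Tokens expire after 1 hour"


@dataclass
class DownloadToken:
    token: str
    issued_at: datetime  # assumed: time the sanitize response was received
    used: bool = False

    @property
    def download_url(self) -> str:
        # Path from this reference; host taken from the examples.
        return f"https://api.airaproof.com/api/v1/sanitize/file/{self.token}/download"

    def is_usable(self, now: Optional[datetime] = None) -> bool:
        """True if the token has not been used and is still inside the TTL."""
        now = now or datetime.now(timezone.utc)
        return not self.used and now < self.issued_at + TOKEN_TTL

    def mark_used(self) -> None:
        """Call after the first download; the file is deleted server-side."""
        self.used = True
```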

Example

curl -o sanitized-record.pdf \
  https://api.airaproof.com/api/v1/sanitize/file/a1b2c3d4.../download

Findings object

Text and file sanitize responses include a findings array. Each entry has these fields:

Field	Type	Description
`entity_type`	string	What was detected (e.g., `us_ssn`, `ner_person`, `mrn_pattern`)
`severity`	string	`critical`, `warning`, or `info`
`action_taken`	string	`redacted`, `tokenized`, `blocked`, `flagged`
`library`	string	Which scanner found it (e.g., `pii`, `pii_ner`, `credentials`, `healthcare`, `prompt_injection`)
`description`	string	Human-readable description
`count`	integer	How many instances of this entity type were found

Severity behavior

Severity	Redact/Tokenize	Block	Flag
`critical`	Replaced	Blocked	Flagged
`warning`	Replaced	Blocked	Flagged
`info`	Not replaced (flagged only)	Blocked	Flagged

Info-severity entities (IP addresses, URLs, organization names) are reported in findings but never redacted or tokenized. This prevents false positives like "HDL Cholesterol" being blacked out.
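
The table above can be expressed as a small predicate, for example to estimate how many entities were actually replaced in a response. This is a sketch of the documented behavior, not the service's code; `is_replaced` and `replaced_count` are hypothetical helpers.

```python
from typing import Iterable, Mapping

REPLACING_MODES = {"redact", "tokenize"}
REPLACED_SEVERITIES = {"critical", "warning"}


def is_replaced(severity: str, mode: str) -> bool:
    """True when a finding of this severity is replaced in the output for this mode."""
    # Info-severity findings are reported but never redacted or tokenized.
    return mode in REPLACING_MODES and severity in REPLACED_SEVERITIES


def replaced_count(findings: Iterable[Mapping], mode: str) -> int:
    """Sum the counts of findings that were replaced under the given mode."""
    return sum(f["count"] for f in findings if is_replaced(f["severity"], mode))
```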

Error responses

Status	Code	Description
413	—	File exceeds 50 MB
415	—	Unsupported file type
422	—	Empty file, corrupt file, or missing dependency
422	`OUTPUT_SCAN_VIOLATION`	Block mode triggered
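
A client can map these statuses to messages in one place. The sketch below mirrors the table; `describe_error` is a hypothetical helper, and because this reference does not show where the error code appears in the response body, the code is passed in as a separate argument.

```python
from typing import Optional

# Statuses from the error table that have a single meaning.
STATUS_MESSAGES = {
    413: "File exceeds 50 MB",
    415: "Unsupported file type",
}


def describe_error(status: int, code: Optional[str] = None) -> str:
    """Return a human-readable message for a sanitize error response."""
    if status == 422:
        # 422 is shared by two cases, distinguished by the error code.
        if code == "OUTPUT_SCAN_VIOLATION":
            return "Block mode triggered"
        return "Empty file, corrupt file, or missing dependency"
    return STATUS_MESSAGES.get(status, f"Unexpected status {status}")
```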
