Aira

Content Scan Policies

Regex pattern libraries that catch PII, leaked credentials, and prompt injection attempts before the action executes. No LLM call and no network round-trip: the scan runs in-process.

What it is

content_scan is the fourth Aira policy mode, alongside rules, ai, and consensus. It runs in-process before the action executes and matches the action's details field against curated pattern libraries plus optional org-specific custom regex.

Looking for the Sanitize API? Content scan policies block actions in the authorization flow. If you need to clean content (redact PII from text, de-identify files, tokenize data), use the Sanitize API instead. See Sanitize vs Content Scan Policies for a full comparison.

Severity decides the verdict:

| Severity | Verdict |
| --- | --- |
| critical | Deny: the action is blocked |
| warning | Require approval: the action is held for human review |
| info | Allow: the hit is logged but the action proceeds |

The scanner combines regex patterns with Presidio NER (Named Entity Recognition) for context-aware detection. It never logs the matched secret in plaintext — every hit is redacted to first/last 4 characters before it touches an audit row, a webhook payload, or a UI surface.
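
As a rough illustration of the first/last 4 character redaction described above, here is a minimal Python sketch. The function name `redact`, the `...` separator, and the handling of very short values are assumptions made for this example, not Aira's actual implementation.

```python
def redact(match: str, keep: int = 4) -> str:
    """Keep only the first and last `keep` characters of a matched value."""
    # Assumption: a value too short to redact meaningfully is masked entirely,
    # otherwise the "redacted" sample would reveal the whole string.
    if len(match) <= 2 * keep:
        return "*" * len(match)
    return f"{match[:keep]}...{match[-keep:]}"


# AWS's documented example access key id, not a real credential.
print(redact("AKIAIOSFODNN7EXAMPLE"))  # AKIA...MPLE
```

Redacting at the point of detection means downstream consumers only ever receive the shortened sample.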

For standalone content sanitization (redact, tokenize, block, flag) outside the policy engine, see the Sanitize guide.


Built-in libraries

pii

| Pattern | Severity |
| --- | --- |
| US Social Security Number | critical |
| IBAN bank account number | critical |
| US passport number | critical |
| Credit card (Luhn-checked) | critical |
| Email address | warning |
| International phone number | warning |
| IPv4 address | info |
| IPv6 address | info |

Credit card candidates must pass a Luhn checksum, so 4111111111111112 (realistic-looking but invalid) does not match. This removes a common source of false positives in PII scanning: arbitrary digit strings that merely look like card numbers.
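
The Luhn check itself is a standard algorithm. The sketch below shows how it separates the two numbers discussed above; the function name `luhn_valid` is chosen for this example and is not taken from Aira's code.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isascii() and c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # True  (well-known test card number)
print(luhn_valid("4111111111111112"))  # False (last digit breaks the checksum)
```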

credentials

| Pattern | Severity |
| --- | --- |
| AWS access key id (AKIA/ASIA/AROA/AIDA...) | critical |
| AWS secret key candidate (40-char base64) | critical |
| GitHub PAT (ghp_/gho_/ghs_/ghu_/ghr_) | critical |
| GitLab PAT (glpat-) | critical |
| Slack token (xoxa-/xoxb-/xoxp-/xoxr-/xoxs-) | critical |
| Stripe secret key (sk_live_/sk_test_/rk_live_/rk_test_) | critical |
| Google API key (AIza...) | critical |
| Azure storage account key | critical |
| PEM private key header | critical |
| Basic-auth URL (postgres://user:pass@...) | critical |
| JSON Web Token (eyJ...) | warning |
| Generic api_key/secret/password/token assignment | warning |
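
To give a feel for what prefix-based credential patterns look like, here is a small Python sketch covering four of the rows above. The regexes are illustrative approximations written for this example. They are not Aira's shipped patterns, and the names in `CREDENTIAL_PATTERNS` are hypothetical.

```python
import re

# Illustrative approximations only; the real library's regexes may differ.
CREDENTIAL_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA|AROA|AIDA)[A-Z0-9]{16}\b"),
    "github_pat": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "pem_private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
    "basic_auth_url": re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s/@]+@"),
}


def find_credentials(text: str) -> list[str]:
    """Return the sorted names of every pattern that matches `text`."""
    return sorted(
        name for name, pattern in CREDENTIAL_PATTERNS.items() if pattern.search(text)
    )


print(find_credentials("export KEY=AKIAIOSFODNN7EXAMPLE"))
print(find_credentials("DATABASE_URL=postgres://user:pass@db.internal/app"))
print(find_credentials("see https://example.com/docs"))
```

Fixed vendor prefixes such as `AKIA` or `ghp_` are what make these patterns precise: a URL without embedded credentials, like the last input, produces no hits.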

prompt_injection

| Pattern | Severity |
| --- | --- |
| Ignore-previous-instructions style override | warning |
| Role-switch / jailbreak ("you are now...") | warning |
| Embedded system: / assistant: role markers | warning |
| DAN / dev-mode / sudo mode invocations | critical |
| System-prompt or secret exfiltration attempts | warning |
| Encoded payload smuggling markers ("base64 decode...") | info |
| Tool elevation attempts ("call X as admin") | warning |

The pattern set is curated to minimize false positives on benign user content. If you need stricter matching, add your own regexes through the policy's custom_patterns field.

healthcare (add-on)

Available with the hipaa policy pack in the Sanitize API.

| Pattern | Severity |
| --- | --- |
| Medical Record Number (MRN-, MR#, PAT-, Patient ID:) | critical |
| ICD-10 diagnosis code | warning |
| NPI (National Provider Identifier, Luhn-validated) | critical |
| DEA number (checksum-validated) | critical |
| Drug name (200+ common prescriptions) | warning |
| VIN (Vehicle Identification Number) | warning |
| Account number patterns | warning |
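
The NPI and DEA checks in the table rely on publicly documented checksums. The sketch below shows one way to validate them in Python. The function names are chosen for this example, and the DEA prefix check is deliberately loose, so treat it as an approximation rather than Aira's validator.

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over a string of ASCII digits."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit = digit * 2 - 9 if digit * 2 > 9 else digit * 2
        total += digit
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """An NPI is 10 digits; its check digit is Luhn over '80840' + NPI."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    return luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """A DEA number is two prefix characters followed by seven digits."""
    body = dea[2:]
    if len(dea) != 9 or not (body.isascii() and body.isdigit()):
        return False
    # Loose prefix check (assumption): a letter, then a letter or digit.
    if not (dea[0].isalpha() and dea[1].isalnum()):
        return False
    d = [int(c) for c in body]
    # Digits 1, 3, 5 plus twice digits 2, 4, 6; the last digit of the sum
    # must equal the seventh digit.
    checksum = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return checksum % 10 == d[6]


print(npi_valid("1234567893"))  # True
print(dea_valid("AB1234563"))   # True
```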

NER (Named Entity Recognition)

In addition to regex, the pii library enables Presidio NER — spaCy-powered entity detection that catches names, addresses, dates of birth, organizations, and other entities that require natural language understanding. NER entities are prefixed with ner_ (e.g., ner_person, ner_location).

NER entities reported at info severity (IP addresses, URLs, nationality/religion/political group) are flagged but never redacted, to avoid false positives on benign content.


Create a content_scan policy

Dashboard

Dashboard → Policies → New policy. Pick the Content scan mode. Toggle the libraries you want, optionally add custom regex rows, then save.

API

curl -X POST https://api.airaproof.com/api/v1/policies \
  -H "Authorization: Bearer aira_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Block PII in customer messages",
    "mode": "content_scan",
    "priority": 100,
    "scan_config": {
      "libraries": ["pii", "credentials"],
      "custom_patterns": [
        {
          "name": "internal_code",
          "regex": "INTERNAL-CODE-\\d+",
          "severity": "critical",
          "description": "Org-internal classification codes"
        }
      ]
    },
    "decision": "deny"
  }'

scan_config requires at least one library OR one custom pattern; an empty config is rejected with HTTP 422.
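
If you build policies programmatically, a client-side pre-check can catch an empty or malformed scan_config before the request is sent. The following Python sketch is one possible approach. `check_scan_config` is a hypothetical helper, and Python's `re` syntax may not match the server's regex engine exactly, so the API remains the source of truth.

```python
import re

ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def check_scan_config(scan_config: dict) -> list[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    libraries = scan_config.get("libraries") or []
    custom_patterns = scan_config.get("custom_patterns") or []

    # Mirrors the documented rule: at least one library OR one custom pattern.
    if not libraries and not custom_patterns:
        problems.append("scan_config needs at least one library or custom pattern")

    for pattern in custom_patterns:
        name = pattern.get("name", "<unnamed>")
        if pattern.get("severity") not in ALLOWED_SEVERITIES:
            problems.append(f"{name}: unknown severity {pattern.get('severity')!r}")
        regex = pattern.get("regex")
        if not regex:
            problems.append(f"{name}: missing regex")
            continue
        try:
            re.compile(regex)
        except re.error as exc:
            problems.append(f"{name}: invalid regex ({exc})")
    return problems


config = {
    "libraries": ["pii", "credentials"],
    "custom_patterns": [
        {"name": "internal_code", "regex": r"INTERNAL-CODE-\d+", "severity": "critical"}
    ],
}
print(check_scan_config(config))  # []
print(check_scan_config({}))
```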


How it integrates with the policy engine

A content_scan policy is just another row in the priority-ordered policy list. When authorize() runs:

  1. The action's details field (or its JSON serialization, if it's a dict) is fed through the scanner.
  2. Every pattern in every enabled library plus every custom regex is matched.
  3. The worst severity in the hit list decides the verdict (critical → deny, warning → require_approval, info → allow).
  4. The scanner's hit list (with redacted samples) is persisted in the PolicyEvaluation.model_votes JSON column so the dashboard can render the evaluation later without re-scanning.
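
The steps above can be sketched in a few lines of Python. This is a simplified model and not Aira's scanner: the three patterns are stand-ins, the `scan` function name is hypothetical, and returning "allow" when nothing matches is an assumption.

```python
import json
import re

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
VERDICT = {"critical": "deny", "warning": "require_approval", "info": "allow"}

# Stand-in patterns for illustration: (name, library, severity, regex).
PATTERNS = [
    ("us_ssn", "pii", "critical", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("email", "pii", "warning", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("ipv4", "pii", "info", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def scan(details):
    """Return (verdict, hits) for an action's details field."""
    # Step 1: dict details are scanned as their JSON serialization.
    text = details if isinstance(details, str) else json.dumps(details)

    # Step 2: run every pattern and record the ones that matched.
    hits = []
    for name, library, severity, pattern in PATTERNS:
        count = len(pattern.findall(text))
        if count:
            hits.append(
                {"name": name, "library": library, "severity": severity, "matches": count}
            )

    # Step 3: the worst severity decides the verdict.
    if not hits:
        return "allow", hits  # assumption: no hits means nothing to block
    worst = max((hit["severity"] for hit in hits), key=SEVERITY_RANK.__getitem__)
    return VERDICT[worst], hits


verdict, hits = scan({"message": "SSN 123-45-6789, mail a@b.co"})
print(verdict, [hit["name"] for hit in hits])
```

Because the verdict depends only on the worst hit, adding more low-severity matches never changes the outcome once a critical pattern has fired.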

A deny from a high-priority content_scan policy stops the rest of the policy chain, exactly like a deny from any other mode.


What gets logged

For an SSN match in a customer message, the persisted model_votes looks like:

{
  "scan": [
    {
      "name": "us_ssn",
      "library": "pii",
      "severity": "critical",
      "matches": 1,
      "sample": "123...6789"
    }
  ],
  "worst_severity": "critical"
}

The sample is redacted. The full secret never appears in the database, in audit rows, in webhook payloads, or in the dashboard. There is no flag to disable redaction. If you need the full plaintext for forensics, fetch it through the original action's details_storage_key (the stored details are themselves encrypted at rest).
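
A consumer of the persisted evaluation, such as an internal report, only needs the fields shown in the example above. The Python sketch below parses that structure and produces a short summary. The `summarize` helper and its output format are made up for this illustration.

```python
import json

# The example payload from this page, as it might be read back from storage.
model_votes_json = """
{
  "scan": [
    {
      "name": "us_ssn",
      "library": "pii",
      "severity": "critical",
      "matches": 1,
      "sample": "123...6789"
    }
  ],
  "worst_severity": "critical"
}
"""


def summarize(model_votes: dict) -> list[str]:
    """Render one line for the overall severity plus one line per hit."""
    lines = [f"worst severity: {model_votes['worst_severity']}"]
    for hit in model_votes["scan"]:
        lines.append(
            f"{hit['library']}/{hit['name']} [{hit['severity']}] "
            f"x{hit['matches']} sample={hit['sample']}"
        )
    return lines


for line in summarize(json.loads(model_votes_json)):
    print(line)
```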


When to use content_scan vs ai vs rules

| Use case | Recommended mode |
| --- | --- |
| "Block any action that contains an AWS key" | content_scan (zero latency, deterministic) |
| "Block actions involving customer PII or financial data over €5,000" | ai (context-dependent judgement) |
| "Block every wire transfer above 10k" | rules (deterministic field match) |
| "Multi-model vote on a loan decision" | consensus |

content_scan is the right answer whenever the rule is "this byte sequence must never appear." It is faster and cheaper than ai mode, deterministic, and produces auditable output without an LLM round-trip.
