Output filtering

Scan what your agents actually output — flag leaks, deny bad receipts, or redact matched spans before they're hashed into the signed payload.

Aira already blocks bad inputs with content-scan policies. Phase 2 extends the same regex infrastructure to the outcome the agent reports at notarize time. That closes the "prompt-injected credentials leaked through a tool response" gap.

What gets scanned

The outcome_details string (or dict) you pass to notarize() is scanned against the same pattern libraries used for input:

credentials — AWS access keys, GitHub PATs, GitLab PATs, Slack tokens, Stripe secrets, Google API keys, JWTs, private key PEMs, generic KEY=VALUE assignments, basic-auth URLs.
pii — emails, phone numbers, IPv4/IPv6, SSNs, passports, IBANs, credit cards (Luhn-validated).
prompt_injection — role-switch attempts, jailbreak markers, exfiltration intent, encoded payload markers, tool-override attempts — in case the output is what the next agent ingests.
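As a rough illustration of library-based scanning, the sketch below runs a few regexes over an outcome and records one hit per pattern. The pattern set, the severities assigned to each pattern, and the choice to serialize a dict outcome with json.dumps are all assumptions made for the example; the real libraries are larger and their regexes are not shown on this page.

```python
import json
import re

# Hypothetical, simplified patterns: (name, library, assumed severity, regex).
PATTERNS = [
    ("aws_access_key", "credentials", "critical", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("email", "pii", "info", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("credit_card", "pii", "warning", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),
]


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def scan(outcome) -> list[dict]:
    """Scan a string or dict outcome and return one hit record per matching pattern."""
    # Assumption: dict outcomes are scanned in their JSON-serialized form.
    text = outcome if isinstance(outcome, str) else json.dumps(outcome)
    hits = []
    for name, library, severity, regex in PATTERNS:
        matches = [
            m for m in regex.finditer(text)
            if name != "credit_card" or luhn_valid(m.group())
        ]
        if matches:
            hits.append({
                "name": name,
                "library": library,
                "severity": severity,
                "matches": len(matches),
                "sample": "[REDACTED]",  # the raw fragment is never copied into the hit
            })
    return hits
```

The Luhn step is what keeps arbitrary 16-digit numbers from being reported as credit cards: a candidate only counts as a hit if its checksum is valid.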

Three modes

Configured per-org. All three produce the same output_scan_flags blob on the receipt; they differ in what else they do.

`flag` (default)

Scan the outcome, record every hit on the receipt, don't block anything. The receipt is still signed — output_scan_flags.decision tells you whether a human should look at it.

Use when: you want visibility without changing runtime behaviour.

`deny`

If the worst-severity hit meets the org's deny_severity_threshold (default critical), the notarize call returns 422 with code: "OUTPUT_SCAN_VIOLATION". The action stays in its pre-notarize state (authorized or approved) — no receipt is minted, no transaction is committed. The caller can retry with a clean outcome.

Use when: the org's compliance posture says "a receipt covering a leaked credential is worse than no receipt at all."
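The threshold comparison can be sketched as follows. The severity names info, warning, and critical appear on this page, but their ordering here and the function names are assumptions for illustration, not the service's actual code.

```python
# Assumed ordering, lowest to highest.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def worst_severity(hits: list[dict]):
    """Return the highest severity among the hits, or None when there are none."""
    return max(
        (h["severity"] for h in hits),
        key=SEVERITY_ORDER.__getitem__,
        default=None,
    )


def should_deny(hits: list[dict], deny_severity_threshold: str = "critical") -> bool:
    """Deny only when the worst hit is at or above the configured threshold."""
    worst = worst_severity(hits)
    if worst is None:
        return False
    return SEVERITY_ORDER[worst] >= SEVERITY_ORDER[deny_severity_threshold]
```

Under these assumptions, an outcome whose only hit is an info-level email passes at the default threshold and is denied only if the threshold is lowered to info.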

`redact`

Before the outcome is hashed, matched spans are replaced with [REDACTED]. The receipt signs over the cleaned bytes. The raw matched fragment never reaches the receipt row, never reaches the signed canonical JSON, never reaches logs.

Use when: you want a full audit trail AND you can't tolerate the leaked content ending up in downstream systems that read the hash or the signature.

Note: the hash of the cleaned bytes will not match the hash of the original outcome. If your verification flow compares outcome hashes against a separate source of truth, that system has to see the same redacted outcome.
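A minimal sketch of why the two hashes diverge. It assumes SHA-256 over UTF-8 bytes and a single illustrative pattern; the hash algorithm and encoding the service actually uses are not stated on this page.

```python
import hashlib
import re

# Illustrative pattern only; the real credential library covers many more formats.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact(outcome: str, patterns=(AWS_KEY,)) -> str:
    """Replace every matched span with the literal marker [REDACTED]."""
    cleaned = outcome
    for regex in patterns:
        cleaned = regex.sub("[REDACTED]", cleaned)
    return cleaned


def outcome_hash(text: str) -> str:
    """Assumed hashing scheme: SHA-256 over the UTF-8 encoding of the outcome."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


raw = "uploaded report using AKIAIOSFODNN7EXAMPLE"
cleaned = redact(raw)
# A verifier holding only `raw` computes a different digest than the receipt carries.
hashes_match = outcome_hash(raw) == outcome_hash(cleaned)
```

Any external system that needs to reproduce the receipt's hash has to apply the same redaction first, or be handed the cleaned outcome directly.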

Configure

Dashboard

Settings → Output Filtering (admin-only). Toggle enabled, pick the mode, check libraries, pick severity thresholds, save.

SDK

from aira import Aira

client = Aira(api_key="aira_live_...")

# Read current policy
policy = client.get_output_policy()
print(policy.mode, policy.libraries)

# Switch to deny mode at the critical threshold
client.update_output_policy(
    mode="deny",
    deny_severity_threshold="critical",
    libraries=["credentials", "pii"],
)

import { Aira } from "aira-sdk";

const aira = new Aira({ apiKey: "aira_live_..." });

// Read current policy
const policy = await aira.getOutputPolicy();

// Switch to redact mode at the warning threshold
await aira.updateOutputPolicy({
  mode: "redact",
  redact_severity_threshold: "warning",
});

curl -X PATCH https://api.airaproof.com/api/v1/output-policies \
  -H "Authorization: Bearer $AIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"mode": "deny", "deny_severity_threshold": "critical"}'

PATCH merges — omitted fields are preserved. You can't partially update the libraries list (that field is replaced atomically when supplied); send the full desired list.
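The merge semantics described above behave like a shallow dictionary update. The sketch below models that locally; the field values in the usage note and the helper names are hypothetical, and the server-side implementation may differ.

```python
def apply_patch(current: dict, patch: dict) -> dict:
    """Shallow merge: omitted fields are kept, supplied fields are replaced whole."""
    merged = dict(current)
    merged.update(patch)  # a supplied `libraries` list overwrites the old list entirely
    return merged


def patch_without_library(current: dict, name: str) -> dict:
    """Build a PATCH body that drops one library by sending the full desired list."""
    return {"libraries": [lib for lib in current["libraries"] if lib != name]}
```

For example, to stop scanning for prompt_injection you would compute the remaining list on the client and send it in full, since there is no way to remove a single entry from the list in place.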

What's on the receipt

After any notarize that mints a receipt while output filtering is enabled, the receipt row carries output_scan_flags:

{
  "scanned_at": "2026-04-15T12:00:00Z",
  "libraries": ["pii", "credentials", "prompt_injection"],
  "mode": "deny",
  "decision": "deny",
  "worst_severity": "critical",
  "hits": [
    {
      "name": "aws_access_key",
      "library": "credentials",
      "severity": "critical",
      "description": "AWS access key ID",
      "matches": 1,
      "sample": "[REDACTED]"
    }
  ]
}

sample is always "[REDACTED]" — even in flag mode where nothing was actually redacted in the outcome. The hit metadata is enough for a reviewer to find the action without exposing the raw secret.

This blob is part of the signed canonical JSON, so a verifier reproducing the signature will fail if the flags were tampered with after notarize.
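To illustrate why tampering is detectable, the sketch below signs a canonical JSON encoding of a receipt and re-verifies it after the flags are edited. Both the canonicalization (sorted keys, compact separators) and the use of HMAC-SHA256 are stand-ins chosen so the example runs with the standard library; this page does not describe Aira's actual canonical form or signature scheme.

```python
import copy
import hashlib
import hmac
import json


def canonical_json(payload: dict) -> bytes:
    """Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(payload: dict, key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over the canonical bytes."""
    return hmac.new(key, canonical_json(payload), hashlib.sha256).hexdigest()


def verify(payload: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(payload, key), signature)


key = b"demo-key-not-a-real-secret"
receipt = {
    "outcome_hash": "abc123",
    "output_scan_flags": {"mode": "flag", "worst_severity": "critical"},
}
signature = sign(receipt, key)

# Downgrading the recorded severity after the fact changes the canonical bytes.
tampered = copy.deepcopy(receipt)
tampered["output_scan_flags"]["worst_severity"] = "info"
```

Because the flags sit inside the signed payload rather than next to it, editing them has the same effect on verification as editing the outcome hash itself.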

What gets blocked vs. logged

| Mode | Receipt minted? | Outcome hashed over |
| --- | --- | --- |
| `flag` | Yes, with flags | Raw outcome |
| `deny` (threshold met) | No, 422 to caller | N/A |
| `deny` (threshold not met) | Yes, with flags | Raw outcome |
| `redact` | Yes, with flags | Cleaned outcome |
| (output filtering disabled) | Yes | Raw outcome, `output_scan_flags = null` |
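The table can also be read as a small decision function. The sketch below encodes each row; the type and function names are hypothetical and exist only to make the rows testable.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    receipt_minted: bool
    hashed_over: Optional[str]  # "raw", "cleaned", or None when nothing is hashed
    flags_on_receipt: bool


def plan_notarize(enabled: bool, mode: str, deny_threshold_met: bool = False) -> Plan:
    """Map a policy state to the outcome described in the table."""
    if not enabled:
        return Plan(True, "raw", False)  # output_scan_flags is null
    if mode == "flag":
        return Plan(True, "raw", True)
    if mode == "deny":
        if deny_threshold_met:
            return Plan(False, None, False)  # 422 to caller, no receipt
        return Plan(True, "raw", True)
    if mode == "redact":
        return Plan(True, "cleaned", True)
    raise ValueError(f"unknown mode: {mode!r}")
```

Note that redact is the only mode in which the receipt's hash covers something other than the raw outcome.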

Gotchas

Redact invalidates raw-outcome hash comparisons. If you have a separate pipeline that hashes the raw outcome and compares it to the receipt's outcome_hash, switch that pipeline to the cleaned outcome or move to flag mode.
Deny threshold too low = agent failures. Set deny_severity_threshold = "info" and any random email address in the outcome fails notarize. Start at critical and walk down if you want tighter coverage.
Prompt-injection patterns fire on the output too — sometimes intentionally, when the agent is summarizing a suspicious input. Use per-library enabling to exclude prompt_injection from outcome scans if that creates noise.
Disabled != not configured. A fresh org has no stored policy (the column is empty), and the service merges that empty value with defaults, so output filtering is ON by default in flag mode. Set enabled: false to truly opt out.
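The last gotcha can be sketched as a defaults merge. The three default values below come from this page (enabled, flag mode, critical deny threshold); the function name and the idea that the stored value is a plain dict or None are assumptions.

```python
# Defaults stated on this page; other policy fields are omitted because
# their default values are not documented here.
DEFAULT_POLICY = {
    "enabled": True,
    "mode": "flag",
    "deny_severity_threshold": "critical",
}


def effective_policy(stored) -> dict:
    """Overlay whatever the org has stored (possibly nothing) on the defaults."""
    return {**DEFAULT_POLICY, **(stored or {})}
```

An org that has never touched the settings therefore resolves to an enabled flag-mode policy, and only an explicit enabled: false changes that.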

Feature flag

The whole endpoint group is gated by ENABLE_OUTPUT_FILTERING on the backend. When that flag is off globally, the notarize hook skips the scan, the /output-policies routes return 404, and receipts carry output_scan_flags = null. Useful for self-hosted deployments that don't need this feature.
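A rough sketch of how such a gate might behave, assuming ENABLE_OUTPUT_FILTERING is read from the process environment and that values like "true" or "1" count as on. Both of those details, and the function names, are assumptions; the backend's actual flag handling is not shown on this page.

```python
import os
from typing import Callable, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted spellings


def output_filtering_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Treat a missing or unrecognized value as off."""
    return env.get("ENABLE_OUTPUT_FILTERING", "").strip().lower() in TRUTHY


def policy_route_status(env: Mapping[str, str]) -> int:
    """The /output-policies routes answer 404 while the feature is off."""
    return 200 if output_filtering_enabled(env) else 404


def scan_flags_for_receipt(
    outcome: str, env: Mapping[str, str], scan: Callable[[str], dict]
) -> Optional[dict]:
    """Skip the scan entirely when the flag is off, leaving output_scan_flags null."""
    if not output_filtering_enabled(env):
        return None
    return scan(outcome)
```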
