Output filtering
Scan what your agents actually output — flag leaks, deny bad receipts, or redact matched spans before they're hashed into the signed payload.
Aira already blocks bad inputs with content-scan policies. Phase 2 extends the same regex infrastructure to the outcome the agent reports at notarize time. That closes the "prompt-injected credentials leaked through a tool response" gap.
What gets scanned
The outcome_details string (or dict) you pass to notarize() is
scanned against the same pattern libraries used for input:
- credentials: AWS access keys, GitHub PATs, GitLab PATs, Slack tokens, Stripe secrets, Google API keys, JWTs, private key PEMs, generic KEY=VALUE assignments, basic-auth URLs.
- pii: emails, phone numbers, IPv4/IPv6 addresses, SSNs, passports, IBANs, credit cards (Luhn-validated).
- prompt_injection: role-switch attempts, jailbreak markers, exfiltration intent, encoded payload markers, tool-override attempts. These are scanned in case the output is what the next agent ingests.
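To make the library idea concrete, here is a minimal sketch of a regex scan with Luhn validation for card numbers. The three patterns, the `PATTERNS` table, and the `scan` and `luhn_valid` helpers are illustrative assumptions, not Aira's actual pattern definitions.

```python
import re

# Illustrative patterns only; the real libraries are not reproduced here.
PATTERNS = {
    "aws_access_key": ("credentials", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    "email": ("pii", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    "credit_card": ("pii", re.compile(r"\b(?:\d[ -]?){13,19}\b")),
}


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def scan(text: str) -> list[dict]:
    """Return one hit per pattern that matched, with a match count."""
    hits = []
    for name, (library, pattern) in PATTERNS.items():
        matches = pattern.findall(text)
        if name == "credit_card":
            # Drop digit runs that merely look like card numbers.
            matches = [m for m in matches if luhn_valid(m)]
        if matches:
            hits.append({"name": name, "library": library, "matches": len(matches)})
    return hits
```

The Luhn step is what keeps arbitrary 16-digit identifiers from being reported as card numbers.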
Three modes
Configured per-org. All three produce the same output_scan_flags
blob on the receipt; they differ in what else they do.
flag (default)
Scan the outcome, record every hit on the receipt, don't block
anything. The receipt is still signed — output_scan_flags.decision
tells you whether a human should look at it.
Use when: you want visibility without changing runtime behaviour.
deny
If the worst-severity hit meets the org's deny_severity_threshold
(default critical), the notarize call returns 422 with
code: "OUTPUT_SCAN_VIOLATION". The action stays in its
pre-notarize state (authorized or approved) — no receipt is
minted, no transaction is committed. The caller can retry with a
clean outcome.
Use when: the org's compliance posture says "a receipt covering a leaked credential is worse than no receipt at all."
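A caller-side sketch of the retry path might look like the following. Only the 422 status and the OUTPUT_SCAN_VIOLATION code come from the behaviour described above; the `send` transport, the `scrub` cleaner, and both function names are hypothetical stand-ins for whatever your integration uses.

```python
from typing import Callable


def is_output_scan_violation(status_code: int, body: dict) -> bool:
    """True when a notarize response is the deny-mode rejection."""
    return status_code == 422 and body.get("code") == "OUTPUT_SCAN_VIOLATION"


def notarize_with_retry(
    send: Callable[[str], tuple[int, dict]],  # hypothetical: outcome -> (status, body)
    outcome: str,
    scrub: Callable[[str], str],  # hypothetical caller-supplied cleaner
) -> tuple[int, dict]:
    """Send the outcome once, and once more with a scrubbed outcome if denied."""
    status, body = send(outcome)
    if is_output_scan_violation(status, body):
        # No receipt was minted, so the same action can be notarized again.
        status, body = send(scrub(outcome))
    return status, body
```

The retry is capped at one attempt so that a scrubber that fails to remove the offending span cannot loop forever.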
redact
Before the outcome is hashed, matched spans are replaced with
[REDACTED]. The receipt signs over the cleaned bytes. The raw
matched fragment never reaches the receipt row, never reaches the
signed canonical JSON, never reaches logs.
Use when: you want a full audit trail AND you can't tolerate the leaked content ending up in downstream systems that read the hash or the signature.
Note: the hash of the cleaned bytes will not match the hash of the original outcome. If your verification flow compares outcome hashes against a separate source of truth, that system has to see the same redacted outcome.
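The hash mismatch is easy to reproduce locally. The sketch below assumes SHA-256 over UTF-8 bytes and a single illustrative AWS-key pattern; the actual hash function and pattern set are not specified here, so treat both as placeholders.

```python
import hashlib
import re

# Illustrative pattern; stands in for whichever library entries matched.
AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact(outcome: str) -> str:
    """Replace every matched span with the literal marker [REDACTED]."""
    return AWS_KEY.sub("[REDACTED]", outcome)


def outcome_hash(outcome: str) -> str:
    """Hash the outcome text (SHA-256 is an assumption, not a documented choice)."""
    return hashlib.sha256(outcome.encode("utf-8")).hexdigest()


raw = "uploaded with key AKIAIOSFODNN7EXAMPLE"
cleaned = redact(raw)

# Any system that only ever saw `raw` computes a different digest.
hashes_match = outcome_hash(raw) == outcome_hash(cleaned)
```

A downstream verifier has to apply the same redaction, or be handed the cleaned outcome, before its digest can agree with the receipt.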
Configure
Dashboard
Settings → Output Filtering (admin-only). Toggle enabled, pick the mode, check libraries, pick severity thresholds, save.
SDK
from aira import Aira

client = Aira(api_key="aira_live_...")

# Read current policy
policy = client.get_output_policy()
print(policy.mode, policy.libraries)

# Switch to deny mode at the critical threshold
client.update_output_policy(
    mode="deny",
    deny_severity_threshold="critical",
    libraries=["credentials", "pii"],
)

import { Aira } from "aira-sdk";

const aira = new Aira({ apiKey: "aira_live_..." });
const policy = await aira.getOutputPolicy();
await aira.updateOutputPolicy({
  mode: "redact",
  redact_severity_threshold: "warning",
});

curl -X PATCH https://api.airaproof.com/api/v1/output-policies \
  -H "Authorization: Bearer $AIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"mode": "deny", "deny_severity_threshold": "critical"}'

PATCH merges: omitted fields are preserved. You can't partially
update the libraries list (that field is replaced atomically when
supplied); send the full desired list.
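The merge rule can be modelled in a few lines. This is a sketch of the semantics only: `apply_patch` and the stored-policy dict are hypothetical, and the server's real implementation may differ.

```python
def apply_patch(stored: dict, patch: dict) -> dict:
    """Shallow merge: supplied fields overwrite, omitted fields are kept."""
    merged = dict(stored)
    for field, value in patch.items():
        # Lists such as `libraries` are replaced whole, never appended to.
        merged[field] = list(value) if isinstance(value, list) else value
    return merged


stored = {
    "mode": "flag",
    "deny_severity_threshold": "critical",
    "libraries": ["credentials", "pii"],
}

# Omitting `libraries` leaves the stored list untouched.
after_mode_change = apply_patch(stored, {"mode": "deny"})

# Supplying `libraries` drops anything not listed, here "pii".
after_library_change = apply_patch(stored, {"libraries": ["credentials"]})
```

To add one library, read the current list, append to it client-side, and send the full result.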
What's on the receipt
After a notarize under any mode (other than disabled), the
receipt row carries output_scan_flags:
{
  "scanned_at": "2026-04-15T12:00:00Z",
  "libraries": ["pii", "credentials", "prompt_injection"],
  "mode": "deny",
  "decision": "deny",
  "worst_severity": "critical",
  "hits": [
    {
      "name": "aws_access_key",
      "library": "credentials",
      "severity": "critical",
      "description": "AWS access key ID",
      "matches": 1,
      "sample": "[REDACTED]"
    }
  ]
}

sample is always "[REDACTED]", even in flag mode where
nothing was actually redacted in the outcome. The hit metadata is
enough for a reviewer to find the action without exposing the raw
secret.
This blob is part of the signed canonical JSON, so a verifier reproducing the signature will fail if the flags were tampered with after notarize.
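The tamper-evidence property can be illustrated with a toy signer. The receipt's real signature scheme and canonicalization rules are not described here, so this sketch substitutes HMAC-SHA256 and a sorted-key compact JSON encoding; `canonical_json`, `sign`, and `verify` are hypothetical helpers.

```python
import hashlib
import hmac
import json


def canonical_json(payload: dict) -> bytes:
    """Sorted keys and compact separators: one common canonical form (assumed)."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(payload: dict, key: bytes) -> str:
    """HMAC-SHA256 stands in for the receipt's actual signature scheme."""
    return hmac.new(key, canonical_json(payload), hashlib.sha256).hexdigest()


def verify(payload: dict, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(payload, key), signature)


key = b"demo-key"
receipt = {
    "outcome_hash": "abc123",
    "output_scan_flags": {"mode": "flag", "worst_severity": "critical"},
}
signature = sign(receipt, key)

# Downgrading the severity after signing breaks verification.
tampered = json.loads(json.dumps(receipt))
tampered["output_scan_flags"]["worst_severity"] = "info"
```

Because the flags sit inside the signed payload, editing them requires re-signing, which a party without the signing key cannot do.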
What gets blocked vs. logged
| Mode | Receipt minted? | Outcome hashed over |
|---|---|---|
| flag | Yes, with flags | Raw outcome |
| deny (threshold met) | No, 422 to caller | N/A |
| deny (threshold not met) | Yes, with flags | Raw outcome |
| redact | Yes, with flags | Cleaned outcome |
| (output filtering disabled) | Yes | Raw outcome, output_scan_flags = null |
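The table can be read as a small decision function. The sketch below assumes a three-level severity ordering (info, warning, critical, the only levels named on this page) and uses a hypothetical `notarize_effect` helper; `mode=None` stands for disabled filtering.

```python
from typing import Optional

# Assumed ordering; the service may define additional levels.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def notarize_effect(
    mode: Optional[str],
    worst_severity: Optional[str],
    deny_threshold: str = "critical",
) -> tuple[bool, Optional[str]]:
    """Return (receipt_minted, what_the_outcome_hash_covers)."""
    if (
        mode == "deny"
        and worst_severity is not None
        and SEVERITY_RANK[worst_severity] >= SEVERITY_RANK[deny_threshold]
    ):
        return False, None  # caller receives 422, nothing is hashed
    if mode == "redact":
        return True, "cleaned"
    # flag mode, deny below threshold, and disabled filtering all sign the raw outcome
    return True, "raw"
```

For example, `notarize_effect("deny", "warning")` mints a receipt over the raw outcome, while lowering the threshold to "info" turns the same scan result into a rejection.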
Gotchas
- Redact invalidates raw-outcome hash comparisons. If you have a separate pipeline that hashes the raw outcome and compares it to the receipt's outcome_hash, switch that pipeline to the cleaned outcome or move to flag mode.
- Deny threshold too low = agent failures. Set deny_severity_threshold = "info" and any random email address in the outcome fails notarize. Start at critical and walk down if you want tighter coverage.
- Prompt-injection patterns fire on the output too, sometimes intentionally, when the agent is summarizing a suspicious input. Use per-library enabling to exclude prompt_injection from outcome scans if that creates noise.
- Disabled != not configured. A fresh org has an empty column, which the service merges with defaults, so output filtering is ON by default in flag mode. Set enabled: false to truly opt out.
Feature flag
The whole endpoint group is gated by ENABLE_OUTPUT_FILTERING on
the backend. When that flag is off globally, the notarize hook
skips the scan, the /output-policies routes return 404, and
receipts carry output_scan_flags = null. Useful for self-hosted
deployments that don't need this feature.