
Custom Models

Register and manage your own model endpoints for use in consensus cases.

Overview

Custom models (BYOM — Bring Your Own Model) let you connect any OpenAI-compatible endpoint to Aira. Once registered, custom models can be used in cases and Ask Aira chat alongside built-in models.

For AWS Bedrock, Azure OpenAI, and Google Vertex AI, use Provider Credentials instead.

Endpoints must accept /v1/chat/completions requests. Models that support OpenAI function calling get full tool-calling support in Ask Aira chat — the same experience as built-in models. Models without tool calling fall back to context-stuffing mode, where context is inlined directly into the prompt instead.

See Supported Models → BYOM for verified models and setup instructions.

Self-hosted open-weight models have dedicated support with per-model system prompts and constrained decoding (vLLM guided_json); see Supported Self-Hosted Models below for the tested models.


Register Model

POST /api/v1/models/custom
Authorization: Bearer aira_live_xxxxx

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Display name (1-200 chars) |
| model_id | string | Yes | Unique model identifier (1-100 chars) |
| endpoint_url | string | Yes | HTTPS endpoint URL |
| auth_header | string | No | Authorization header value (e.g. Bearer sk-...) |
| request_schema | object | No | Request format (default: {"type": "openai_compatible"}) |
| response_schema | object | No | Response parsing (default: {"type": "openai_compatible", "content_path": "choices[0].message.content"}) |
| timeout_ms | integer | No | Request timeout in ms (1,000-120,000; default: 30,000) |
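The content_path in response_schema tells Aira where to find the model's reply inside the response body. As an illustration only — the actual parser is internal to Aira — a dotted path with array indices such as choices[0].message.content can be resolved against an OpenAI-style response like this:

```shell
# Illustration only: resolving a content_path expression such as
# "choices[0].message.content" against an OpenAI-style response body.
python3 - <<'EOF'
import json, re

response = json.loads('{"choices": [{"message": {"content": "APPROVE"}}]}')

def resolve(path, obj):
    # Walk the path: bare words are object keys, [N] is an array index.
    for part in re.findall(r'\[\d+\]|[^.\[\]]+', path):
        obj = obj[int(part[1:-1])] if part.startswith('[') else obj[part]
    return obj

print(resolve("choices[0].message.content", response))
EOF
```

For a standard OpenAI-compatible endpoint, the default content_path works unchanged; you only need a custom response_schema when your endpoint nests the reply elsewhere.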

Example Request

curl -X POST https://api.airaproof.com/api/v1/models/custom \
  -H "Authorization: Bearer aira_live_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Our Fine-tuned GPT",
    "model_id": "acme/compliance-v2",
    "endpoint_url": "https://api.acme.com/v1/chat/completions",
    "auth_header": "Bearer sk-acme-abc123...",
    "timeout_ms": 45000
  }'

Response (201 Created)

{
  "id": "cm_01J8X...",
  "name": "Our Fine-tuned GPT",
  "model_id": "acme/compliance-v2",
  "endpoint_url": "https://api.acme.com/v1/chat/completions",
  "has_auth": true,
  "request_schema": { "type": "openai_compatible" },
  "response_schema": { "type": "openai_compatible", "content_path": "choices[0].message.content" },
  "timeout_ms": 45000,
  "status": "untested",
  "last_tested": null,
  "created_at": "2026-03-14T10:30:00Z",
  "request_id": "req_01J8X..."
}

Response Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique model registration ID |
| has_auth | boolean | Whether an auth header is configured (never exposes the value) |
| status | string | untested, ok, or error |
| last_tested | string or null | ISO 8601 timestamp of last test |

List Models

GET /api/v1/models/custom
Authorization: Bearer aira_live_xxxxx

Returns all custom models registered by your organization.
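For example:

```shell
curl https://api.airaproof.com/api/v1/models/custom \
  -H "Authorization: Bearer aira_live_xxxxx"
```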


Get Model

GET /api/v1/models/custom/{id}
Authorization: Bearer aira_live_xxxxx

Returns a single custom model by ID.
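For example, using the registration ID returned at creation time:

```shell
curl https://api.airaproof.com/api/v1/models/custom/cm_01J8X... \
  -H "Authorization: Bearer aira_live_xxxxx"
```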


Update Model

PUT /api/v1/models/custom/{id}
Authorization: Bearer aira_live_xxxxx

Request Body

All fields are optional — only include fields you want to change.

| Field | Type | Description |
|---|---|---|
| name | string | Updated display name |
| endpoint_url | string | Updated HTTPS endpoint |
| auth_header | string | Updated auth header |
| request_schema | object | Updated request format |
| response_schema | object | Updated response parsing |
| timeout_ms | integer | Updated timeout (1,000-120,000) |

Example Request

curl -X PUT https://api.airaproof.com/api/v1/models/custom/cm_01J8X... \
  -H "Authorization: Bearer aira_live_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint_url": "https://api.acme.com/v2/chat/completions",
    "timeout_ms": 60000
  }'

Delete Model

DELETE /api/v1/models/custom/{id}
Authorization: Bearer aira_live_xxxxx

Removes the custom model. Returns 204 No Content.
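For example:

```shell
curl -X DELETE https://api.airaproof.com/api/v1/models/custom/cm_01J8X... \
  -H "Authorization: Bearer aira_live_xxxxx"
```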


Test Model

POST /api/v1/models/custom/{id}/test
Authorization: Bearer aira_live_xxxxx

Sends a test request to the model endpoint and validates the response format.
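For example:

```shell
curl -X POST https://api.airaproof.com/api/v1/models/custom/cm_01J8X.../test \
  -H "Authorization: Bearer aira_live_xxxxx"
```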

Response (200 OK)

{
  "status": "ok",
  "latency_ms": 1250,
  "response_valid": true,
  "structured_output": {
    "decision": "APPROVE",
    "confidence": 0.85,
    "key_factors": ["test factor"],
    "reasoning": "Test reasoning response"
  },
  "error": null,
  "request_id": "req_01J8X..."
}

If the endpoint is unreachable or returns invalid output:

{
  "status": "error",
  "latency_ms": 30000,
  "response_valid": false,
  "structured_output": null,
  "error": "Request timed out after 30 seconds",
  "request_id": "req_..."
}

Test Response Fields

| Field | Type | Description |
|---|---|---|
| status | string | ok if the endpoint responded correctly, error otherwise |
| latency_ms | integer | Round-trip time in milliseconds |
| response_valid | boolean | Whether the response matches the expected schema |
| structured_output | object or null | Parsed decision output (if valid) |
| error | string or null | Error message (if failed) |

Always test a model after registering or updating it. Models with error status can still be used in cases, but may cause individual model failures.
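The register-then-test steps can be scripted together. This is a sketch, not a definitive client: it assumes python3 is available locally for JSON parsing, and reuses the example payload from above.

```shell
# Register a model and capture the returned registration id.
REG_ID=$(curl -s -X POST https://api.airaproof.com/api/v1/models/custom \
  -H "Authorization: Bearer aira_live_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Our Fine-tuned GPT",
    "model_id": "acme/compliance-v2",
    "endpoint_url": "https://api.acme.com/v1/chat/completions"
  }' | python3 -c 'import json, sys; print(json.load(sys.stdin)["id"])')

# Immediately test it, and fail the script if status is not "ok".
curl -s -X POST "https://api.airaproof.com/api/v1/models/custom/$REG_ID/test" \
  -H "Authorization: Bearer aira_live_xxxxx" \
  | python3 -c 'import json, sys; assert json.load(sys.stdin)["status"] == "ok"'
```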


Supported Self-Hosted Models

The following open-weight models have been tested and have dedicated per-model system prompts optimized for Aira's structured output format:

| Model | Parameters | License |
|---|---|---|
| Llama 3.3 70B | 70B | Llama Community License |
| Llama 4 Scout | 17B active (MoE) | Llama Community License |
| Llama 4 Maverick | 17B active (MoE) | Llama Community License |
| Mistral Large 3 | 41B active (MoE) | Apache 2.0 |
| Qwen 3.5 | 397B (MoE) | Apache 2.0 |
| DeepSeek V3.2 | 37B active (MoE) | MIT |

Other OpenAI-compatible models can still be registered as custom models but will use generic prompting.


vLLM Setup Example

vLLM is the recommended serving engine. It supports guided_json constrained decoding, which guarantees valid structured output from any model.

# Serve a model with vLLM
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice

Then register it:

curl -X POST https://api.airaproof.com/api/v1/models/custom \
  -H "Authorization: Bearer aira_live_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Llama 4 Scout (vLLM)",
    "model_id": "llama-4-scout",
    "endpoint_url": "https://your-gpu-server:8000/v1/chat/completions",
    "timeout_ms": 60000
  }'

When using constrained decoding, ensure your vLLM version supports guided_json (v0.6.0+). Older versions may fall back to unconstrained generation.
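To confirm constrained decoding works before registering the endpoint, you can send guided_json directly to your vLLM server. guided_json is a vLLM extension parameter to the OpenAI chat completions API; the schema below is illustrative only, not Aira's internal output schema.

```shell
# Sanity-check guided_json against the vLLM endpoint directly.
curl https://your-gpu-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Approve or reject this request?"}],
    "guided_json": {
      "type": "object",
      "properties": {"decision": {"type": "string"}},
      "required": ["decision"]
    }
  }'
```

If the returned message content is valid JSON matching the schema, constrained decoding is active.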
