The AimRank Standard
Anyone can demo AI. We prove it.
Individual AI is incredible, until a business depends on it. Then it breaks on reliability, calibration, drift, audit, and scale. The AimRank Standard is the layer that turns a clever demo into a system you can trust. Every solution we ship meets four guarantees, or we tell you plainly that it can't.
The Four Guarantees™
Measured value
Every system cuts cost or grows revenue, and we prove the number, before launch and after.
A conservative ROI case is agreed up front, then the build runs against a versioned eval suite scored by an independent judge (itself validated against human labels), with task, trajectory and groundedness checks. It ships only if the suite passes, and after deployment, delivered value is tracked against that agreed number. Where a classical model is in the loop, we add probability calibration checks (Brier score, ECE) and fairness checks on top.
What you get as proof: The eval report with the pass-or-fail gate, plus the value report against the agreed number.
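To make the calibration check concrete, here is a minimal sketch of the two metrics named above for a binary classifier: the Brier score and expected calibration error (ECE). The function names, the equal-width binning and the sample data are illustrative assumptions, not the AimRank implementation.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average, over equal-width probability bins, of
    |observed positive rate - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(rate - mean_p)
    return ece


# Illustrative data only: predicted probabilities and observed outcomes.
probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
print("Brier:", brier_score(probs, labels))
print("ECE:  ", expected_calibration_error(probs, labels))
```

Lower is better for both. Any pass threshold for the gate would be agreed per build; none is implied here.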
Defensible
Every decision can be defended to a client, a board, or a regulator, with the paperwork already in hand.
Each decision carries its reasoning, the full tool-call trace (what it did, with what inputs and outputs) and the sources it relied on, recorded as it runs so a human can review, approve or override. A classical model in the loop adds per-decision SHAP reason codes. Annex IV documentation and a tamper-evident, hash-chained audit log (Article 12) are generated from the run itself. The reviewable trace and the human approve-or-override step are what satisfy EU AI Act Articles 13 (transparency) and 14 (human oversight).
What you get as proof: The decision trace and citations for any single decision, plus the Annex IV dossier and the replayable audit chain.
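A hash-chained log is tamper-evident because each entry's hash covers the previous entry's hash, so editing any earlier record breaks every later link. The sketch below shows the idea in plain Python. The record fields, the all-zero genesis value and the function names are assumptions for illustration, not the production log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(record: dict, prev_hash: str) -> str:
    """SHA-256 over the record plus the previous entry's hash."""
    payload = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> None:
    """Append a record, linking it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash, "hash": _digest(record, prev_hash)})


def verify_chain(log: list) -> bool:
    """Replay the chain from the start; any edited entry breaks a link."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["record"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"decision": "approve", "tool": "predict", "inputs": {"id": 1}})
append_entry(log, {"decision": "escalate", "tool": "explain", "inputs": {"id": 2}})
print("intact chain verifies:", verify_chain(log))

log[0]["record"]["decision"] = "reject"  # simulate tampering with history
print("tampered chain verifies:", verify_chain(log))
```

A real deployment would also need to protect the latest hash itself (for example by anchoring it outside the system), which this sketch leaves out.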
Self-correcting
It never fails silently: at worst it tells you the moment it's slipping, and wherever there's a feedback signal it gets better the longer it runs.
Watched by default (the floor): every output carries calibrated confidence, and the system escalates when unsure instead of bluffing. We re-run the eval suite on a schedule and on every model upgrade, watch behaviour (refusal rate, tool mix, cost) and incoming topics, and pin the model version so a provider update can't silently shift it. A classical model in the loop adds input-drift monitoring (PSI / KS).
Closing the loop (the escalator): human corrections, outcomes and signals feed a living eval set, and a challenger version is promoted only if it beats the incumbent on the same frozen gate, with every promotion human-approved, versioned and reversible. A build with no feedback signal still clears the floor; one with a signal compounds instead of decaying.
What you get as proof: The live drift dashboard, the eval-regression history and alert log, plus the versioned eval set and the champion-or-challenger promotion record.
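For the input-drift check, the Population Stability Index (PSI) compares how a feature is distributed in a reference window with how it is distributed now. Below is a minimal sketch for one numeric feature. The equal-width bins taken from the reference range, the small floor for empty bins and the sample data are assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (cur_share - ref_share) * ln(cur_share / ref_share)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Illustrative data only: a reference window and a shifted current window.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]
print("no drift:", psi(reference, reference))
print("shifted: ", psi(reference, shifted))
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift. That figure is a convention, not a threshold taken from this page; the alert level for a given build is a per-feature decision.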
Yours & everywhere
It runs in any cloud or on-prem, your agents can call it, and you own it. No lock-in.
The whole system ships as one container, deployed with Terraform and Kubernetes in your own region, so your data never leaves. Every build exposes its capability as MCP tools (predict, explain, evidence, healthcheck) with tiered access, so it's a composable tool in your agent toolbelt. Autonomy is consequence-gated, with human-in-the-loop and a kill-switch.
What you get as proof: The repo, the Terraform, and a live MCP endpoint your agents can call: all yours to keep.
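As a rough picture of what tiered access over the four tools could look like, the sketch below routes calls through a single permission check. It is plain Python, not an MCP SDK: the tier numbers, the stub handlers and the `call_tool` function are hypothetical and stand in for whatever the real server does.

```python
from typing import Callable, Dict

# Hypothetical tiers: a caller needs at least this tier to use the tool.
TOOL_MIN_TIER: Dict[str, int] = {
    "healthcheck": 0,  # anyone may check liveness
    "predict": 1,
    "explain": 2,
    "evidence": 3,     # audit material is the most restricted
}


def _stub(name: str) -> Callable[[dict], dict]:
    """Placeholder handler that just echoes its input."""
    def handler(payload: dict) -> dict:
        return {"tool": name, "echo": payload}
    return handler


HANDLERS: Dict[str, Callable[[dict], dict]] = {name: _stub(name) for name in TOOL_MIN_TIER}


def call_tool(name: str, payload: dict, caller_tier: int) -> dict:
    """Dispatch a tool call only if the caller's tier is high enough."""
    if name not in TOOL_MIN_TIER:
        raise KeyError(f"unknown tool: {name}")
    if caller_tier < TOOL_MIN_TIER[name]:
        raise PermissionError(f"tier {caller_tier} may not call {name}")
    return HANDLERS[name](payload)


print(call_tool("predict", {"customer_id": 42}, caller_tier=1))
try:
    call_tool("evidence", {"decision_id": 7}, caller_tier=1)
except PermissionError as err:
    print("denied:", err)
```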
How a build actually runs
Six steps, with a human checkpoint at each one.
- 1 Scope & risk-class
Define the buyer, the wedge, and the risk class. Legal sign-off where it carries weight.
- 2 Data intake
We point at your data in your own store. No personal data leaves your infrastructure.
- 3 Model, calibration & bias audit
Baseline first (a classical model often beats a deep one), then validate calibration and review disparate impact.
- 4 Drift baseline & feedback loop
A real reference window, a nightly drift job and alerting before launch, plus feedback capture wired into a living eval set and champion-or-challenger promotion.
- 5 Evidence & dossier
Complete the validation docs and verify the audit chain.
- 6 Deploy & handover
Cloud-agnostic deploy, runbooks, and training so you own the system.
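The champion-or-challenger promotion in step 4 can be pictured as a small gate: score both versions on the same frozen cases, and promote only if the challenger wins and a human has approved. The sketch below assumes exact-match accuracy as the score and uses hypothetical names throughout; a real eval suite would use richer scoring.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

EvalCase = Tuple[str, str]  # (input, expected output)
Model = Callable[[str], str]


@dataclass(frozen=True)
class Decision:
    promote: bool
    champion_score: float
    challenger_score: float
    reason: str


def accuracy(model: Model, cases: Sequence[EvalCase]) -> float:
    """Share of frozen cases the model answers exactly right."""
    return sum(model(x) == y for x, y in cases) / len(cases)


def promotion_decision(
    champion: Model,
    challenger: Model,
    frozen_cases: Sequence[EvalCase],
    human_approved: bool,
    min_margin: float = 0.0,  # assumed knob: required lead over the champion
) -> Decision:
    champ = accuracy(champion, frozen_cases)
    chall = accuracy(challenger, frozen_cases)
    if chall <= champ + min_margin:
        return Decision(False, champ, chall, "challenger did not beat champion")
    if not human_approved:
        return Decision(False, champ, chall, "awaiting human approval")
    return Decision(True, champ, chall, "promoted")


# Illustrative models and cases only.
cases = [("a", "A"), ("b", "B"), ("c", "C")]
champion = lambda text: "A"   # right on one case out of three
challenger = str.upper        # right on all three
print(promotion_decision(champion, challenger, cases, human_approved=False))
print(promotion_decision(champion, challenger, cases, human_approved=True))
```

Keeping the decision as a returned record, rather than swapping models in place, is what makes each promotion easy to version and reverse.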
For agents: Agent Assurance
The four guarantees secure each tool an agent calls. When agents start to act, we lift assurance to the decision and action level: we evaluate the decision and the whole trajectory (not one output), watch for behavioral drift, trace every action into a tamper-evident log you can replay, and gate autonomy with guardrails, human-in-the-loop, and a kill-switch. Autonomy without assurance is recklessness; agents make assurance more valuable, not less.
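One way to read "gate autonomy with guardrails, human-in-the-loop, and a kill-switch" is as a router that decides, per action, whether the agent may act alone. The sketch below is a minimal illustration under assumed names: the action list, the consequence levels and the `Gate` class are hypothetical.

```python
from enum import Enum


class Route(Enum):
    EXECUTE = "execute"          # agent may act on its own
    NEEDS_HUMAN = "needs_human"  # queue for human approval
    BLOCKED = "blocked"          # kill-switch engaged


# Hypothetical consequence levels: higher means harder to undo.
CONSEQUENCE = {"read_record": 0, "send_email": 1, "issue_refund": 2}


class Gate:
    def __init__(self, ceiling: int = 0) -> None:
        self.ceiling = ceiling  # highest level the agent may run unattended
        self.killed = False

    def kill(self) -> None:
        """Engage the kill-switch: nothing runs until it is reset."""
        self.killed = True

    def route(self, action: str) -> Route:
        if self.killed:
            return Route.BLOCKED
        level = CONSEQUENCE.get(action)
        if level is None:
            return Route.NEEDS_HUMAN  # unknown actions never run unattended
        return Route.EXECUTE if level <= self.ceiling else Route.NEEDS_HUMAN


gate = Gate(ceiling=0)
print(gate.route("read_record"))
print(gate.route("issue_refund"))
gate.kill()
print(gate.route("read_record"))
```

Defaulting unknown actions to human review is a deliberate choice in this sketch: the gate fails closed rather than open.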
Built for agents, not just screens
Every system ships as an MCP server (predict, explain, evidence, healthcheck), so your agents can use it directly, not just your people.
This page describes the methodology and the evidence each step produces. The implementations live inside the AimRank solutions we build for you.