Confidence score

A scalar that estimates how reliable a model's output is for a given input.

Confidence scores can be derived from token log-probabilities (logprobs), calibrated classifiers, ensemble agreement, or grounding strength. They drive routing decisions: high-confidence outputs flow through, and low-confidence outputs route to the reviewer queue. Calibrating the scores against real outcomes is part of the evaluation harness lifecycle.
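As a rough illustration of the logprob route, the sketch below turns per-token log-probabilities into a single score by taking the geometric mean of the token probabilities. The function name `sequence_confidence` and the choice of geometric mean are assumptions made for this example, not a prescribed method; the raw number is uncalibrated and would still need checking against real outcomes before it drives routing.

```python
import math


def sequence_confidence(token_logprobs: list[float]) -> float:
    """Return the geometric-mean token probability, a value in (0, 1].

    Hypothetical helper: averaging in log space keeps long outputs from
    being penalised purely for their length.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)


# Illustrative values only: three tokens the model was fairly sure about.
score = sequence_confidence([math.log(0.95), math.log(0.90), math.log(0.99)])
print(f"confidence: {score:.3f}")
```

Taking the minimum token probability instead of the mean is a stricter alternative when a single uncertain token (a digit in an amount, for instance) can invalidate the whole output.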

When it matters

Whenever a wrong model output has real consequences. The score drives the routing logic: high-confidence outputs auto-execute, mid-confidence outputs go to a reviewer queue, and low-confidence outputs trigger escalation or refusal.

Real example

A claims-extraction model returns a structured payload plus a confidence score between 0 and 1 for each field. Fields scoring below 0.8 surface in the reviewer queue alongside the original document and the model's reasoning; fields scoring above 0.9 auto-write to the claims system.
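The per-field routing in this example could be sketched as follows. The two thresholds come from the text; everything else (`ExtractedField`, `route_field`, `route_payload`, the sample field names and values) is hypothetical. The example does not say what happens to scores from 0.8 to 0.9, so the sketch labels that band `undecided` rather than guessing a policy.

```python
from dataclasses import dataclass

REVIEW_BELOW = 0.8       # from the example: below this, send to the reviewer queue
AUTO_WRITE_ABOVE = 0.9   # from the example: above this, write to the claims system


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # expected to lie in [0, 1]


def route_field(field: ExtractedField) -> str:
    """Return the routing label for a single extracted field."""
    if not 0.0 <= field.confidence <= 1.0:
        raise ValueError(f"confidence out of range for {field.name!r}")
    if field.confidence > AUTO_WRITE_ABOVE:
        return "auto_write"
    if field.confidence < REVIEW_BELOW:
        return "review"
    # Assumption: the 0.8 to 0.9 band is not covered by the example,
    # so it is surfaced under its own label instead of a guessed route.
    return "undecided"


def route_payload(fields: list[ExtractedField]) -> dict[str, list[str]]:
    """Group field names by routing label."""
    routed: dict[str, list[str]] = {"auto_write": [], "review": [], "undecided": []}
    for field in fields:
        routed[route_field(field)].append(field.name)
    return routed


# Made-up payload for illustration.
payload = [
    ExtractedField("policy_number", "PN-1001", 0.97),
    ExtractedField("loss_date", "2024-03-02", 0.85),
    ExtractedField("claim_amount", "1200.00", 0.62),
]
routed = route_payload(payload)
print(routed)
```

In practice the middle band would need an explicit decision; sending it to review is the conservative choice.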

KPIs to watch

Confidence calibration (Brier score below 0.15; lower is better), reviewer queue rate (10-25% is optimal), auto-approval accuracy (above 99% on the high-confidence path).
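A minimal sketch of how these three KPIs might be computed from logged predictions, assuming each record pairs a confidence score with a 0/1 outcome (1 meaning a reviewer or downstream check confirmed the output was correct). The function names, the single auto-approve threshold, and the sample data are all assumptions for illustration.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between confidence and the 0/1 outcome (lower is better)."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def reviewer_queue_rate(confidences: list[float], auto_threshold: float) -> float:
    """Share of outputs that did not clear the auto-approve threshold."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    queued = sum(1 for c in confidences if c <= auto_threshold)
    return queued / len(confidences)


def auto_approval_accuracy(
    confidences: list[float], outcomes: list[int], auto_threshold: float
) -> float:
    """Accuracy restricted to outputs that were auto-approved."""
    approved = [o for c, o in zip(confidences, outcomes) if c > auto_threshold]
    if not approved:
        raise ValueError("no outputs cleared the auto-approve threshold")
    return sum(approved) / len(approved)


# Made-up log of five predictions: confidence and whether it was correct.
confidences = [0.95, 0.92, 0.85, 0.60, 0.97]
outcomes = [1, 1, 1, 0, 1]

brier = brier_score(confidences, outcomes)
queue_rate = reviewer_queue_rate(confidences, auto_threshold=0.9)
auto_accuracy = auto_approval_accuracy(confidences, outcomes, auto_threshold=0.9)
print(f"Brier: {brier:.4f}  queue rate: {queue_rate:.0%}  auto accuracy: {auto_accuracy:.0%}")
```

Five records are far too few to judge calibration; the sample only shows the arithmetic. The same numbers would normally be computed over the labelled test set and over production logs.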

Related terms

Reviewer queue

A workflow where low-confidence or high-impact AI outputs route to a human for approval.

Evaluation harness

A test framework that scores model or prompt output against a labelled set of expected outputs.

Prompt versioning

Treating prompts as code: stored, diffed, reviewed, and rolled back like any production artifact.

Labelled test set

A frozen, hand-curated set of real input examples with expected outputs, used to score model behavior.
