Defined term
Confidence score
A scalar that estimates how reliable a model's output is for a given input.
Confidence scores come from token log probabilities (logprobs), calibrated classifiers, ensemble agreement, or grounding strength. They drive routing decisions: high-confidence outputs flow through, while low-confidence outputs route to the reviewer queue. Calibrating confidence scores against real outcomes is part of the evaluation harness lifecycle.
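As a rough illustration of two of the sources named above, the sketch below derives a score from token logprobs and from ensemble agreement. The function names and the choice of a geometric-mean token probability are assumptions made for this example, not a prescribed method, and raw scores like these still need calibrating against real outcomes before they drive routing.

```python
import math
from collections import Counter


def confidence_from_logprobs(token_logprobs):
    """Geometric-mean token probability: exp(mean of the token logprobs).

    Hypothetical helper; other aggregations (min, product) are equally valid.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_from_agreement(answers):
    """Return the majority answer and the fraction of ensemble members that gave it."""
    if not answers:
        raise ValueError("need at least one answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)
```

Both helpers return a value between 0 and 1, so either can feed the same routing thresholds.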
When it matters
Whenever a wrong model output has real consequences. The score drives the routing logic: high-confidence outputs auto-execute, mid-confidence outputs go to a reviewer queue, and low-confidence outputs trigger escalation or refusal.
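The three-tier routing described above can be sketched as a small function. The threshold values (0.9 and 0.5) and the names Route and route are placeholders chosen for this example; real cut-offs would come from calibration against labelled outcomes.

```python
from enum import Enum


class Route(Enum):
    AUTO_EXECUTE = "auto_execute"
    REVIEWER_QUEUE = "reviewer_queue"
    ESCALATE = "escalate"


def route(confidence, high=0.9, low=0.5):
    """Map a 0-1 confidence score to one of three paths.

    `high` and `low` are illustrative defaults, not recommended values.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return Route.AUTO_EXECUTE
    if confidence >= low:
        return Route.REVIEWER_QUEUE
    return Route.ESCALATE  # or refuse, depending on the workflow
```

Keeping the thresholds as parameters makes it easy to retune them when calibration data changes.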
Real example
A claims-extraction model returns a structured payload plus a confidence score between 0 and 1 for each field. Fields below 0.8 surface in the reviewer queue with the original document and the model's reasoning; fields above 0.9 auto-write to the claims system.
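A minimal sketch of that per-field triage follows, using the 0.8 and 0.9 thresholds from the example. The ExtractedField type, the triage function, and the sample field names are hypothetical. The example does not say what happens to scores from 0.8 to 0.9, so the sketch places them in a separate "unspecified" bucket rather than guessing.

```python
from dataclasses import dataclass

REVIEW_BELOW = 0.8      # from the example: below this goes to reviewers
AUTO_WRITE_ABOVE = 0.9  # from the example: above this auto-writes


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float
    reasoning: str  # shown to the reviewer alongside the source document


def triage(fields):
    """Group field names by the path their confidence score selects."""
    buckets = {"auto_write": [], "review": [], "unspecified": []}
    for field in fields:
        if field.confidence > AUTO_WRITE_ABOVE:
            buckets["auto_write"].append(field.name)
        elif field.confidence < REVIEW_BELOW:
            buckets["review"].append(field.name)
        else:
            # 0.8 to 0.9 inclusive: behaviour not defined in the example
            buckets["unspecified"].append(field.name)
    return buckets


payload = [
    ExtractedField("policy_number", "PN-1042", 0.97, "matched header block"),
    ExtractedField("loss_date", "2024-03-02", 0.62, "two candidate dates found"),
    ExtractedField("claim_amount", "1800.00", 0.85, "amount appears once"),
]
result = triage(payload)
```

In a real pipeline the middle band would need an explicit policy, most likely the reviewer queue.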
KPIs to watch
Confidence calibration (Brier score <0.15), reviewer queue rate (10-25% optimal), auto-approval accuracy (>99% on high-confidence path).
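The three KPIs can be computed from a log of confidence scores and observed outcomes (1 for correct, 0 for incorrect). This is a sketch under simplifying assumptions: the function names are illustrative, the 0.9 auto-approval threshold is a placeholder, and everything below that threshold is counted as going to reviewers.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between predicted confidence and the 0/1 outcome."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def reviewer_queue_rate(confidences, auto_threshold=0.9):
    """Share of outputs that fall below the auto-approval threshold."""
    if not confidences:
        raise ValueError("need at least one score")
    return sum(c < auto_threshold for c in confidences) / len(confidences)


def auto_approval_accuracy(confidences, outcomes, auto_threshold=0.9):
    """Accuracy on the auto-approved path, or None if nothing was auto-approved."""
    auto = [o for c, o in zip(confidences, outcomes) if c >= auto_threshold]
    return sum(auto) / len(auto) if auto else None


# Made-up sample data for illustration only.
scores = [0.95, 0.6, 0.92, 0.3]
correct = [1, 1, 1, 0]
kpis = (
    brier_score(scores, correct),
    reviewer_queue_rate(scores),
    auto_approval_accuracy(scores, correct),
)
```

A lower Brier score means better calibration; tracking all three together shows whether a threshold change trades reviewer load for accuracy.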
Related terms
Reviewer queue
A workflow where low-confidence or high-impact AI outputs route to a human for approval.
Evaluation harness
A test framework that scores model or prompt output against a labelled set of expected outputs.
Prompt versioning
Treating prompts as code: stored, diffed, reviewed, and rolled back like any production artifact.
Labelled test set
A frozen, hand-curated set of real input examples with expected outputs, used to score model behavior.