
Defined term

Confidence score

A scalar that estimates how reliable a model's output is for a given input.

Confidence scores can be derived from token log-probabilities (logprobs), calibrated classifiers, ensemble agreement, or grounding strength. They drive routing decisions: high-confidence outputs flow through, and low-confidence outputs route to the reviewer queue. Calibrating confidence scores against real outcomes is part of the evaluation harness lifecycle.
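As a rough illustration of two of the sources listed above, the sketch below derives a score from token logprobs (as the geometric-mean token probability) and from ensemble agreement (as the share of members that voted for the most common answer). The function names and the choice of aggregation are assumptions made for illustration, not a prescribed method, and raw scores like these usually still need calibrating against real outcomes before they are used for routing.

```python
import math
from collections import Counter


def logprob_confidence(token_logprobs):
    """Geometric-mean token probability: exp(mean logprob).

    Assumes natural-log probabilities (each <= 0), so the result is in (0, 1].
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def ensemble_agreement(answers):
    """Fraction of ensemble members that returned the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)
```

For example, two tokens that each have probability 0.5 give a logprob-based score of 0.5, and an ensemble answering "a", "a", "b" gives an agreement score of two thirds.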

When it matters

When a wrong output has real consequences. The score drives the routing logic: high-confidence outputs auto-execute, mid-confidence outputs go to a reviewer queue, and low-confidence outputs trigger escalation or refusal.
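The three-tier routing described here can be sketched as a single function. The `Route` names and the default thresholds (0.9 and 0.5) are placeholder assumptions; in practice the cut-offs would be tuned against observed outcomes.

```python
from enum import Enum


class Route(Enum):
    AUTO_EXECUTE = "auto_execute"  # high confidence
    REVIEW = "review"              # mid confidence: reviewer queue
    ESCALATE = "escalate"          # low confidence: escalation or refusal


def route(confidence, high=0.9, low=0.5):
    """Map a 0-1 confidence score to a route. Thresholds are illustrative."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return Route.AUTO_EXECUTE
    if confidence >= low:
        return Route.REVIEW
    return Route.ESCALATE
```

With these placeholder thresholds, `route(0.95)` auto-executes, `route(0.7)` goes to review, and `route(0.2)` escalates.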

Real example

A claims-extraction model returns a structured payload plus a confidence score between 0 and 1 for each field. Fields scoring below 0.8 surface in the reviewer queue alongside the original document and the model's reasoning; fields scoring above 0.9 are written automatically to the claims system.
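A minimal sketch of this per-field split follows. The payload shape (`value` and `confidence` keys) and the field names are hypothetical. The text does not say what happens to fields scoring from 0.8 to 0.9, so the sketch assumes they also go to review and tags them with a separate reason so that the assumption is easy to change.

```python
AUTO_WRITE_ABOVE = 0.9  # from the example: above this, auto-write
REVIEW_BELOW = 0.8      # from the example: below this, reviewer queue


def split_fields(payload):
    """Split extracted fields into auto-write values and review items."""
    auto_write, review = {}, {}
    for name, field in payload.items():
        conf = field["confidence"]
        if conf > AUTO_WRITE_ABOVE:
            auto_write[name] = field["value"]
        elif conf < REVIEW_BELOW:
            review[name] = {**field, "reason": "below_review_threshold"}
        else:
            # ASSUMPTION: the 0.8-0.9 band is not specified in the text;
            # sending it to review is the conservative choice made here.
            review[name] = {**field, "reason": "unspecified_band"}
    return auto_write, review


# Hypothetical payload for illustration only.
payload = {
    "claim_id": {"value": "C-1", "confidence": 0.97},
    "loss_date": {"value": "2024-01-05", "confidence": 0.85},
    "amount": {"value": "1200.00", "confidence": 0.42},
}
auto_write, review = split_fields(payload)
```

Here only `claim_id` would be written automatically; `loss_date` and `amount` land in the review set with different reasons attached.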

KPIs to watch

Confidence calibration (Brier score below 0.15; lower is better), reviewer queue rate (optimal at 10-25%), auto-approval accuracy (above 99% on the high-confidence path).
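These three KPIs can be computed from logged confidence scores and reviewed outcomes. The sketch below assumes each outcome is recorded as 1 (output was correct) or 0 (incorrect), and that anything scoring above a single `auto_threshold` is auto-approved while everything else is queued for review; the function names and that simplification are assumptions for illustration.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between confidence and the 0/1 outcome (lower is better)."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - y) ** 2 for p, y in zip(confidences, outcomes))
    return total / len(confidences)


def routing_kpis(confidences, outcomes, auto_threshold=0.9):
    """Return Brier score, reviewer queue rate, and auto-approval accuracy."""
    auto = [y for p, y in zip(confidences, outcomes) if p > auto_threshold]
    n = len(confidences)
    return {
        "brier_score": brier_score(confidences, outcomes),
        "reviewer_queue_rate": (n - len(auto)) / n,
        # None when nothing was auto-approved, to avoid dividing by zero.
        "auto_approval_accuracy": sum(auto) / len(auto) if auto else None,
    }
```

A Brier score of 0 means every score matched its outcome exactly, while always answering 0.5 yields 0.25, which gives some context for a target like 0.15.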
