Defined term

Evaluation harness

A test framework that scores model or prompt output against a labelled set of expected outputs.

An evaluation harness runs candidate models or prompts against a curated set of real examples with known correct answers (or human-rated answers) and reports accuracy, recall, format compliance, latency, and cost. The harness is the production gate: nothing reaches production traffic without passing a defined threshold. Without an eval harness, AI workflows degrade silently when prompts, models, or data shift.
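The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `Case`, `Report`, `run_harness`, and `stub_candidate` are hypothetical, the candidate is assumed to be a plain function from prompt to output, and exact string match stands in for whatever scoring a real harness would use. Only accuracy and latency are measured here; recall, format compliance, and cost would be added in the same way.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str    # input sent to the candidate
    expected: str  # known correct (or human-rated) answer


@dataclass
class Report:
    accuracy: float        # fraction of cases answered correctly
    mean_latency_s: float  # average seconds per case
    passed: bool           # True if accuracy meets the gate threshold


def run_harness(candidate: Callable[[str], str],
                cases: list[Case],
                threshold: float) -> Report:
    """Score a candidate on labelled cases and apply the gate."""
    if not cases:
        raise ValueError("test set is empty")
    correct = 0
    total_latency = 0.0
    for case in cases:
        start = time.perf_counter()
        output = candidate(case.prompt)
        total_latency += time.perf_counter() - start
        # Exact match after trimming whitespace (an assumption;
        # real harnesses often use fuzzier or rubric-based scoring).
        if output.strip() == case.expected.strip():
            correct += 1
    accuracy = correct / len(cases)
    return Report(accuracy, total_latency / len(cases), accuracy >= threshold)


# Usage with a stub candidate that gets three of four cases right.
cases = [
    Case("2+2", "4"),
    Case("capital of France", "Paris"),
    Case("3*3", "9"),
    Case("opposite of hot", "cold"),
]
known = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}


def stub_candidate(prompt: str) -> str:
    return known.get(prompt, "unknown")


report = run_harness(stub_candidate, cases, threshold=0.9)
# report.accuracy is 0.75, so report.passed is False and promotion is blocked.
```

Returning a structured report rather than a bare boolean keeps the gate decision and the underlying numbers together, which makes a blocked promotion easier to diagnose.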

When it matters

Whenever you cannot afford to ship a prompt change blind. Required for any AI workflow that touches money, customers, or compliance; optional but valuable for internal-only tooling.

Real example

A claims-extraction harness scores 500 labelled cases on accuracy, hallucination rate, and policy-citation precision before any prompt is promoted to production. It runs in under 4 minutes and blocks promotion below 92% accuracy.

KPIs to watch

Test set size (more than 200 cases), eval pass-rate gate (typically 90-95%), eval runtime (under 10 minutes for daily use).
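As a rough sketch, these three KPIs can be checked automatically after each eval run. The names `EvalRun` and `kpi_violations` are hypothetical, and the defaults are assumptions read off the figures above: more than 200 cases, a pass rate at the lower end of the typical gate (90%), and a runtime under 10 minutes.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    num_cases: int    # size of the labelled test set
    num_passed: int   # cases the candidate got right
    runtime_s: float  # wall-clock time for the whole run, in seconds


def kpi_violations(run: EvalRun,
                   min_cases: int = 200,
                   min_pass_rate: float = 0.90,
                   max_runtime_s: float = 600.0) -> list[str]:
    """Return a human-readable list of KPI violations (empty if none)."""
    problems = []
    # The test set must contain more than min_cases cases.
    if run.num_cases <= min_cases:
        problems.append(f"test set too small: {run.num_cases} cases")
    pass_rate = run.num_passed / run.num_cases if run.num_cases else 0.0
    if pass_rate < min_pass_rate:
        problems.append(f"pass rate below gate: {pass_rate:.1%}")
    # The run must finish in under max_runtime_s seconds.
    if run.runtime_s >= max_runtime_s:
        problems.append(f"run too slow: {run.runtime_s:.0f}s")
    return problems


# Illustrative numbers only: 500 cases, 465 passed, 230 seconds,
# checked against a 92% gate.
healthy = kpi_violations(EvalRun(500, 465, 230.0), min_pass_rate=0.92)

# A run that misses all three KPIs under the default thresholds.
unhealthy = kpi_violations(EvalRun(150, 120, 700.0))
```

An empty list means the run is within all three limits; anything else can be surfaced in a CI log or dashboard alongside the promotion decision.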
