Evaluation harness

A test framework that scores model or prompt output against a labelled set of expected outputs.

An evaluation harness runs candidate models or prompts against a curated set of real examples with known correct answers (or human-rated answers) and reports accuracy, recall, format compliance, latency, and cost. The harness is the production gate: nothing reaches production traffic without passing a defined threshold. Without an eval harness, AI workflows degrade silently when prompts, models, or data shift.
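As a rough illustration of the loop described above, here is a minimal sketch in Python. It is not a reference implementation: the names `Case`, `Report`, `run_eval`, and `toy_candidate` are hypothetical, exact string match stands in for whatever scoring rule a real harness would use, and cost and recall tracking are left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One labelled example: an input and its known correct answer."""
    input_text: str
    expected: str


@dataclass
class Report:
    total: int
    correct: int
    well_formed: int
    total_seconds: float

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0

    @property
    def format_compliance(self) -> float:
        return self.well_formed / self.total if self.total else 0.0


def run_eval(
    candidate: Callable[[str], str],
    cases: list[Case],
    is_well_formed: Callable[[str], bool],
) -> Report:
    """Run the candidate on every case and tally the results."""
    correct = 0
    well_formed = 0
    start = time.perf_counter()
    for case in cases:
        output = candidate(case.input_text)
        if is_well_formed(output):
            well_formed += 1
        # Assumption: exact match after trimming whitespace counts as correct.
        if output.strip() == case.expected.strip():
            correct += 1
    elapsed = time.perf_counter() - start
    return Report(len(cases), correct, well_formed, elapsed)


# Toy stand-in for a model or prompt call (hypothetical).
def toy_candidate(text: str) -> str:
    return {"2+2": "4", "3+3": "6"}.get(text, "unknown")


cases = [
    Case("2+2", "4"),
    Case("3+3", "6"),
    Case("capital of France", "Paris"),
]
report = run_eval(toy_candidate, cases, is_well_formed=lambda out: out != "")
print(f"accuracy={report.accuracy:.2f} format={report.format_compliance:.2f}")
```

In practice the candidate would wrap a model call, and the report would feed a threshold check before anything is promoted.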

When it matters

When you cannot afford to ship a prompt change blind. Required for any AI workflow that touches money, customers, or compliance. Optional but valuable for internal-only tooling.

Real example

A claims-extraction harness that scores 500 labelled cases on accuracy, hallucination rate, and policy-citation precision before any prompt is promoted to production. It runs in under 4 minutes and blocks promotion below 92% accuracy.
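The accuracy gate in this example can be sketched as a small check. This is an assumption-laden illustration, not the harness described: `GateResult` and `promotion_gate` are hypothetical names, and only the 92% accuracy threshold comes from the example; hallucination rate and citation precision would need their own thresholds, which are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    accuracy: float
    passed: bool


def promotion_gate(correct: int, total: int, min_accuracy_pct: int = 92) -> GateResult:
    """Decide whether a prompt may be promoted, based on accuracy alone."""
    if total <= 0:
        raise ValueError("test set is empty")
    # Integer comparison avoids float rounding right at the threshold.
    passed = correct * 100 >= min_accuracy_pct * total
    return GateResult(accuracy=correct / total, passed=passed)


# With 500 cases, 460 correct is exactly 92% and passes; 459 does not.
at_threshold = promotion_gate(460, 500)
just_below = promotion_gate(459, 500)
print(at_threshold.passed, just_below.passed)
```

A CI job could exit with a non-zero status whenever `passed` is false, which is one simple way to block the promotion step.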

KPIs to watch

Test set size (more than 200 cases), eval pass-rate gate (typically 90-95%), eval runtime (under 10 minutes for daily use).

Related terms

Labelled test set

A frozen, hand-curated set of real input examples with expected outputs, used to score model behavior.

Prompt versioning

Treating prompts as code: stored, diffed, reviewed, and rolled back like any production artifact.

AI governance

Policies, processes, and controls that make an AI system auditable and accountable.

Confidence score

A scalar that estimates how reliable a model's output is for a given input.
