
Defined term

Labelled test set

A frozen, hand-curated set of real input examples with expected outputs, used to score model behavior.

A labelled test set captures the input distribution your workflow actually sees, paired with the answers a domain expert considers correct. It is the empirical ground truth used by the evaluation harness. A test set needs to be representative (covering edge cases as well as routine inputs), versioned (frozen at promotion time), and refreshed quarterly as inputs evolve.
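One way to make "frozen at promotion time" checkable is to record a content fingerprint when the set is promoted and compare it before each evaluation run. The sketch below is a minimal illustration under that assumption, not the API of any particular evaluation harness; `LabelledCase`, its fields, and `fingerprint` are hypothetical names chosen for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LabelledCase:
    """One test case (hypothetical schema): an input and its expert-approved output."""
    case_id: str
    input_text: str
    expected_output: str


def fingerprint(cases):
    """Return a SHA-256 hex digest that changes if any case is added, removed, or edited."""
    # Sort by case_id so the digest does not depend on the order of the list.
    ordered = sorted(cases, key=lambda c: c.case_id)
    canonical = json.dumps(
        [asdict(c) for c in ordered],
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the digest next to the prompt version means a later run can detect that the test set was silently edited after promotion, since any change to a case produces a different digest.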

When it matters

A labelled test set is required before any prompt is promoted to production. Without it, every prompt change is a guess and every regression is invisible. Skipping this step is the most expensive mistake an AI project can make.

Real example

A 500-case test set assembled in week 2 of Build: 50% routine cases (high-confidence path), 30% edge cases (typical errors from production logs), 20% adversarial cases (worst-case inputs). Each case has a graded expected output reviewed by a subject-matter expert.
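The 50/30/20 split above can be turned into target counts and checked against the categories actually present in a set. This is a rough sketch; the category labels `routine`, `edge`, and `adversarial` and the function names are assumptions made for the example.

```python
from collections import Counter

# Target mix in whole percent, taken from the example split above.
TARGET_MIX_PERCENT = {"routine": 50, "edge": 30, "adversarial": 20}


def target_counts(total, mix=TARGET_MIX_PERCENT):
    """Number of cases each category should contribute to a set of `total` cases."""
    return {category: total * percent // 100 for category, percent in mix.items()}


def composition_gaps(categories, mix=TARGET_MIX_PERCENT):
    """Per-category difference (actual minus target) for a list of category labels."""
    actual = Counter(categories)
    target = target_counts(len(categories), mix)
    return {category: actual.get(category, 0) - target[category] for category in mix}
```

For a 500-case set, `target_counts(500)` gives 250 routine, 150 edge, and 100 adversarial cases. A negative value from `composition_gaps` marks a category that still needs more cases.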

KPIs to watch

Track test set size (more than 200 cases as a minimum, more than 500 for production-critical workflows), test-set freshness (refreshed quarterly), and pass rate per prompt version (above 90% required for promotion).
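These thresholds can be combined into a single promotion check. The sketch below is illustrative only: `EvalRun`, `promotion_blockers`, and the reading of "quarterly" as 92 days are assumptions, and the comparisons are strict because the thresholds are stated as "greater than".

```python
from dataclasses import dataclass
from datetime import date

MIN_CASES = 200            # size must exceed this
MIN_CASES_CRITICAL = 500   # size must exceed this for production-critical workflows
MAX_AGE_DAYS = 92          # assumption: "quarterly" read as roughly 92 days
MIN_PASS_PERCENT = 90      # pass rate must exceed this


@dataclass
class EvalRun:
    """Summary of one evaluation of a prompt version (hypothetical schema)."""
    total_cases: int
    passed_cases: int
    last_refreshed: date
    production_critical: bool = False


def promotion_blockers(run, today):
    """Return the reasons a prompt version may not be promoted (empty list means go)."""
    blockers = []
    required = MIN_CASES_CRITICAL if run.production_critical else MIN_CASES
    if run.total_cases <= required:
        blockers.append(f"test set too small: {run.total_cases} cases, need more than {required}")
    if (today - run.last_refreshed).days > MAX_AGE_DAYS:
        blockers.append("test set stale: last refresh is more than a quarter old")
    # Integer comparison avoids floating-point surprises at exactly 90%.
    if run.passed_cases * 100 <= run.total_cases * MIN_PASS_PERCENT:
        blockers.append(f"pass rate not above {MIN_PASS_PERCENT}%")
    return blockers
```

Returning a list of reasons rather than a single boolean keeps the check easy to report: an empty list means all three KPIs are met, and anything else says which one failed.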
