Defined term
Labelled test set
A frozen, hand-curated set of real input examples with expected outputs, used to score model behavior.
A labelled test set captures the input distribution your workflow actually sees, paired with the answers a domain expert considers correct. It is the empirical ground truth the evaluation harness scores against. A test set needs to be representative (covering edge cases as well as routine inputs), versioned (each version frozen at promotion time), and refreshed quarterly as inputs evolve, with each refresh producing a new frozen version rather than editing the old one.
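One way to make "versioned" and "frozen" concrete is to fingerprint the test set's contents, so any change to a case produces a different version identifier. The sketch below is illustrative only: the `LabelledCase` fields and the `fingerprint` helper are hypothetical names, not part of any particular harness.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: a case cannot be mutated after creation
class LabelledCase:
    # Hypothetical schema; adapt the fields to your own workflow.
    case_id: str
    input_text: str
    expected_output: str
    category: str  # e.g. "routine", "edge", "adversarial"


def fingerprint(cases):
    """Return a SHA-256 hex digest that identifies this exact set of cases."""
    # Sort by case_id so the order of cases does not affect the digest.
    ordered = sorted(cases, key=lambda c: c.case_id)
    payload = json.dumps([asdict(c) for c in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Recording the digest alongside each promoted prompt version lets you tell later whether two pass rates were measured against the same test set.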
When it matters
Required before any prompt is promoted to production. Without it, every prompt change is a guess and every regression is invisible. Skipping this step is among the most expensive mistakes an AI project can make.
Real example
A 500-case test set assembled in week 2 of Build: 50% routine cases (high-confidence path), 30% edge cases (typical errors from production logs), 20% adversarial cases (worst-case inputs). Each case has a graded expected output reviewed by a subject-matter expert.
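The 50/30/20 split translates into concrete case counts. The sketch below works that arithmetic out and also reports the mix of an existing set; `TARGET_MIX`, `target_counts`, and `actual_mix` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Percentages taken from the example above.
TARGET_MIX = {"routine": 50, "edge": 30, "adversarial": 20}


def target_counts(total, mix=TARGET_MIX):
    """Return how many cases each category should contribute."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Floor division: for totals not divisible evenly, counts may sum
    # to slightly less than `total`.
    return {name: total * pct // 100 for name, pct in mix.items()}


def actual_mix(categories):
    """Return the percentage of cases in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


print(target_counts(500))
# {'routine': 250, 'edge': 150, 'adversarial': 100}
```

Comparing `actual_mix` against the target after each refresh is a cheap way to notice when a set has drifted toward routine cases.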
KPIs to watch
Test set size (more than 200 cases at minimum, more than 500 for production-critical workflows); test-set freshness (refreshed quarterly); pass rate per prompt version (above 90% required for promotion).
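These thresholds can be checked mechanically before a prompt is promoted. The following is a minimal sketch under stated assumptions: `promotion_blockers` is a hypothetical helper, each result is a simple pass/fail boolean, and "refreshed quarterly" is approximated as "no older than 92 days".

```python
from datetime import date


def promotion_blockers(results, last_refreshed, today,
                       production_critical=False, max_age_days=92):
    """Return the reasons a prompt version may not be promoted (empty if none).

    results: list of booleans, one per test case (True = passed).
    max_age_days: assumed stand-in for "refreshed quarterly".
    """
    blockers = []

    # Size: more than 200 cases, or more than 500 if production-critical.
    min_size = 500 if production_critical else 200
    if len(results) <= min_size:
        blockers.append(f"test set has {len(results)} cases, need more than {min_size}")

    # Freshness: the set must have been refreshed within the last quarter.
    age_days = (today - last_refreshed).days
    if age_days > max_age_days:
        blockers.append(f"test set is {age_days} days old, limit is {max_age_days}")

    # Pass rate: strictly above 90%, compared in integers to avoid float error.
    if not results or sum(results) * 100 <= 90 * len(results):
        blockers.append("pass rate is not above 90%")

    return blockers
```

Returning a list of reasons rather than a single boolean makes it clear which KPI blocked a promotion.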
Related terms
Evaluation harness
A test framework that scores model or prompt output against a labelled set of expected outputs.
Prompt versioning
Treating prompts as code: stored, diffed, reviewed, and rolled back like any production artifact.
Confidence score
A scalar that estimates how reliable a model's output is for a given input.