Evals: Measure Before You Improve
Build a three-layer eval pyramid — heuristics, LLM-as-judge, and human review — so every prompt change is an experiment with a result.
The Friday deploy that nobody could prove was bad
Maya, an ML engineer at a fintech startup, rewrote a summarization prompt on Friday. Monday morning, Slack blew up: "Summaries are getting worse." But Maya had no numbers. No before-and-after comparison. No way to prove the old prompt was better — or that the new one was actually worse.
She rolled back on gut feeling. The team spent two hours debating whether the rollback was the right call. Nobody had data.
The lesson: Without evals, every prompt change is a guess. With evals, every prompt change is an experiment with a result. Maya built evals before her next change — and never had that argument again.
What is an eval?
An eval (short for evaluation) is a test suite for your AI's outputs. Just like unit tests catch bugs in code, evals catch regressions in prompt quality. (In the prompt design module, you iterated through prompt versions — evals are what tell you objectively whether version N+1 is better than version N.)
But there's a key difference: code is deterministic (same input → same output), while AI is probabilistic (same input → slightly different output each time). So you can't test one response — you need to test hundreds and measure the trend.
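A minimal sketch of what "measure the trend" means in practice — average a quality score over many outputs from each prompt version instead of eyeballing one response. Here `score_fn` is a stand-in for whatever metric you use (a heuristic check or a judge score):

```python
import statistics

def compare_prompts(score_fn, outputs_old, outputs_new):
    """Compare trends, not single responses: average a quality score
    over many outputs from each prompt version."""
    old_avg = statistics.mean(score_fn(o) for o in outputs_old)
    new_avg = statistics.mean(score_fn(o) for o in outputs_new)
    return old_avg, new_avg
```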
There Are No Dumb Questions
"Isn't this just QA? Why a fancy name?"
Traditional QA checks if the output is correct (pass/fail). Evals measure how good the output is on a scale — because AI outputs aren't just right or wrong, they're better or worse. "Did the summary capture the key points?" isn't a yes/no question. It's a 1-to-5 rating.
"How many test cases do I need?"
Start with 50. That's enough to spot big regressions. Scale to 200-500 for production systems. The golden rule: more test cases = more confidence, but even 50 is infinitely better than zero.
The eval pyramid: three layers, start at the bottom
Quick prediction: How would you test whether an AI-generated summary is good? Write down the first approach that comes to mind. Then see how the eval pyramid handles it — and why it uses three different types of checks instead of one.
Think of it like a water filter with three stages:
- Layer 1 (mesh screen) catches the big debris — broken JSON, wrong format, empty responses. Free and instant.
- Layer 2 (carbon filter) catches the subtler stuff — hallucinations, bad tone, missing information. Costs a few dollars.
- Layer 3 (UV steriliser) is the final check — a human expert reviews a sample. Expensive but irreplaceable.
You run Layer 1 on every commit. Layer 2 on every deploy. Layer 3 weekly. Bad outputs get caught at the cheapest layer possible.
Layer 1: Heuristic evals (the mesh screen)
These are simple code checks — no model needed:
- Is the output valid JSON?
- Does it contain the required fields?
- Is it within the length limit?
- Does it match the expected format (e.g., starts with a bullet point)?
These run in milliseconds, cost nothing, and catch the most embarrassing failures — like shipping a prompt that returns broken JSON to your frontend.
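A Layer 1 checker can be a few lines of plain code. A sketch, assuming the output is JSON with `summary` and `key_points` fields and a 600-character limit (adjust all of these to your own contract):

```python
import json

MAX_CHARS = 600  # hypothetical length limit; set to your spec

def heuristic_eval(raw_output: str) -> list[str]:
    """Layer 1 checks: return a list of failure reasons (empty list = pass)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid JSON"]  # no point checking further
    failures = []
    for field in ("summary", "key_points"):
        if field not in data:
            failures.append(f"missing field: {field}")
    if not str(data.get("summary", "")).strip():
        failures.append("empty summary")
    if len(raw_output) > MAX_CHARS:
        failures.append("over length limit")
    return failures
```

Run it on every output in CI; any non-empty list fails the build.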
Layer 2: LLM-as-judge (the carbon filter)
You use a second LLM to grade the first LLM's output. You give the judge a rubric:
Rate this summary on faithfulness (1-5):
Does it contain ONLY information from the source document?
1 = invents facts not in the source
5 = every claim traces back to the source
Source document: {source}
Summary to rate: {summary}
Score:
Run this on 200-500 test cases. Compare the old prompt's average score against the new prompt's average score. If the new prompt scores higher, ship it with confidence.
Cost: Under $1 per run at 2025 pricing (200 test cases × ~500 tokens each — costs have fallen significantly; verify at your provider's pricing page).
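In code, the judging loop is just: format the rubric, call the judge model, parse the score, average. A sketch — `call_llm` is a stand-in for your provider call (prompt string in, reply string out), and the digit-parsing is deliberately naive:

```python
import re
import statistics

JUDGE_TEMPLATE = """Rate this summary on faithfulness (1-5):
Does it contain ONLY information from the source document?
1 = invents facts not in the source
5 = every claim traces back to the source
Source document: {source}
Summary to rate: {summary}
Score:"""

def judge_one(call_llm, source, summary):
    """Ask the judge model for a 1-5 score and parse the first digit."""
    reply = call_llm(JUDGE_TEMPLATE.format(source=source, summary=summary))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

def average_faithfulness(call_llm, sources, summaries):
    """Average judge scores over the test set; compare this number
    across prompt versions."""
    scores = [judge_one(call_llm, src, summ)
              for src, summ in zip(sources, summaries)]
    return statistics.mean(scores)
```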
Layer 3: Human review (the UV steriliser)
A human expert reviews 50 outputs per week. This catches things no automated check can — awkward phrasing, culturally insensitive language, technically correct but unhelpful responses.
Expensive, but it's your ground truth. If the LLM-as-judge disagrees with humans, the humans are right and you need to recalibrate the judge.
How do common automated metrics compare as proxies for human judgment?
| Metric | Correlation with human preference | Notes |
|---|---|---|
| Exact match | Lowest | Only works when answers must be identical word-for-word |
| BLEU / ROUGE-L | Low–moderate | Designed for translation/summarization; poor on open-ended tasks |
| LLM-as-judge | High | GPT-4 as judge achieved ~80–85% agreement with humans on studied benchmarks (Zheng et al., 2023; results vary by task and model generation) |
| Human eval | Highest (benchmark) | The gold standard — but slow and expensive |
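You can measure judge-vs-human agreement for your own system, too. A sketch: score the same sample with both the judge and a human, then compute how often they agree within a tolerance — if the number drops, recalibrate the judge:

```python
def judge_human_agreement(judge_scores, human_scores, tolerance=0):
    """Fraction of items where the LLM judge is within `tolerance` points
    of the human rating on the same 1-5 scale."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must align item-for-item")
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```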
Golden sets: your reusable answer key
A golden set is a collection of inputs paired with ideal outputs — like a teacher's answer key. You write it once and reuse it every time you change a prompt.
Maya built a golden set of 200 examples:
| Input (document) | Golden output (ideal summary) |
|---|---|
| "Q3 revenue was $4.2M, up 18% YoY..." | "Revenue grew 18% to $4.2M in Q3." |
| "The outage lasted 47 minutes..." | "47-minute outage caused by database failover." |
| ... (198 more) | ... |
She ran both the old and new prompts against all 200 examples and scored each output with her LLM-as-judge:
| Prompt version | Faithfulness (avg) | Conciseness (avg) |
|---|---|---|
| Old prompt | 3.8 | 4.0 |
| New prompt | 4.2 | 4.3 |
The new prompt was measurably better on both criteria. She shipped it with a link to the eval results in the PR description — not a gut feeling.
The one unbreakable rule: Never use your golden set as training data. If the model "studies the test," your scores become meaningless. Keep eval data and training data completely separate.
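A golden set stores naturally as JSONL: one input/golden pair per line. A sketch of loading it and scoring a prompt version against it — `run_prompt` and `judge` are stand-ins for your model call and your LLM judge:

```python
import json

def load_golden_set(jsonl_lines):
    """One {"input": ..., "golden": ...} object per line.
    Keep this file out of every training pipeline."""
    return [json.loads(line) for line in jsonl_lines if line.strip()]

def average_judge_score(cases, run_prompt, judge):
    """Run one prompt version over the golden set and average the judge's
    1-5 scores; compare this number across versions."""
    scores = [judge(c["input"], run_prompt(c["input"]), c["golden"])
              for c in cases]
    return sum(scores) / len(scores)
```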
Two traps with LLM-as-judge
LLM judges have biases. Know them and compensate:
Trap 1: Positional bias
If you ask the judge to compare Response A vs. Response B, it tends to prefer whichever one comes first. Fix: Run the comparison in both orders (A-then-B and B-then-A) and average the scores.
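A sketch of the both-orders fix. It assumes a hypothetical `judge(first, second)` interface that returns `(score_for_first, score_for_second)`; any constant bias toward the first slot cancels out in the averages:

```python
def debiased_pair_scores(judge, resp_a, resp_b):
    """Run the comparison in both presentation orders and average each
    response's score, cancelling positional bias."""
    a_first, b_second = judge(resp_a, resp_b)   # A shown first
    b_first, a_second = judge(resp_b, resp_a)   # B shown first
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```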
Trap 2: Verbosity bias
Judges tend to rate longer responses higher, even when the shorter response is better. Fix: Score specific criteria with a rubric (faithfulness, conciseness, relevance) instead of asking "which is better overall?"
When to run each layer
| Event | Layer 1 (heuristic) | Layer 2 (LLM-judge) | Layer 3 (human) |
|---|---|---|---|
| Every git commit / CI push | Yes | No | No |
| Every deploy to staging | Yes | Yes | No |
| Weekly review cycle | Yes | Yes | Yes |
| Prompt change | Yes | Yes | No |
| Major model migration | Yes | Yes | Yes |
Key takeaways
- Without evals, every prompt change is a guess. With evals, it's an experiment with a result.
- Start with heuristic evals — they're free, instant, and catch the most embarrassing failures. You can ship them in an afternoon.
- Build a golden set of 50-200 examples — write it once, reuse it forever. Never mix it into training data.
- LLM-as-judge scales to hundreds of test cases, but watch for positional and verbosity bias.
- Humans are the ground truth. When the judge disagrees with humans, recalibrate the judge.
Knowledge Check
1. What is the key difference between reference-based and reference-free evaluation, and which is more practical for open-ended generation tasks?
2. An LLM judge is asked to score response quality on a 1–5 scale. What are two known biases that affect LLM-as-judge scores, and how can you mitigate them?
3. ROUGE-L measures what, and why is it a poor metric for tasks like summarization with multiple valid phrasings?
4. You change your system prompt and want to know if quality improved. You have 500 historical examples with recorded outputs. Describe the minimum viable eval process.