Evals: Measure Before You Improve
Build a three-layer eval pyramid — heuristics, LLM-as-judge, and human review — so every prompt change is an experiment with a result.
The Friday deploy that nobody could prove was bad
Maya, an ML engineer at a fintech startup, rewrote a summarization prompt on Friday. Monday morning, Slack blew up: "Summaries are getting worse." But Maya had no numbers. No before-and-after comparison. No way to prove the old prompt was better — or that the new one was actually worse.
She rolled back on gut feeling. The team spent two hours debating whether the rollback was the right call. Nobody had data.
The lesson: Without evals, every prompt change is a guess. With evals, every prompt change is an experiment with a result. Maya built evals before her next change — and never had that argument again.
What is an eval?
An eval (short for evaluation) is a test suite for your AI's outputs. Just like unit tests catch bugs in code, evals catch regressions in prompt quality. (In the prompt design module, you iterated through prompt versions — evals are what tell you objectively whether version N+1 is better than version N.)
But there's a key difference: code is deterministic (same input → same output), while AI is probabilistic (same input → slightly different output each time). So you can't test one response — you need to test hundreds and measure the trend.
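A minimal sketch of what "measure the trend" means in practice — average a quality score over many outputs from each prompt version instead of eyeballing one response. Here `score_fn` is a stand-in for whatever metric you use (a heuristic check or a judge score):

```python
import statistics

def compare_prompts(score_fn, outputs_old, outputs_new):
    """Compare trends, not single responses: average a quality score
    over many outputs from each prompt version."""
    old_avg = statistics.mean(score_fn(o) for o in outputs_old)
    new_avg = statistics.mean(score_fn(o) for o in outputs_new)
    return old_avg, new_avg
```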
There Are No Dumb Questions
"Isn't this just QA? Why a fancy name?"
Traditional QA checks if the output is correct (pass/fail). Evals measure how good the output is on a scale — because AI outputs aren't just right or wrong, they're better or worse. "Did the summary capture the key points?" isn't a yes/no question. It's a 1-to-5 rating.
"How many test cases do I need?"
Start with 50. That's enough to spot big regressions. Scale to 200-500 for production systems. The golden rule: more test cases = more confidence, but even 50 is infinitely better than zero.
The eval pyramid: three layers, start at the bottom
Quick prediction: How would you test whether an AI-generated summary is good? Write down the first approach that comes to mind. Then see how the eval pyramid handles it — and why it uses three different types of checks instead of one.
Think of it like a water filter with three stages:
- Layer 1 (mesh screen) catches the big debris — broken JSON, wrong format, empty responses. Free and instant.
- Layer 2 (carbon filter) catches the subtler stuff — hallucinations, bad tone, missing information. Costs a few dollars.
- Layer 3 (UV steriliser) is the final check — a human expert reviews a sample. Expensive but irreplaceable.
You run Layer 1 on every commit. Layer 2 on every deploy. Layer 3 weekly. Bad outputs get caught at the cheapest layer possible.
Layer 1: Heuristic evals (the mesh screen)
These are simple code checks — no model needed:
- Is the output valid JSON?
- Does it contain the required fields?
- Is it within the length limit?
- Does it match the expected format (e.g., starts with a bullet point)?
These run in milliseconds, cost nothing, and catch the most embarrassing failures — like shipping a prompt that returns broken JSON to your frontend.
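A Layer 1 checker can be a few lines of plain code. A sketch, assuming the output is JSON with `summary` and `key_points` fields and a 600-character limit (adjust all of these to your own contract):

```python
import json

MAX_CHARS = 600  # hypothetical length limit; set to your spec

def heuristic_eval(raw_output: str) -> list[str]:
    """Layer 1 checks: return a list of failure reasons (empty list = pass)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid JSON"]  # no point checking further
    failures = []
    for field in ("summary", "key_points"):
        if field not in data:
            failures.append(f"missing field: {field}")
    if not str(data.get("summary", "")).strip():
        failures.append("empty summary")
    if len(raw_output) > MAX_CHARS:
        failures.append("over length limit")
    return failures
```

Run it on every output in CI; any non-empty list fails the build.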
Layer 2: LLM-as-judge (the carbon filter)
You use a second LLM to grade the first LLM's output. You give the judge a rubric:
Rate this summary on faithfulness (1-5):
Does it contain ONLY information from the source document?
1 = invents facts not in the source
5 = every claim traces back to the source
Source document: {source}
Summary to rate: {summary}
Score:
Run this on 200-500 test cases. Compare the old prompt's average score against the new prompt's average score. If the new prompt scores higher, ship it with confidence.
Cost: Under $1 per run at 2025 pricing (200 test cases × ~500 tokens each — costs have fallen significantly; verify at your provider's pricing page).
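In code, the judging loop is just: format the rubric, call the judge model, parse the score, average. A sketch — `call_llm` is a stand-in for your provider call (prompt string in, reply string out), and the digit-parsing is deliberately naive:

```python
import re
import statistics

JUDGE_TEMPLATE = """Rate this summary on faithfulness (1-5):
Does it contain ONLY information from the source document?
1 = invents facts not in the source
5 = every claim traces back to the source
Source document: {source}
Summary to rate: {summary}
Score:"""

def judge_one(call_llm, source, summary):
    """Ask the judge model for a 1-5 score and parse the first digit."""
    reply = call_llm(JUDGE_TEMPLATE.format(source=source, summary=summary))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

def average_faithfulness(call_llm, sources, summaries):
    """Average judge scores over the test set; compare this number
    across prompt versions."""
    scores = [judge_one(call_llm, src, summ)
              for src, summ in zip(sources, summaries)]
    return statistics.mean(scores)
```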
Layer 3: Human review (the UV steriliser)
A human expert reviews 50 outputs per week. This catches things no automated check can — awkward phrasing, culturally insensitive language, technically correct but unhelpful responses.
Expensive, but it's your ground truth. If the LLM-as-judge disagrees with humans, the humans are right and you need to recalibrate the judge.
How do common automated metrics compare as proxies for human judgment?
| Metric | Correlation with human preference | Notes |
|---|---|---|
| Exact match | Lowest | Only works when answers must be identical word-for-word |
| BLEU / ROUGE-L | Low–moderate | Designed for translation/summarization; poor on open-ended tasks |
| LLM-as-judge | High | GPT-4 as judge achieved ~80–85% agreement with humans on studied benchmarks (Zheng et al., 2023; results vary by task and model generation) |
| Human eval | Highest (benchmark) | The gold standard — but slow and expensive |
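You can measure judge-vs-human agreement for your own system, too. A sketch: score the same sample with both the judge and a human, then compute how often they agree within a tolerance — if the number drops, recalibrate the judge:

```python
def judge_human_agreement(judge_scores, human_scores, tolerance=0):
    """Fraction of items where the LLM judge is within `tolerance` points
    of the human rating on the same 1-5 scale."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must align item-for-item")
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```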
Golden sets: your reusable answer key
A golden set is a collection of inputs paired with ideal outputs — like a teacher's answer key. You write it once and reuse it every time you change a prompt.
Maya built a golden set of 200 examples:
| Input (document) | Golden output (ideal summary) |
|---|---|
| "Q3 revenue was $4.2M, up 18% YoY..." | "Revenue grew 18% to $4.2M in Q3." |
| "The outage lasted 47 minutes..." | "47-minute outage caused by database failover." |
| ... (198 more) | ... |
She ran both the old and new prompts against all 200 examples and scored each output with her LLM-as-judge:
| Prompt version | Faithfulness (avg) | Conciseness (avg) |
|---|---|---|
| Old prompt | 3.8 | 4.0 |
| New prompt | 4.2 | 4.3 |
The new prompt was measurably better on both criteria. She shipped it with a link to the eval results in the PR description — not a gut feeling.
The one unbreakable rule: Never use your golden set as training data. If the model "studies the test," your scores become meaningless. Keep eval data and training data completely separate.
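A golden set stores naturally as JSONL: one input/golden pair per line. A sketch of loading it and scoring a prompt version against it — `run_prompt` and `judge` are stand-ins for your model call and your LLM judge:

```python
import json

def load_golden_set(jsonl_lines):
    """One {"input": ..., "golden": ...} object per line.
    Keep this file out of every training pipeline."""
    return [json.loads(line) for line in jsonl_lines if line.strip()]

def average_judge_score(cases, run_prompt, judge):
    """Run one prompt version over the golden set and average the judge's
    1-5 scores; compare this number across versions."""
    scores = [judge(c["input"], run_prompt(c["input"]), c["golden"])
              for c in cases]
    return sum(scores) / len(scores)
```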
Two traps with LLM-as-judge
LLM judges have biases. Know them and compensate:
Trap 1: Positional bias
If you ask the judge to compare Response A vs. Response B, it tends to prefer whichever one comes first. Fix: Run the comparison in both orders (A-then-B and B-then-A) and average the scores.
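A sketch of the both-orders fix. It assumes a hypothetical `judge(first, second)` interface that returns `(score_for_first, score_for_second)`; any constant bias toward the first slot cancels out in the averages:

```python
def debiased_pair_scores(judge, resp_a, resp_b):
    """Run the comparison in both presentation orders and average each
    response's score, cancelling positional bias."""
    a_first, b_second = judge(resp_a, resp_b)   # A shown first
    b_first, a_second = judge(resp_b, resp_a)   # B shown first
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```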
Trap 2: Verbosity bias
Judges tend to rate longer responses higher, even when the shorter response is better. Fix: Score specific criteria with a rubric (faithfulness, conciseness, relevance) instead of asking "which is better overall?"
When to run each layer
| Event | Layer 1 (heuristic) | Layer 2 (LLM-judge) | Layer 3 (human) |
|---|---|---|---|
| Every git commit / CI push | Yes | No | No |
| Every deploy to staging | Yes | Yes | No |
| Weekly review cycle | Yes | Yes | Yes |
| Prompt change | Yes | Yes | No |
| Major model migration | Yes | Yes | Yes |
Key takeaways
- Without evals, every prompt change is a guess. With evals, it's an experiment with a result.
- Start with heuristic evals — they're free, instant, and catch the most embarrassing failures. You can ship them in an afternoon.
- Build a golden set of 50-200 examples — write it once, reuse it forever. Never mix it into training data.
- LLM-as-judge scales to hundreds of test cases, but watch for positional and verbosity bias.
- Humans are the ground truth. When the judge disagrees with humans, recalibrate the judge.
Knowledge Check
1. What is the key difference between reference-based and reference-free evaluation, and which is more practical for open-ended generation tasks?
2. An LLM judge is asked to score response quality on a 1–5 scale. What are two known biases that affect LLM-as-judge scores, and how can you mitigate them?
3. ROUGE-L measures what, and why is it a poor metric for tasks like summarization with multiple valid phrasings?
4. You change your system prompt and want to know if quality improved. You have 500 historical examples with recorded outputs. Describe the minimum viable eval process.