Working with AI Engineering Teams
Write AI feature specs that give engineering a clear build target, eval criteria, and a rollback trigger.
It's 2 AM, and your feature just broke production
Picture this. You're a PM named Priya. Your team spent six weeks building an "AI meeting notes" feature. Two weeks before launch, your lead engineer Dani pulls you aside and says:
"We don't actually know if this thing is good."
Your stomach drops. You open your spec. It reads:
"Build a feature that summarizes meeting recordings. Should be accurate and fast."
That looked complete when you wrote it. But it left four massive questions unanswered — and those unanswered questions just ate six weeks of your team's life.
So what went wrong? And how do you make sure it never happens to you?
That's what this entire lesson is about. By the end, you'll know how to write a spec so clear that engineering can build against it without a single follow-up Slack thread.
The handoff is where AI features go to die
Here's the thing most PMs don't realize: AI features don't fail because of bad code. They fail because the spec never defined what "good" looks like. That gap — between what you meant and what engineering built — costs months of re-work, triggers post-launch rollbacks, and breaks trust with your engineering team.
Think about it like baking a cake. If you tell someone "make a good cake," you'll get... something. Maybe it's chocolate. Maybe it's vanilla. Maybe it's three tiers with fondant. You didn't say. But if you say "make a single-layer chocolate cake, 9 inches, that scores at least 4 out of 5 in a taste test with 10 people" — now your baker knows exactly what to aim for.
AI specs work the same way. "Accurate" is not a spec. A number is a spec.
Spot the problem
The golden rule: Build the test BEFORE the feature
This is the single most important pattern you'll learn in this lesson. Here's the sequence that actually works:

1. The PM writes the spec, including eval criteria with numbers.
2. Engineering builds the eval harness against those criteria.
3. Engineering builds the feature.
4. Every build runs against the harness until it passes.
5. Ship.

Notice something? Engineering builds the eval harness FIRST — before writing a single line of feature code. An eval harness is just an automated test suite that checks whether the AI output meets your spec.
Why? Because without it, "done" stays subjective. With it, "done" is a green checkmark on a dashboard.
Think of it like building a house. You wouldn't start pouring concrete without a blueprint, right? The eval harness IS your blueprint. It's how everyone agrees on what the finished house looks like before anyone picks up a hammer.
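To make the idea concrete, here's a minimal sketch of what an eval harness can look like. Everything in it is a hypothetical placeholder: the `summarize` stand-in, the toy word-overlap metric, and the one-example golden set. A real harness would call your actual feature and use the metric named in your spec.

```python
# Minimal eval-harness sketch: run every golden example through the
# feature, score the output, and compare the average to the spec's
# threshold. All names below are hypothetical placeholders.

def summarize(transcript: str) -> str:
    # Stand-in for the real AI feature under test.
    return transcript[:60]

def score(candidate: str, reference: str) -> float:
    # Toy metric: fraction of reference words that appear in the
    # candidate. A real harness would use ROUGE-L or human ratings.
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / len(ref) if ref else 0.0

GOLDEN_SET = [
    {"input": "Team agreed to ship Friday after QA signs off.",
     "reference": "Ship Friday pending QA sign-off."},
    # ...grow this set over time; even 50 examples beat zero.
]

def run_evals(threshold: float = 0.65) -> bool:
    scores = [score(summarize(ex["input"]), ex["reference"])
              for ex in GOLDEN_SET]
    average = sum(scores) / len(scores)
    print(f"average score: {average:.2f} (threshold: {threshold})")
    return average >= threshold

print("PASS" if run_evals() else "FAIL")
```

The point is the shape, not the details: golden examples in, scores out, one agreed threshold that turns "done" into a boolean. (With the toy summarizer above, this prints FAIL — which is exactly the kind of signal you want before launch, not after.)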
And here's the part that makes your life easier as a PM: once you approve the eval baseline, you have a concrete artifact for stakeholders. Instead of saying "the team is building it," you can say "we agreed on passing criteria and engineering is building to them." That's a much better answer in a status meeting.
There Are No Dumb Questions
Q: What if we don't have enough test data to build an eval harness?
A: That's actually a really common problem. Start small — even 50 hand-labeled examples are better than zero. You can grow the set over time. The point isn't perfection; it's having something to measure against instead of vibes.

Q: Does the PM need to understand how the eval harness works technically?
A: You don't need to read the code. But you DO need to understand what it tests, what the pass/fail thresholds are, and how to read the results dashboard. Think of it like understanding a blood test report — you don't need to run the lab, but you need to know what the numbers mean.

Q: What if engineering pushes back on building the eval first?
A: This is normal. Building the eval first feels slower at the start. But show them the alternative: six weeks of building, then realizing nobody knows if the thing works. The eval-first approach is like stretching before a run — it takes five minutes but saves you from a sprained ankle.
Think like an engineer
Vague specs vs. real specs: spot the difference
Let's go back to Priya's story. Here's what her original spec said versus what she rewrote it to say:
| Element | Vague spec (before) | Real spec (after) |
|---|---|---|
| Speed | "Fast" | Max latency: 30 seconds for a 1-hour meeting |
| Accuracy | "Accurate" | ROUGE-L score of 0.65+ on 200 golden examples |
| Edge cases | (not mentioned) | Audio below 60% speech-to-noise ratio shows a warning instead of silently producing garbage |
| Scope | (not mentioned) | English only for v1, with explicit expansion plan |
| Rollback plan | (not mentioned) | A/B test with rollback trigger if user satisfaction drops below 3.5/5 |
What's ROUGE-L? It's a standard way to measure how closely an AI summary matches a human-written reference summary. It works by finding the longest sequence of words that appears (in order) in both the AI summary and the human reference, then computing an F1 score from that. A score of 0.65 is a strong result; a score near 0 means the summaries share almost nothing. You don't need to memorize the formula — you just need to know it gives you a number instead of a feeling.
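If you're curious what that computation actually looks like, here's a sketch of LCS-based ROUGE-L F1 using plain whitespace tokenization. Real implementations, such as the `rouge-score` package, add stemming and smarter tokenization, so treat this as an illustration of the idea, not a production scorer.

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x == y
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    # F1 over the longest in-order word overlap between the two texts.
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the team will ship friday",
              "the team ships on friday"))  # ≈ 0.6
```

Here the longest shared in-order word sequence is "the team ... friday" (3 of 5 words on each side), which is why the score lands at roughly 0.6.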
And speech-to-noise ratio? That's just the ratio of clear speech to background noise. If someone's recording a meeting at a construction site, the audio quality is low, and the AI will produce bad summaries. Priya's spec said: "Don't silently produce bad output. Show a warning instead." That's a product decision, not an engineering decision — and it belongs in YOUR spec.
The results tell the story
Before the rewrite: Dani's team logged 47 Slack threads asking Priya to clarify what "good" meant.
After the rewrite: Zero follow-up questions. A two-week build cycle. A launch where Priya could point to a dashboard and say "ROUGE-L is 0.71 — we shipped."
The rewrite took four hours. The vague spec had already cost six weeks of uncertainty.
There Are No Dumb Questions
Q: What's a "golden example"?
A: It's a hand-curated input-output pair that represents what a correct response looks like. If you're building a meeting summarizer, a golden example would be: a specific meeting recording (input) paired with a human-written summary of that meeting (output). You use these to test whether the AI's output is close enough to the "gold standard."

Q: Do I need to come up with all these numbers myself?
A: Nope. Your engineer or data scientist will help you pick the right metric and the right threshold. YOUR job is to make sure a number EXISTS in the spec. If you write "accurate" without a number next to it, that's your signal to stop and ask: "What number would make us confident this is good enough?"

Q: What if I pick the wrong number?
A: That's totally fine. The point of the eval baseline is that you can adjust it. Pick a reasonable starting number, measure against it, and tune. A wrong number is infinitely better than no number — because at least you can have a conversation about it.
Rewrite the vague spec
The five elements of an AI feature spec
Every AI feature spec you write needs exactly five things. Think of them like the five fingers on your hand — if one is missing, your grip on the project is weak.
1. User story — Who wants what, and why?
"As a developer, I want the AI to draft a PR description from my diff so that I spend less time writing boilerplate."
2. Success metric with a specific number — Not "good." Not "fast." A number.
"Relevance score >= 0.80 as rated by a panel of 3 reviewers on a 50-example test set."
3. Eval criteria — what does PASS look like? — Write out the exact conditions.
"PASS = relevance >= 0.80 AND generated in under 3 seconds AND no hallucinated function names."
4. At least one edge case and how to handle it — What happens when things go weird?
"If the diff is larger than 2,000 lines, summarize only the first 2,000 and append: 'Note: large diff truncated.'"
5. A rollback trigger — What number tells you to pull the plug?
"If user satisfaction drops below 3.0/5 on a 7-day rolling average, auto-disable the feature and alert the PM."
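Element 3 maps almost line-for-line into code. Here's a sketch of that PASS condition as a boolean check; the dictionary keys are illustrative, not a real API, and a real harness would fill them in from a live eval run.

```python
# Element 3's PASS condition expressed as code. The field names are
# hypothetical; the eval harness would populate them per run.

def passes(result: dict) -> bool:
    return (result["relevance"] >= 0.80
            and result["latency_seconds"] < 3.0
            and result["hallucinated_function_names"] == 0)

print(passes({"relevance": 0.85,
              "latency_seconds": 2.1,
              "hallucinated_function_names": 0}))  # → True
```

Notice the AND: one slow response or one invented function name fails the run, exactly as the spec says.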
That rollback trigger isn't bureaucracy — it's insurance. It pre-commits everyone to the definition of failure before emotions run high. When the feature is live and numbers are dropping, you don't want to be arguing about whether it's "bad enough" to roll back. You want a number that everyone agreed to in advance.
Think of it like a circuit breaker in your house. You don't debate whether the wiring is overloaded while your kitchen is on fire. The breaker trips automatically at a preset threshold. Your rollback trigger works the same way.
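Here's a sketch of that circuit breaker, assuming a feed of daily satisfaction scores. The 3.0 threshold and 7-day window come straight from element 5; the data source and function name are placeholders.

```python
from collections import deque

WINDOW_DAYS = 7
THRESHOLD = 3.0  # from the spec: below 3.0/5 on a 7-day rolling average

def should_rollback(daily_satisfaction):
    # Only trip the breaker once a full window of data exists.
    window = deque(daily_satisfaction, maxlen=WINDOW_DAYS)
    if len(window) < WINDOW_DAYS:
        return False
    return sum(window) / len(window) < THRESHOLD

# Seven days hovering just under the line trips the breaker:
print(should_rollback([3.1, 2.9, 2.8, 3.0, 2.9, 2.7, 2.8]))  # → True
```

The design choice worth noticing: the threshold is a named constant taken from the spec, not a judgment call made while the dashboard is red.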
There Are No Dumb Questions
Q: What if my feature doesn't have an obvious metric?
A: Every feature can be measured. If you can't find a standard metric, use human ratings. Have 3 people rate outputs on a 1-5 scale. That's a metric. "Average human rating >= 3.8 on 100 samples" is specific enough for an engineer to build a test around.

Q: Do I really need an edge case in every spec?
A: Yes. AI systems fail on edge cases constantly — long inputs, empty inputs, non-English text, noisy audio. If you don't specify what happens, engineering will make the decision for you. And they'll make it based on what's easiest to code, not what's best for the user.

Q: Who decides the rollback threshold?
A: You do, but in collaboration with engineering and leadership. The PM proposes, the team debates, and everyone signs off before code is written. That's the whole point — it's a decision made calmly, not in a crisis.
Fill in the blanks
"Accurate" is not a metric — say it with me
This deserves its own section because it's the single most common mistake PMs make in AI specs.
Every time you catch yourself writing one of these words, stop and replace it with a number:
| Don't write this | Write this instead |
|---|---|
| "Accurate" | "ROUGE-L >= 0.65 on 200 golden examples" |
| "Fast" | "p95 latency <= 2,000 ms" |
| "Relevant" | "Relevance score >= 0.82 rated by 3 reviewers" |
| "High quality" | "Average human rating >= 4.0/5 on 100 test cases" |
| "Minimal hallucinations" | "Hallucination rate < 2% on fact-checkable claims" |
| "Good user experience" | "User satisfaction >= 3.5/5 on in-app survey" |
See the pattern? Every vague word becomes a metric name + a number + a dataset. That's the formula. Burn it into your brain.
Here's a trick that works every time: when you finish writing your spec, do a Ctrl+F for the words "accurate," "fast," "relevant," "good," and "quality." If any of them appear without a number next to them, your spec isn't done yet.
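That Ctrl+F habit is easy to automate. This sketch flags any sentence that uses a vague word but contains no digit at all; the word list and the sentence-level heuristic are assumptions for illustration, not a standard tool.

```python
import re

# Vague words that should never appear without a number nearby.
VAGUE_WORDS = {"accurate", "fast", "relevant", "good", "quality"}

def flag_vague_claims(spec_text: str) -> list:
    """Return sentences that use a vague word with no number in them."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", spec_text):
        words = {w.strip('.,;:"').lower() for w in sentence.split()}
        if words & VAGUE_WORDS and not re.search(r"\d", sentence):
            flagged.append(sentence.strip())
    return flagged

spec = ("The summary should be accurate. "
        "p95 latency must stay fast, under 2,000 ms.")
print(flag_vague_claims(spec))  # → ['The summary should be accurate.']
```

The second sentence passes because "fast" sits next to a number; the first gets flagged because "accurate" is standing alone, exactly the case the lesson warns about.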
Vague word detector
The big challenge: Write a real spec
You've learned the pattern. You've seen what vague looks like and what specific looks like. Now put it all together.
Challenge
Back to Priya and Dani
Two sprints later, Priya walked into the same meeting room, but with a different spec in her hand. Six pages. Role definition, success criteria, a golden evaluation set of 200 meeting recordings, a ROUGE-L threshold of 0.65, and a rollback trigger at 0.55.
Dani read it in ten minutes. "This is what we needed six weeks ago," she said.
"I know," said Priya. "We build the test first now."
The feature shipped. No follow-up Slack threads.
Key takeaways
- Build the test before the feature. Put eval criteria in your spec so engineering has a test harness to build BEFORE writing feature code. It's like agreeing on what "done" tastes like before you start cooking.
- "Accurate" is never a metric. Every time you write a vague word, replace it with a number: "ROUGE-L score of 0.65+ on our 200-example golden set" is a success metric; "accurate" is not.
- Include a rollback trigger in every spec. It shows you planned for when things go wrong, not just when they go right. It's your circuit breaker — it trips automatically so nobody has to argue during a crisis.
Knowledge Check
1. An engineer tells you 'the model needs more data to perform well on this slice.' What does that mean in practice, and what product decision does it force?
2. Which of the following is a complete AI acceptance criterion?
3. Your team runs an A/B test comparing two system prompts. The new prompt scores higher on relevance but lower on safety evals. How should you make the call?
4. What is output drift, how would you detect it, and who owns the response when it occurs?