Working with AI Engineering Teams
Write AI feature specs that give engineering a clear build target, eval criteria, and a rollback trigger.
It's 2 AM, and your feature just broke production
Picture this. You're a PM named Priya. Your team spent six weeks building an "AI meeting notes" feature. Two weeks before launch, your lead engineer Dani pulls you aside and says:
"We don't actually know if this thing is good."
Your stomach drops. You open your spec. It reads:
"Build a feature that summarizes meeting recordings. Should be accurate and fast."
That looked complete when you wrote it. But it left four massive questions unanswered — and those unanswered questions just ate six weeks of your team's life.
So what went wrong? And how do you make sure it never happens to you?
That's what this entire lesson is about. By the end, you'll know how to write a spec so clear that engineering can build against it without a single follow-up Slack thread.
The handoff is where AI features go to die
Here's the thing most PMs don't realize: AI features don't fail because of bad code. They fail because the spec never defined what "good" looks like. That gap — between what you meant and what engineering built — costs months of re-work, triggers post-launch rollbacks, and breaks trust with your engineering team.
Think about it like baking a cake. If you tell someone "make a good cake," you'll get... something. Maybe it's chocolate. Maybe it's vanilla. Maybe it's three tiers with fondant. You didn't say. But if you say "make a single-layer chocolate cake, 9 inches, that scores at least 4 out of 5 in a taste test with 10 people" — now your baker knows exactly what to aim for.
AI specs work the same way. "Accurate" is not a spec. A number is a spec.
Spot the problem
The golden rule: Build the test BEFORE the feature
This is the single most important pattern you'll learn in this lesson. Here's the sequence that actually works:

1. The PM writes the spec, including eval criteria with numbers.
2. Engineering builds the eval harness against those criteria.
3. Engineering builds the feature.
4. Every build runs against the harness until it passes.
5. Ship.

Notice something? Engineering builds the eval harness FIRST — before writing a single line of feature code. An eval harness is just an automated test suite that checks whether the AI output meets your spec.
Why? Because without it, "done" stays subjective. With it, "done" is a green checkmark on a dashboard.
Think of it like building a house. You wouldn't start pouring concrete without a blueprint, right? The eval harness IS your blueprint. It's how everyone agrees on what the finished house looks like before anyone picks up a hammer.
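To make the idea concrete, here's a minimal sketch of what an eval harness can look like. Everything in it is a hypothetical placeholder: the `summarize` stand-in, the toy word-overlap metric, and the one-example golden set. A real harness would call your actual feature and use the metric named in your spec.

```python
# Minimal eval-harness sketch: run every golden example through the
# feature, score the output, and compare the average to the spec's
# threshold. All names below are hypothetical placeholders.

def summarize(transcript: str) -> str:
    # Stand-in for the real AI feature under test.
    return transcript[:60]

def score(candidate: str, reference: str) -> float:
    # Toy metric: fraction of reference words that appear in the
    # candidate. A real harness would use ROUGE-L or human ratings.
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / len(ref) if ref else 0.0

GOLDEN_SET = [
    {"input": "Team agreed to ship Friday after QA signs off.",
     "reference": "Ship Friday pending QA sign-off."},
    # ...grow this set over time; even 50 examples beat zero.
]

def run_evals(threshold: float = 0.65) -> bool:
    scores = [score(summarize(ex["input"]), ex["reference"])
              for ex in GOLDEN_SET]
    average = sum(scores) / len(scores)
    print(f"average score: {average:.2f} (threshold: {threshold})")
    return average >= threshold

print("PASS" if run_evals() else "FAIL")
```

The point is the shape, not the details: golden examples in, scores out, one agreed threshold that turns "done" into a boolean. (With the toy summarizer above, this prints FAIL — which is exactly the kind of signal you want before launch, not after.)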
And here's the part that makes your life easier as a PM: once you approve the eval baseline, you have a concrete artifact for stakeholders. Instead of saying "the team is building it," you can say "we agreed on passing criteria and engineering is building to them." That's a much better answer in a status meeting.
There Are No Dumb Questions
Q: What if we don't have enough test data to build an eval harness?
A: That's actually a really common problem. Start small — even 50 hand-labeled examples are better than zero. You can grow the set over time. The point isn't perfection; it's having something to measure against instead of vibes.

Q: Does the PM need to understand how the eval harness works technically?
A: You don't need to read the code. But you DO need to understand what it tests, what the pass/fail thresholds are, and how to read the results dashboard. Think of it like understanding a blood test report — you don't need to run the lab, but you need to know what the numbers mean.

Q: What if engineering pushes back on building the eval first?
A: This is normal. Building the eval first feels slower at the start. But show them the alternative: six weeks of building, then realizing nobody knows if the thing works. The eval-first approach is like stretching before a run — it takes five minutes but saves you from a sprained ankle.
Think like an engineer
Vague specs vs. real specs: spot the difference
Let's go back to Priya's story. Here's what her original spec said versus what she rewrote it to say:
| Element | Vague spec (before) | Real spec (after) |
|---|---|---|
| Speed | "Fast" | Max latency: 30 seconds for a 1-hour meeting |
| Accuracy | "Accurate" | ROUGE-L score of 0.65+ on 200 golden examples |
| Edge cases | (not mentioned) | Audio below 60% speech-to-noise ratio shows a warning instead of silently producing garbage |
| Scope | (not mentioned) | English only for v1, with explicit expansion plan |
| Rollback plan | (not mentioned) | A/B test with rollback trigger if user satisfaction drops below 3.5/5 |
What's ROUGE-L? It's a standard way to measure how closely an AI summary matches a human-written reference summary. It works by finding the longest sequence of words that appears (in order) in both the AI summary and the human reference, then computing an F1 score from that. A score of 0.65 is a strong result; a score near 0 means the summaries share almost nothing. You don't need to memorize the formula — you just need to know it gives you a number instead of a feeling.
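If you're curious what that computation actually looks like, here's a sketch of LCS-based ROUGE-L F1 using plain whitespace tokenization. Real implementations, such as the `rouge-score` package, add stemming and smarter tokenization, so treat this as an illustration of the idea, not a production scorer.

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x == y
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    # F1 over the longest in-order word overlap between the two texts.
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the team will ship friday",
              "the team ships on friday"))  # ≈ 0.6
```

Here the longest shared in-order word sequence is "the team ... friday" (3 of 5 words on each side), which is why the score lands at roughly 0.6.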
And speech-to-noise ratio? That's just the ratio of clear speech to background noise. If someone's recording a meeting at a construction site, the audio quality is low, and the AI will produce bad summaries. Priya's spec said: "Don't silently produce bad output. Show a warning instead." That's a product decision, not an engineering decision — and it belongs in YOUR spec.
The results tell the story
Before the rewrite: Dani's team logged 47 Slack threads asking Priya to clarify what "good" meant.
After the rewrite: Zero follow-up questions. A two-week build cycle. A launch where Priya could point to a dashboard and say "ROUGE-L is 0.71 — we shipped."
The rewrite took four hours. The vague spec had already cost six weeks of uncertainty.
There Are No Dumb Questions
Q: What's a "golden example"?
A: It's a hand-curated input-output pair that represents what a correct response looks like. If you're building a meeting summarizer, a golden example would be: a specific meeting recording (input) paired with a human-written summary of that meeting (output). You use these to test whether the AI's output is close enough to the "gold standard."

Q: Do I need to come up with all these numbers myself?
A: Nope. Your engineer or data scientist will help you pick the right metric and the right threshold. YOUR job is to make sure a number EXISTS in the spec. If you write "accurate" without a number next to it, that's your signal to stop and ask: "What number would make us confident this is good enough?"

Q: What if I pick the wrong number?
A: That's totally fine. The point of the eval baseline is that you can adjust it. Pick a reasonable starting number, measure against it, and tune. A wrong number is infinitely better than no number — because at least you can have a conversation about it.
Rewrite the vague spec
The five elements of an AI feature spec
Every AI feature spec you write needs exactly five things. Think of them like the five fingers on your hand — if one is missing, your grip on the project is weak.
1. User story — Who wants what, and why?
"As a developer, I want the AI to draft a PR description from my diff so that I spend less time writing boilerplate."
2. Success metric with a specific number — Not "good." Not "fast." A number.
"Relevance score >= 0.80 as rated by a panel of 3 reviewers on a 50-example test set."
3. Eval criteria — what does PASS look like? — Write out the exact conditions.
"PASS = relevance >= 0.80 AND generated in under 3 seconds AND no hallucinated function names."
4. At least one edge case and how to handle it — What happens when things go weird?
"If the diff is larger than 2,000 lines, summarize only the first 2,000 and append: 'Note: large diff truncated.'"
5. A rollback trigger — What number tells you to pull the plug?
"If user satisfaction drops below 3.0/5 on a 7-day rolling average, auto-disable the feature and alert the PM."
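Element 3 maps almost line-for-line into code. Here's a sketch of that PASS condition as a boolean check; the dictionary keys are illustrative, not a real API, and a real harness would fill them in from a live eval run.

```python
# Element 3's PASS condition expressed as code. The field names are
# hypothetical; the eval harness would populate them per run.

def passes(result: dict) -> bool:
    return (result["relevance"] >= 0.80
            and result["latency_seconds"] < 3.0
            and result["hallucinated_function_names"] == 0)

print(passes({"relevance": 0.85,
              "latency_seconds": 2.1,
              "hallucinated_function_names": 0}))  # → True
```

Notice the AND: one slow response or one invented function name fails the run, exactly as the spec says.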
That rollback trigger isn't bureaucracy — it's insurance. It pre-commits everyone to the definition of failure before emotions run high. When the feature is live and numbers are dropping, you don't want to be arguing about whether it's "bad enough" to roll back. You want a number that everyone agreed to in advance.
Think of it like a circuit breaker in your house. You don't debate whether the wiring is overloaded while your kitchen is on fire. The breaker trips automatically at a preset threshold. Your rollback trigger works the same way.
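Here's a sketch of that circuit breaker, assuming a feed of daily satisfaction scores. The 3.0 threshold and 7-day window come straight from element 5; the data source and function name are placeholders.

```python
from collections import deque

WINDOW_DAYS = 7
THRESHOLD = 3.0  # from the spec: below 3.0/5 on a 7-day rolling average

def should_rollback(daily_satisfaction):
    # Only trip the breaker once a full window of data exists.
    window = deque(daily_satisfaction, maxlen=WINDOW_DAYS)
    if len(window) < WINDOW_DAYS:
        return False
    return sum(window) / len(window) < THRESHOLD

# Seven days hovering just under the line trips the breaker:
print(should_rollback([3.1, 2.9, 2.8, 3.0, 2.9, 2.7, 2.8]))  # → True
```

The design choice worth noticing: the threshold is a named constant taken from the spec, not a judgment call made while the dashboard is red.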
There Are No Dumb Questions
Q: What if my feature doesn't have an obvious metric?
A: Every feature can be measured. If you can't find a standard metric, use human ratings. Have 3 people rate outputs on a 1-5 scale. That's a metric. "Average human rating >= 3.8 on 100 samples" is specific enough for an engineer to build a test around.

Q: Do I really need an edge case in every spec?
A: Yes. AI systems fail on edge cases constantly — long inputs, empty inputs, non-English text, noisy audio. If you don't specify what happens, engineering will make the decision for you. And they'll make it based on what's easiest to code, not what's best for the user.

Q: Who decides the rollback threshold?
A: You do, but in collaboration with engineering and leadership. The PM proposes, the team debates, and everyone signs off before code is written. That's the whole point — it's a decision made calmly, not in a crisis.
Fill in the blanks
"Accurate" is not a metric — say it with me
This deserves its own section because it's the single most common mistake PMs make in AI specs.
Every time you catch yourself writing one of these words, stop and replace it with a number:
| Don't write this | Write this instead |
|---|---|
| "Accurate" | "ROUGE-L >= 0.65 on 200 golden examples" |
| "Fast" | "p95 latency <= 2,000 ms" |
| "Relevant" | "Relevance score >= 0.82 rated by 3 reviewers" |
| "High quality" | "Average human rating >= 4.0/5 on 100 test cases" |
| "Minimal hallucinations" | "Hallucination rate < 2% on fact-checkable claims" |
| "Good user experience" | "User satisfaction >= 3.5/5 on in-app survey" |
See the pattern? Every vague word becomes a metric name + a number + a dataset. That's the formula. Burn it into your brain.
Here's a trick that works every time: when you finish writing your spec, do a Ctrl+F for the words "accurate," "fast," "relevant," "good," and "quality." If any of them appear without a number next to them, your spec isn't done yet.
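That Ctrl+F habit is easy to automate. This sketch flags any sentence that uses a vague word but contains no digit at all; the word list and the sentence-level heuristic are assumptions for illustration, not a standard tool.

```python
import re

# Vague words that should never appear without a number nearby.
VAGUE_WORDS = {"accurate", "fast", "relevant", "good", "quality"}

def flag_vague_claims(spec_text: str) -> list:
    """Return sentences that use a vague word with no number in them."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", spec_text):
        words = {w.strip('.,;:"').lower() for w in sentence.split()}
        if words & VAGUE_WORDS and not re.search(r"\d", sentence):
            flagged.append(sentence.strip())
    return flagged

spec = ("The summary should be accurate. "
        "p95 latency must stay fast, under 2,000 ms.")
print(flag_vague_claims(spec))  # → ['The summary should be accurate.']
```

The second sentence passes because "fast" sits next to a number; the first gets flagged because "accurate" is standing alone, exactly the case the lesson warns about.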
Vague word detector
The big challenge: Write a real spec
You've learned the pattern. You've seen what vague looks like and what specific looks like. Now put it all together.
Challenge
Back to Priya and Dani
Two sprints later, Priya walked into the same meeting room, but with a different spec in her hand. Six pages. Role definition, success criteria, a golden evaluation set of 200 meeting recordings, a ROUGE-L threshold of 0.65, and a rollback trigger at 0.55.
Dani read it in ten minutes. "This is what we needed six weeks ago," she said.
"I know," said Priya. "We build the test first now."
The feature shipped. No follow-up Slack threads.
Key takeaways
- Build the test before the feature. Put eval criteria in your spec so engineering has a test harness to build BEFORE writing feature code. It's like agreeing on what "done" tastes like before you start cooking.
- "Accurate" is never a metric. Every time you write a vague word, replace it with a number: "ROUGE-L score of 0.65+ on our 200-example golden set" is a success metric; "accurate" is not.
- Include a rollback trigger in every spec. It shows you planned for when things go wrong, not just when they go right. It's your circuit breaker — it trips automatically so nobody has to argue during a crisis.
Knowledge Check
1. An engineer tells you 'the model needs more data to perform well on this slice.' What does that mean in practice, and what product decision does it force?
2. Which of the following is a complete AI acceptance criterion?
3. Your team runs an A/B test comparing two system prompts. The new prompt scores higher on relevance but lower on safety evals. How should you make the call?
4. What is output drift, how would you detect it, and who owns the response when it occurs?