Prompt Design for PMs
Write and version system prompts as product specs — with role, goal, format, and constraints that prevent failure modes before they reach users.
A two-sentence prompt is a two-sentence spec
Here's the system prompt for a real AI feature that went to production:
"You are a helpful assistant. Reply to customer emails."
That's it. The entire product spec for a feature that runs on every customer interaction. No role definition. No format rules. No constraints. No examples.
What happened? The bot wrote 500-word replies. Used the wrong tone. Occasionally recommended competitor products. The team spent three weeks debugging "the model" before someone looked at the prompt.
The lesson: A system prompt IS a product spec, written in plain English. A vague prompt causes the same downstream pain as a vague PRD — except a bad prompt reaches every user on every request until someone catches it.
The four ingredients (same as engineering — but you own the spec)
Engineers write the prompt. You write the requirements for the prompt. Same four ingredients, different ownership:
| Ingredient | Engineering owns | PM owns |
|---|---|---|
| Role | Implementation | Who should this AI be? What persona? |
| Goal | Implementation | What specific task should it accomplish? |
| Format | Implementation | How long? What structure? What tone? |
| Constraints | Implementation | What should it NEVER do? |
Think of it like a job posting. You wouldn't hire someone with just "be helpful and do work." You'd specify the role, the responsibilities, the standards, and the deal-breakers. A system prompt is a job posting for your AI.
PM insight: Constraints cause the most damage when missing. The model locks in its role, goal, and format first — and only then reaches the "what not to do" section. If that section is empty, the model will fill in the blanks with its own defaults. Those defaults are tuned for general helpfulness, not your product's specific needs.
There Are No Dumb Questions
"Should I write the system prompt myself?"
No — but you should write the requirements for it. Just like you don't write the code, but you write the acceptance criteria. "The model should never mention competitor products" is a PM requirement. The engineer translates it into the prompt. Then you eval it.
"How do I know if the prompt is good enough?"
You eval it. Score 50-200 outputs on the criteria you care about (accuracy, tone, length, safety). If it meets your bar, ship it. If not, iterate. Numbers, not vibes.
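"Numbers, not vibes" can be as lightweight as a short script over the reviewer scores. A minimal sketch, assuming reviewers have already rated each output 1-5 per criterion; the criteria names and the 4.0 ship bar are illustrative, not from any specific team's process:

```python
from statistics import mean

# Hypothetical reviewer scores: one dict per evaluated output, rated 1-5.
scores = [
    {"accuracy": 5, "tone": 4, "length": 5, "safety": 5},
    {"accuracy": 3, "tone": 2, "length": 4, "safety": 5},
    {"accuracy": 4, "tone": 4, "length": 3, "safety": 5},
]

SHIP_BAR = 4.0  # illustrative acceptance bar, per criterion

def eval_report(scores, bar):
    """Average each criterion across outputs and flag those below the bar."""
    criteria = scores[0].keys()
    averages = {c: mean(s[c] for s in scores) for c in criteria}
    failing = [c for c, avg in averages.items() if avg < bar]
    return averages, failing

averages, failing = eval_report(scores, SHIP_BAR)
print(averages)                # the numbers the "good enough" debate needs
print("Blocked on:", failing)  # criteria that miss the bar
```

With a real eval set of 50-200 outputs, the same report tells you exactly which criterion to iterate on next.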
Write the Acceptance Criteria
Real example: Sofia fixes Loom's summary feature
Sofia is a PM at Loom (the screen recording tool). Her team's "draft summary" feature generates text summaries of video recordings. The v1 prompt? "Summarise this video transcript."
She iterated through four versions, running an eval (3 reviewers, 200 recordings, usefulness rated 1-5) after each:
| Version | What changed | Score | What happened |
|---|---|---|---|
| V1 | "Summarise this video transcript." | 2.9/5 | Too long, included timestamps, described screen content verbatim |
| V2 | Added role + format: "You are a meeting notes assistant. Write a 3-bullet summary. Each bullet starts with an action verb." | 3.8/5 | Better structure, but still narrating the video |
| V3 | Added constraints: "Never mention timestamps. Focus on decisions and action items, not activities." | 4.5/5 | Much better — but tone still inconsistent |
| V4 | Added one example of a good summary | 4.7/5 | Nailed it |
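Put together, a V4-style prompt is just the four ingredients plus one example, concatenated. A reconstruction of what that could look like (the wording below is hypothetical, not Loom's actual prompt):

```python
# Hypothetical reconstruction of a V4-style system prompt:
# role + goal + format + constraints + one worked example.
SUMMARY_PROMPT_V4 = """\
You are a meeting notes assistant.

Goal: summarise the video transcript for someone who didn't watch it.

Format: write a 3-bullet summary. Each bullet starts with an action verb.

Constraints:
- Never mention timestamps.
- Focus on decisions and action items, not activities.

Example of a good summary:
- Decided to launch the beta to 10% of users next Tuesday.
- Assigned the rollout comms draft, due Friday.
- Agreed to revisit pricing once beta retention data is in.
"""
```

Because each ingredient is a separate labelled block, a PM can review (and write acceptance criteria for) them one at a time.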
Three things to notice:
- The biggest quality jumps came from constraints and examples, not from elaborate descriptions. "Never mention timestamps" did more work than 50 words of positive instruction.
- One example beat 100 words of instructions. V4 added a single example showing what a good summary looks like. That one example did more for quality than all three previous iterations combined.

  This technique — providing worked examples in the prompt — is called few-shot prompting. Zero examples = zero-shot; one = one-shot; two to five = few-shot. Few-shot examples are the fastest way to say "produce output exactly like this." The failure mode: if your examples share an accidental pattern (e.g., all are about software), the model may overgeneralise and apply that pattern to inputs where it doesn't fit.

- The whole iteration took 4 hours. The team's alternative plan (fine-tuning the model) was estimated at 6 weeks.
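In a chat API, few-shot examples are typically passed as paired user/assistant turns before the real input, rather than as prose inside the system prompt. A sketch of that message structure (the contents are invented for illustration, and the exact shape varies by provider):

```python
# Few-shot prompting: show the model worked examples before the real input.
# Two examples = "two-shot". Keep the examples varied, or the model may copy
# an accidental pattern they share (e.g. every example being about software).
messages = [
    {"role": "system", "content": "Write a 3-bullet summary. Never mention timestamps."},
    # Shot 1: example input and the output you want for it.
    {"role": "user", "content": "Transcript: ...planning discussion..."},
    {"role": "assistant", "content": "- Decided to ship the beta Tuesday.\n- ..."},
    # Shot 2: a second, deliberately different example.
    {"role": "user", "content": "Transcript: ...budget review..."},
    {"role": "assistant", "content": "- Approved the Q3 vendor budget.\n- ..."},
    # The real input always comes last.
    {"role": "user", "content": "Transcript: ...today's recording..."},
]

# Count the example turns (everything between system prompt and real input).
shots = sum(1 for m in messages[1:-1] if m["role"] == "user")
print(f"{shots}-shot prompt")  # prints "2-shot prompt"
```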
There Are No Dumb Questions
"What if the engineer pushes back and says the prompt is 'good enough'?"
Ask for the eval numbers. "Good enough" without data is a gut feeling. Show them a low-scoring output from the eval set and ask: "Would you ship this to a customer?" The conversation shifts from opinion to evidence.
Versioning: treat prompts like code, not drafts
This is the mistake that bites teams: an engineer edits the system prompt directly in production on a Friday afternoon. Nobody notices until Monday when quality drops.
The PM rule: system prompt changes follow the same process as code changes.
| Rule | Why |
|---|---|
| Version every prompt in source control | So you can roll back to the version that worked |
| Run an eval before AND after every change | So you know if quality went up or down |
| Never edit in production without testing | Same reason you don't push code without CI |
| Log which prompt version each user saw | So when a user reports a problem, you can trace it |
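Taken together, the rules in the table amount to a small prompt registry: prompts live in source control, each change gets a new version, and every request logs which version it used. A minimal sketch (the names, prompt text, and structure are illustrative, not a specific tool):

```python
import hashlib

# Versioned prompts, checked into source control like any other file.
# Changing ACTIVE_VERSION goes through code review + an eval, not a live edit.
PROMPTS = {
    "support-reply-v1": "You are a helpful assistant. Reply to customer emails.",
    "support-reply-v2": (
        "You are a support agent. Reply in at most three sentences. "
        "Never mention competitor products. Never upsell unprompted."
    ),
}
ACTIVE_VERSION = "support-reply-v2"

def handle_request(user_email: str) -> dict:
    prompt = PROMPTS[ACTIVE_VERSION]
    # Log the version (plus a hash of the exact text) with every request,
    # so a user-reported problem can be traced to the prompt they saw.
    log_entry = {
        "prompt_version": ACTIVE_VERSION,
        "prompt_sha": hashlib.sha256(prompt.encode()).hexdigest()[:12],
    }
    # ...call the model with `prompt` + user_email here...
    return log_entry

print(handle_request("Where is my order?"))
```

Logging the hash as well as the version name catches the Friday-afternoon failure mode: if someone edits the prompt text without bumping the version, the hash in the logs still changes.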
Prompt Spec Review
Back to that two-sentence system prompt. After three weeks of debugging "the model," the team rewrote the system prompt in a single afternoon. They added a role, a format constraint (max three sentences per reply), a constraint list (no competitor mentions, no unsolicited upsells), and one example of a good response. They ran an eval over 100 historical tickets. Quality went from consistently frustrating to consistently acceptable. The "model problem" turned out to be a prompt problem — which is almost always true.
Key takeaways
- A system prompt is a product spec in plain English. A vague prompt causes the same pain as a vague PRD — but it reaches every user on every request.
- You own the requirements, the engineer owns the implementation. Write acceptance criteria for role, goal, format, and constraints.
- Constraints do more work than positive instructions. "Never mention competitors" is clearer and more testable than "stay focused on our product."
- One example beats 100 words of instruction. If your prompt describes the format but doesn't show it, add an example.
- Treat prompt changes like code changes. Version, eval, test, log. Never edit production without measuring.
Knowledge Check
1. You're writing a system prompt for an AI customer support agent. Which five elements should every production system prompt include?
2. A colleague's prompt produces good results 70% of the time but wildly wrong results 30% of the time. Which techniques would you try first to improve consistency?
3. What is a few-shot prompt, and when can adding examples introduce new failure modes?
4. Your team swaps the underlying model from GPT-4o to Claude 3.5 Sonnet using the same prompt. Outputs change noticeably. Why, and what is your validation process before relaunching?