Prompt Design for PMs
Write and version system prompts as product specs — with role, goal, format, and constraints that prevent failure modes before they reach users.
A two-sentence prompt is a two-sentence spec
Here's the system prompt for a real AI feature that went to production:
"You are a helpful assistant. Reply to customer emails."
That's it. The entire product spec for a feature that runs on every customer interaction. No role definition. No format rules. No constraints. No examples.
What happened? The bot wrote 500-word replies. Used the wrong tone. Occasionally recommended competitor products. The team spent three weeks debugging "the model" before someone looked at the prompt.
The lesson: A system prompt IS a product spec, written in plain English. A vague prompt causes the same downstream pain as a vague PRD — except a bad prompt reaches every user on every request until someone catches it.
The four ingredients (same as engineering — but you own the spec)
Engineers write the prompt. You write the requirements for the prompt. Same four ingredients, different ownership:
| Ingredient | Engineering owns | PM owns |
|---|---|---|
| Role | Implementation | Who should this AI be? What persona? |
| Goal | Implementation | What specific task should it accomplish? |
| Format | Implementation | How long? What structure? What tone? |
| Constraints | Implementation | What should it NEVER do? |
Think of it like a job posting. You wouldn't hire someone with just "be helpful and do work." You'd specify the role, the responsibilities, the standards, and the deal-breakers. A system prompt is a job posting for your AI.
PM insight: Constraints cause the most damage when missing. The model locks in its role, goal, and format first — and only then reaches the "what not to do" section. If that section is empty, the model will fill in the blanks with its own defaults. Those defaults are tuned for general helpfulness, not your product's specific needs.
There Are No Dumb Questions
"Should I write the system prompt myself?"
No — but you should write the requirements for it. Just like you don't write the code, but you write the acceptance criteria. "The model should never mention competitor products" is a PM requirement. The engineer translates it into the prompt. Then you eval it.
"How do I know if the prompt is good enough?"
You eval it. Score 50-200 outputs on the criteria you care about (accuracy, tone, length, safety). If it meets your bar, ship it. If not, iterate. Numbers, not vibes.
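"Numbers, not vibes" can be as lightweight as a short script over the reviewer scores. A minimal sketch, assuming reviewers have already rated each output 1-5 per criterion; the criteria names and the 4.0 ship bar are illustrative, not from any specific team's process:

```python
from statistics import mean

# Hypothetical reviewer scores: one dict per evaluated output, rated 1-5.
scores = [
    {"accuracy": 5, "tone": 4, "length": 5, "safety": 5},
    {"accuracy": 3, "tone": 2, "length": 4, "safety": 5},
    {"accuracy": 4, "tone": 4, "length": 3, "safety": 5},
]

SHIP_BAR = 4.0  # illustrative acceptance bar, per criterion

def eval_report(scores, bar):
    """Average each criterion across outputs and flag those below the bar."""
    criteria = scores[0].keys()
    averages = {c: mean(s[c] for s in scores) for c in criteria}
    failing = [c for c, avg in averages.items() if avg < bar]
    return averages, failing

averages, failing = eval_report(scores, SHIP_BAR)
print(averages)                # the numbers the "good enough" debate needs
print("Blocked on:", failing)  # criteria that miss the bar
```

With a real eval set of 50-200 outputs, the same report tells you exactly which criterion to iterate on next.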
Write the Acceptance Criteria
Real example: Sofia fixes Loom's summary feature
Sofia is a PM at Loom (the screen recording tool). Her team's "draft summary" feature generates text summaries of video recordings. The v1 prompt? "Summarise this video transcript."
She iterated through four versions, running an eval (3 reviewers, 200 recordings, usefulness rated 1-5) after each:
| Version | What changed | Score | What happened |
|---|---|---|---|
| V1 | "Summarise this video transcript." | 2.9/5 | Too long, included timestamps, described screen content verbatim |
| V2 | Added role + format: "You are a meeting notes assistant. Write a 3-bullet summary. Each bullet starts with an action verb." | 3.8/5 | Better structure, but still narrating the video |
| V3 | Added constraints: "Never mention timestamps. Focus on decisions and action items, not activities." | 4.5/5 | Much better — but tone still inconsistent |
| V4 | Added one example of a good summary | 4.7/5 | Nailed it |
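Put together, a V4-style prompt is just the four ingredients plus one example, concatenated. A reconstruction of what that could look like (the wording below is hypothetical, not Loom's actual prompt):

```python
# Hypothetical reconstruction of a V4-style system prompt:
# role + goal + format + constraints + one worked example.
SUMMARY_PROMPT_V4 = """\
You are a meeting notes assistant.

Goal: summarise the video transcript for someone who didn't watch it.

Format: write a 3-bullet summary. Each bullet starts with an action verb.

Constraints:
- Never mention timestamps.
- Focus on decisions and action items, not activities.

Example of a good summary:
- Decided to launch the beta to 10% of users next Tuesday.
- Assigned the rollout comms draft, due Friday.
- Agreed to revisit pricing once beta retention data is in.
"""
```

Because each ingredient is a separate labelled block, a PM can review (and write acceptance criteria for) them one at a time.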
Three things to notice:
- The biggest quality jumps came from constraints and examples, not from elaborate descriptions. "Never mention timestamps" did more work than 50 words of positive instruction.
- One example beat 100 words of instructions. V4 added a single example showing what a good summary looks like. That one example did more for quality than all three previous iterations combined.

  This technique — providing worked examples in the prompt — is called few-shot prompting. Zero examples = zero-shot; one = one-shot; two to five = few-shot. Few-shot examples are the fastest way to say "produce output exactly like this." The failure mode: if your examples share an accidental pattern (e.g., all are about software), the model may overgeneralise and apply that pattern to inputs where it doesn't fit.

- The whole iteration took 4 hours. The team's alternative plan (fine-tuning the model) was estimated at 6 weeks.
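In a chat API, few-shot examples are typically passed as paired user/assistant turns before the real input, rather than as prose inside the system prompt. A sketch of that message structure (the contents are invented for illustration, and the exact shape varies by provider):

```python
# Few-shot prompting: show the model worked examples before the real input.
# Two examples = "two-shot". Keep the examples varied, or the model may copy
# an accidental pattern they share (e.g. every example being about software).
messages = [
    {"role": "system", "content": "Write a 3-bullet summary. Never mention timestamps."},
    # Shot 1: example input and the output you want for it.
    {"role": "user", "content": "Transcript: ...planning discussion..."},
    {"role": "assistant", "content": "- Decided to ship the beta Tuesday.\n- ..."},
    # Shot 2: a second, deliberately different example.
    {"role": "user", "content": "Transcript: ...budget review..."},
    {"role": "assistant", "content": "- Approved the Q3 vendor budget.\n- ..."},
    # The real input always comes last.
    {"role": "user", "content": "Transcript: ...today's recording..."},
]

# Count the example turns (everything between system prompt and real input).
shots = sum(1 for m in messages[1:-1] if m["role"] == "user")
print(f"{shots}-shot prompt")  # prints "2-shot prompt"
```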
There Are No Dumb Questions
"What if the engineer pushes back and says the prompt is 'good enough'?"
Ask for the eval numbers. "Good enough" without data is a gut feeling. Show them a low-scoring output from the eval set and ask: "Would you ship this to a customer?" The conversation shifts from opinion to evidence.
Versioning: treat prompts like code, not drafts
This is the mistake that bites teams: an engineer edits the system prompt directly in production on a Friday afternoon. Nobody notices until Monday when quality drops.
The PM rule: system prompt changes follow the same process as code changes.
| Rule | Why |
|---|---|
| Version every prompt in source control | So you can roll back to the version that worked |
| Run an eval before AND after every change | So you know if quality went up or down |
| Never edit in production without testing | Same reason you don't push code without CI |
| Log which prompt version each user saw | So when a user reports a problem, you can trace it |
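Taken together, the rules in the table amount to a small prompt registry: prompts live in source control, each change gets a new version, and every request logs which version it used. A minimal sketch (the names, prompt text, and structure are illustrative, not a specific tool):

```python
import hashlib

# Versioned prompts, checked into source control like any other file.
# Changing ACTIVE_VERSION goes through code review + an eval, not a live edit.
PROMPTS = {
    "support-reply-v1": "You are a helpful assistant. Reply to customer emails.",
    "support-reply-v2": (
        "You are a support agent. Reply in at most three sentences. "
        "Never mention competitor products. Never upsell unprompted."
    ),
}
ACTIVE_VERSION = "support-reply-v2"

def handle_request(user_email: str) -> dict:
    prompt = PROMPTS[ACTIVE_VERSION]
    # Log the version (plus a hash of the exact text) with every request,
    # so a user-reported problem can be traced to the prompt they saw.
    log_entry = {
        "prompt_version": ACTIVE_VERSION,
        "prompt_sha": hashlib.sha256(prompt.encode()).hexdigest()[:12],
    }
    # ...call the model with `prompt` + user_email here...
    return log_entry

print(handle_request("Where is my order?"))
```

Logging the hash as well as the version name catches the Friday-afternoon failure mode: if someone edits the prompt text without bumping the version, the hash in the logs still changes.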
Prompt Spec Review
Back to that two-sentence system prompt. After three weeks of debugging "the model," the team rewrote the system prompt in a single afternoon. They added a role, a format constraint (max three sentences per reply), a constraint list (no competitor mentions, no unsolicited upsells), and one example of a good response. They ran an eval over 100 historical tickets. Quality went from consistently frustrating to consistently acceptable. The "model problem" turned out to be a prompt problem — which is almost always true.
Key takeaways
- A system prompt is a product spec in plain English. A vague prompt causes the same pain as a vague PRD — but it reaches every user on every request.
- You own the requirements, the engineer owns the implementation. Write acceptance criteria for role, goal, format, and constraints.
- Constraints do more work than positive instructions. "Never mention competitors" is clearer and more testable than "stay focused on our product."
- One example beats 100 words of instruction. If your prompt describes the format but doesn't show it, add an example.
- Treat prompt changes like code changes. Version, eval, test, log. Never edit production without measuring.
Knowledge Check
1. You're writing a system prompt for an AI customer support agent. Which five elements should every production system prompt include?
2. A colleague's prompt produces good results 70% of the time but wildly wrong results 30% of the time. Which techniques would you try first to improve consistency?
3. What is a few-shot prompt, and when can adding examples introduce new failure modes?
4. Your team swaps the underlying model from GPT-4o to Claude 3.5 Sonnet using the same prompt. Outputs change noticeably. Why, and what is your validation process before relaunching?