LLM Fundamentals for PMs
Learn what an LLM can and can't do reliably so you can decide which problems are worth putting in front of it.
The $300,000 mistake you're about to avoid
A fintech team shipped an AI financial-advice feature to 50,000 users. Nobody checked whether the model was actually reliable at giving financial advice. Spoiler: it wasn't. The model invented regulations, misread risk profiles, and gave confident-sounding advice that was flat-out wrong. The company spent three months in damage control — legal fees, user refunds, brand damage.
Who owned that decision? Not the engineers. The PM greenlit it.
As a PM, you decide which problems to put in front of an LLM. That means you own the reliability question. This module gives you a simple framework to answer it — before you ship, not after.
(Illustrative scenario. The pattern — teams shipping AI features without validating reliability, then facing costly remediation — is well-documented across AI product launches.)
What an LLM actually does (the 30-second version)
You don't need to know how the engine works to drive a car. But you do need to know what the car is good at and what it's terrible at.
Here's all you need to know: An LLM (Large Language Model) predicts the next word, over and over, until it has a complete response. It doesn't "know" things. It doesn't "think." It predicts what text is most likely to come next based on patterns it learned from reading billions of web pages.
This explains everything:
- Why it's great at summarising — summaries follow predictable patterns, and it's seen millions of them
- Why it makes stuff up — if the most likely next word leads to a false statement, it'll say it with full confidence
- Why it's inconsistent — ask the same question twice, get different answers, because each prediction has some randomness built in
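The predict-one-token-at-a-time loop can be sketched in a few lines. This is a toy, not a real model: the vocabulary and probabilities below are made up, and a real LLM scores roughly 100,000 candidate tokens at every step. But the loop shape, and the built-in randomness, are the same.

```python
import random

# Toy next-token distributions: made-up numbers, NOT a real model.
NEXT_TOKEN_PROBS = {
    "The":        [("cat", 0.5), ("dog", 0.4), ("report", 0.1)],
    "cat":        [("sat", 0.7), ("ran", 0.3)],
    "dog":        [("sat", 0.6), ("barked", 0.4)],
    "report":     [("summarises", 1.0)],
    "sat":        [("<end>", 1.0)],
    "ran":        [("<end>", 1.0)],
    "barked":     [("<end>", 1.0)],
    "summarises": [("<end>", 1.0)],
}

def generate(prompt: str) -> str:
    """Repeat 'predict the next token' until a stop token appears."""
    tokens = [prompt]
    while tokens[-1] != "<end>":
        words, probs = zip(*NEXT_TOKEN_PROBS[tokens[-1]])
        # Sampling (not always picking the top word) is why the same
        # prompt can produce different answers on different runs.
        tokens.append(random.choices(words, weights=probs)[0])
    return " ".join(tokens[:-1])

print(generate("The"))  # e.g. "The cat sat" or "The dog barked"
```

Run it a few times and you'll see the inconsistency from the bullet list above: same prompt, different output, because each step samples rather than looks anything up.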
There Are No Dumb Questions
"Do I really need to understand how an LLM works to be a good PM?"
You don't need to understand transformer architecture. But you DO need to understand what makes some tasks reliable and others risky — because that's how you decide what to build. This module gives you exactly that.
The PM's decision framework: the reliability quadrant
You've used 2×2 priority matrices for roadmap planning — impact vs. effort. This is the same idea, but the axes are value and reliability.
Here's how to read each quadrant:
| Quadrant | Value | Reliability | PM action |
|---|---|---|---|
| Q1: Build now | High | High | Ship it. Text summarisation, code completion — the model does these well. |
| Q2: Augment with review | Lower | High | Build if resources allow. Meeting notes, brainstorming — useful but not critical. |
| Q3: Don't bother | Low | Low | Skip. Low value AND unreliable? Not worth your team's time. |
| Q4: Needs safeguards | High | Low | Build with a human review gate. Medical diagnosis, financial advice — valuable but the model makes dangerous mistakes. |
The key insight: Q4 is where PMs get burned. The value is tempting. The CEO is excited. But shipping without safeguards is the $300,000 mistake at the top of this page.
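The table above is really just a lookup: two ratings in, one PM action out. A throwaway sketch (the "high"/"low" labels come straight from the table; the "unknown" branch anticipates the eval rule covered later in this module):

```python
def quadrant(value: str, reliability: str) -> str:
    """Map a feature's value/reliability ratings to a PM action."""
    if reliability == "unknown":
        # Can't place a feature on the quadrant without reliability data.
        return "Commission an eval before placing this feature"
    return {
        ("high", "high"): "Q1: Build now",
        ("low",  "high"): "Q2: Augment with review",
        ("low",  "low"):  "Q3: Don't bother",
        ("high", "low"):  "Q4: Needs safeguards (human review gate)",
    }[(value, reliability)]

print(quadrant("high", "low"))  # Q4: Needs safeguards (human review gate)
```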
A real PM uses the quadrant: Daniela's roadmap planning
Daniela is a PM at a Series B fintech. It's third-quarter planning: four AI features, one engineering team, and a CEO who wants "AI" in the release notes.
Without a framework, she'd default to whichever feature her eng lead finds most interesting. Instead, she maps each feature on the quadrant.
Feature 1: Auto-categorise transactions
Value: High — customers have been requesting this for months. Reliability: High — classification is a well-understood task for LLMs. Quadrant: Q1 — Build now.
Result: Ships six weeks later. 94% accuracy in production. Customers love it.
Feature 2: Personalised financial advice
Value: High — would be a major differentiator. Reliability: Low — LLMs invent regulations, misread risk profiles, and carry no accountability. Quadrant: Q4 — Needs safeguards.
PM decision: Daniela doesn't block it. She doesn't ship it without review either. She adds a mandatory gate: a licensed financial advisor approves every AI-generated recommendation before it reaches the customer. Adds two days to response time. Prevents the regulatory disaster that sank a competitor that year.
Feature 3: Meeting notes summariser (internal)
Value: Moderate — saves time but not customer-facing. Reliability: High — summarisation is a strong suit for LLMs. Quadrant: Q2 — Augment with review.
PM decision: Schedules it for a six-week build. Low stakes means low urgency.
Feature 4: Customer churn prediction
Value: High — could save millions in retention. Reliability: Unknown — no eval suite exists yet. Quadrant: ??? — Can't place it until reliability is measured.
PM decision: Blocks the roadmap item until the data team runs an accuracy study. Two months later, the study returns 71% precision — good enough to ship with a confidence threshold, which she writes into the spec.
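A confidence threshold like the one in Daniela's spec amounts to: only act on predictions the model is sure about, and route everything else to the existing process. A hypothetical sketch (the 0.8 cutoff and the customer data are made up for illustration):

```python
def actionable(predictions, threshold=0.8):
    """Keep only predictions above a confidence cutoff.
    Low-confidence cases fall through to the existing (human) process."""
    return [p for p in predictions if p["confidence"] >= threshold]

preds = [
    {"customer": "A", "churn": True,  "confidence": 0.93},
    {"customer": "B", "churn": True,  "confidence": 0.55},  # too uncertain: skip
    {"customer": "C", "churn": False, "confidence": 0.88},
]
print([p["customer"] for p in actionable(preds)])  # ['A', 'C']
```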
There Are No Dumb Questions
"What if reliability is unknown? How do I place it on the quadrant?"
You don't. That's the whole point. If you can't answer "how reliable is this for our specific use case?", you don't have enough information to make a product decision. Your action item: commission an eval (a set of test cases that measure accuracy). Place the feature on the quadrant only after you have data.
"Can a feature move between quadrants?"
Absolutely. Reliability improves with better prompts, RAG (retrieval-augmented generation — a system that fetches real documents and hands them to the model), and proper evals. A feature in Q4 today might move to Q1 in six months. That's why you revisit deferred features on a cadence, not close them permanently.
Five terms you'll hear in every AI meeting
You don't need deep technical knowledge, but you need to know these five terms well enough to ask smart questions and spot bad ideas.
1. Tokens
The pieces an LLM reads and writes. Not words — smaller than words. "Hello, world!" = 4 tokens. Costs are priced per token, not per word. Why you care: Token counts determine cost and speed. More tokens = more expensive and slower.
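A back-of-envelope cost check takes a few lines. The prices below are placeholders, not real rates (check your provider's current pricing), and the ~1.3 tokens-per-word ratio is a common rule of thumb for English text:

```python
def estimate_cost(words_in: int, words_out: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Rough per-call cost. English averages ~1.3 tokens per word."""
    tokens_in = words_in * 1.3
    tokens_out = words_out * 1.3
    return (tokens_in / 1000) * price_in_per_1k + (tokens_out / 1000) * price_out_per_1k

# Placeholder prices, NOT real rates: $0.01 / 1k input, $0.03 / 1k output.
cost = estimate_cost(words_in=2000, words_out=500,
                     price_in_per_1k=0.01, price_out_per_1k=0.03)
print(f"${cost:.4f} per call")  # $0.0455 per call
```

Multiply by expected call volume and you have a first-pass budget line before any engineering work starts.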
2. Context window
The model's short-term memory — the maximum tokens it can process in a single call. If your input + output exceeds the window, the call fails. Why you care: This limits how much data you can send in one go. A 128k-token window holds roughly 90,000–100,000 words.
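The "input + output must fit" constraint is easy to sanity-check before you commit to a feature design. A sketch using the same rough ~1.3 tokens-per-word estimate (real token counts come from the provider's tokenizer, not from word counts):

```python
def fits_context(words_in: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Rough check: does the prompt plus the reserved output budget fit?"""
    estimated_input_tokens = int(words_in * 1.3)
    return estimated_input_tokens + max_output_tokens <= context_window

print(fits_context(words_in=90_000, max_output_tokens=4_000))   # True
print(fits_context(words_in=150_000, max_output_tokens=4_000))  # False
```

If the answer is False, the feature needs chunking, summarise-then-summarise, or retrieval, which is a design decision, not an implementation detail.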
3. Temperature
A dial from 0 to 1+ that controls randomness. Low temperature = predictable, repetitive responses. High temperature = creative, unpredictable responses. Why you care: If users complain about inconsistent answers, temperature might be too high. If they complain about robotic answers, it might be too low.
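Under the hood, temperature rescales the model's scores before sampling. A toy demo (made-up scores for three candidate tokens, no real model) showing how low temperature sharpens the distribution toward the top choice and high temperature flattens it:

```python
import math

def apply_temperature(scores, temperature):
    """Softmax over scores / temperature.
    Low T -> top token dominates (predictable output);
    high T -> probabilities flatten (more surprising output)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                       # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up model scores for three candidate tokens
print(apply_temperature(scores, 0.2))  # top token gets ~99% of the probability
print(apply_temperature(scores, 1.5))  # probability spreads across all three
```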
4. Hallucination (confabulation)
When the model generates confident-sounding text that's factually wrong. It doesn't "know" it's wrong — it just predicted the most likely next tokens, and they happened to be false. Why you care: This is the #1 risk for high-stakes features. If your AI feature involves facts (legal, medical, financial), you need a plan for hallucination.
5. RAG (Retrieval-Augmented Generation)
A system that fetches relevant real documents and hands them to the model before it generates a response. Instead of relying on what the model "remembers" from training, you give it actual source material. Why you care: RAG is often the difference between a Q4 feature (unreliable) and a Q1 feature (reliable). Ask your engineers: "Are we using RAG, and what documents are we retrieving?"
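A minimal sketch of the RAG flow: retrieve, then generate. Everything here is a stand-in: the three documents, the keyword-overlap retriever (real systems use embeddings and vector search), and the final string is just the prompt that would be sent to the model:

```python
# Stand-in document store. In production this is a search index or vector DB.
DOCS = [
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Shipping: standard delivery takes 3-5 business days within the EU.",
    "Security: all data is encrypted at rest and in transit.",
]

def retrieve(question: str, docs=DOCS, k: int = 1):
    """Toy retriever: rank documents by word overlap with the question.
    Real systems use embeddings + vector similarity instead."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Hand the retrieved source material to the model alongside the question."""
    context = "\n".join(retrieve(question))
    return (f"Answer using ONLY the sources below. If they don't contain "
            f"the answer, say so.\n\nSources:\n{context}\n\nQuestion: {question}")

print(build_prompt("How many days do customers have to request a refund?"))
```

The "answer using ONLY the sources" instruction is the reliability lever: the model grounds its response in retrieved text instead of whatever it half-remembers from training.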
RAG vs. fine-tuning: You'll also hear "fine-tuning" proposed as an alternative — training the model further on your data so the knowledge becomes baked into its weights. Fine-tuning changes how the model behaves permanently; RAG gives it access to documents at the moment of each query. Fine-tuning costs vary enormously by approach: open-source fine-tuning with LoRA/PEFT methods can cost $500–$5K; fine-tuning frontier models through provider APIs varies widely — check provider pricing pages for current rates, as costs have dropped significantly; full custom enterprise ML engagements can still run $50K–$500K+. The process typically takes days to weeks. RAG can be built in one sprint and keeps knowledge updatable. For most product decisions, start with RAG. Revisit fine-tuning if RAG hits its ceiling.
The PM's checklist: before you greenlight any AI feature
Before you say "yes" to an AI feature, ask these five questions:
- Where does this sit on the quadrant? High value + low reliability = mandatory human review gate.
- What happens when it's wrong? If a wrong answer causes financial, legal, or health harm → Q4 safeguards, no exceptions.
- Do we have an eval? If you can't measure reliability, you can't make a product decision. Commission the eval first.
- Are we using the right model tier? Not every task needs the most expensive model. Ask your engineer about model routing.
- Can reliability improve? If yes, schedule a re-evaluation in 3-6 months. Don't permanently close Q4 features — they might graduate to Q1.
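The five questions above can double as a literal pre-launch gate. A sketch with illustrative field names (this is one possible encoding of the checklist, not a standard):

```python
def greenlight_blockers(feature: dict) -> list[str]:
    """Return the reasons a feature can't ship yet; empty list = clear to go."""
    blockers = []
    if feature.get("eval_done") is not True:
        blockers.append("No eval: commission one before deciding")
    if feature.get("wrong_answer_harm") in {"financial", "legal", "health"} \
            and not feature.get("human_review_gate"):
        blockers.append("High-stakes domain without a human review gate")
    if feature.get("reliability") == "low" and not feature.get("human_review_gate"):
        blockers.append("Low reliability without safeguards (Q4)")
    return blockers

advice = {"eval_done": True, "wrong_answer_harm": "financial",
          "reliability": "low", "human_review_gate": False}
print(greenlight_blockers(advice))  # two blockers: review gate missing
```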
Back to that fintech team. Three months of damage control and $300,000 later, they added a mandatory human review gate before any AI-generated recommendation reached a customer. The product that could have launched right the first time spent six months being un-broken. The reliability quadrant would have put this feature in Q4 on day one — and saved the whole disaster.
Key takeaways
- You own the reliability decision. The PM decides which problems to put in front of an LLM — and which to keep away from it.
- Use the quadrant as a go/no-go gate. High value + low reliability = human-in-the-loop before shipping. No exceptions.
- Same model, different reliability. A model that's reliable at meeting summaries may be unreliable at financial advice. Check reliability for each specific domain.
- Unknown reliability = stop. If you can't measure it, you can't ship it. Commission an eval first.
- Q4 features can graduate. Revisit them on a cadence — better prompts, RAG, and evals can move features up the reliability axis.
Knowledge Check
1. A foundation model is described as having a 128k-token context window. What does this mean for a document summarization feature you're planning?
2. What is the difference between a fine-tuned model and a RAG (retrieval-augmented generation) system? When would you choose one over the other?
3. Your engineer says the model's "temperature is set too high." What user-facing behavior would you expect to see, and why might that be a problem?
4. A PM is evaluating an AI feature that generates personalised investment recommendations. Business value is high. The engineering team says reliability in this domain is low — the model invents regulations and misreads risk profiles. According to the reliability quadrant in this module, what should the PM do?