LLM Fundamentals for PMs
Learn what an LLM can and can't do reliably so you can decide which problems are worth putting in front of it.
The $300,000 mistake you're about to avoid
A fintech team shipped an AI financial-advice feature to 50,000 users. Nobody checked whether the model was actually reliable at giving financial advice. Spoiler: it wasn't. The model invented regulations, misread risk profiles, and gave confident-sounding advice that was flat-out wrong. The company spent three months in damage control — legal fees, user refunds, brand damage.
Who owned that decision? Not the engineers. The PM greenlit it.
As a PM, you decide which problems to put in front of an LLM. That means you own the reliability question. This module gives you a simple framework to answer it — before you ship, not after.
(Illustrative scenario. The pattern — teams shipping AI features without validating reliability, then facing costly remediation — is well-documented across AI product launches.)
What an LLM actually does (the 30-second version)
You don't need to know how the engine works to drive a car. But you do need to know what the car is good at and what it's terrible at.
Here's all you need to know: An LLM (Large Language Model) predicts the next word, over and over, until it has a complete response. It doesn't "know" things. It doesn't "think." It predicts what text is most likely to come next based on patterns it learned from reading billions of web pages.
This explains everything:
- Why it's great at summarising — summaries follow predictable patterns, and it's seen millions of them
- Why it makes stuff up — if the most likely next word leads to a false statement, it'll say it with full confidence
- Why it's inconsistent — ask the same question twice, get different answers, because each prediction has some randomness built in
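The predict-one-token-at-a-time loop can be sketched in a few lines. This is a toy, not a real model: the vocabulary and probabilities below are made up, and a real LLM scores roughly 100,000 candidate tokens at every step. But the loop shape, and the built-in randomness, are the same.

```python
import random

# Toy next-token distributions: made-up numbers, NOT a real model.
NEXT_TOKEN_PROBS = {
    "The":        [("cat", 0.5), ("dog", 0.4), ("report", 0.1)],
    "cat":        [("sat", 0.7), ("ran", 0.3)],
    "dog":        [("sat", 0.6), ("barked", 0.4)],
    "report":     [("summarises", 1.0)],
    "sat":        [("<end>", 1.0)],
    "ran":        [("<end>", 1.0)],
    "barked":     [("<end>", 1.0)],
    "summarises": [("<end>", 1.0)],
}

def generate(prompt: str) -> str:
    """Repeat 'predict the next token' until a stop token appears."""
    tokens = [prompt]
    while tokens[-1] != "<end>":
        words, probs = zip(*NEXT_TOKEN_PROBS[tokens[-1]])
        # Sampling (not always picking the top word) is why the same
        # prompt can produce different answers on different runs.
        tokens.append(random.choices(words, weights=probs)[0])
    return " ".join(tokens[:-1])

print(generate("The"))  # e.g. "The cat sat" or "The dog barked"
```

Run it a few times and you'll see the inconsistency from the bullet list above: same prompt, different output, because each step samples rather than looks anything up.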
There Are No Dumb Questions
"Do I really need to understand how an LLM works to be a good PM?"
You don't need to understand transformer architecture. But you DO need to understand what makes some tasks reliable and others risky — because that's how you decide what to build. This module gives you exactly that.
The PM's decision framework: the reliability quadrant
You've used 2×2 priority matrices for roadmap planning — impact vs. effort. This is the same idea, but the axes are value and reliability.
Here's how to read each quadrant:
| Quadrant | Value | Reliability | PM action |
|---|---|---|---|
| Q1: Build now | High | High | Ship it. Text summarisation, code completion — the model does these well. |
| Q2: Augment with review | Lower | High | Build if resources allow. Meeting notes, brainstorming — useful but not critical. |
| Q3: Don't bother | Low | Low | Skip. Low value AND unreliable? Not worth your team's time. |
| Q4: Needs safeguards | High | Low | Build with a human review gate. Medical diagnosis, financial advice — valuable but the model makes dangerous mistakes. |
The key insight: Q4 is where PMs get burned. The value is tempting. The CEO is excited. But shipping without safeguards is the $300,000 mistake at the top of this page.
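The table above is really just a lookup: two ratings in, one PM action out. A throwaway sketch (the "high"/"low" labels come straight from the table; the "unknown" branch anticipates the eval rule covered later in this module):

```python
def quadrant(value: str, reliability: str) -> str:
    """Map a feature's value/reliability ratings to a PM action."""
    if reliability == "unknown":
        # Can't place a feature on the quadrant without reliability data.
        return "Commission an eval before placing this feature"
    return {
        ("high", "high"): "Q1: Build now",
        ("low",  "high"): "Q2: Augment with review",
        ("low",  "low"):  "Q3: Don't bother",
        ("high", "low"):  "Q4: Needs safeguards (human review gate)",
    }[(value, reliability)]

print(quadrant("high", "low"))  # Q4: Needs safeguards (human review gate)
```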
A real PM uses the quadrant: Daniela's roadmap planning
Daniela is a PM at a Series B fintech. It's third-quarter planning: four AI features, one engineering team, and a CEO who wants "AI" in the release notes.
Without a framework, she'd default to whichever feature her eng lead finds most interesting. Instead, she maps each feature on the quadrant.
Feature 1: Auto-categorise transactions
Value: High — customers have been requesting this for months. Reliability: High — classification is a well-understood task for LLMs. Quadrant: Q1 — Build now.
Result: Ships six weeks later. 94% accuracy in production. Customers love it.
Feature 2: Personalised financial advice
Value: High — would be a major differentiator. Reliability: Low — LLMs invent regulations, misread risk profiles, and carry no accountability. Quadrant: Q4 — Needs safeguards.
PM decision: Daniela doesn't block it. She doesn't ship it without review either. She adds a mandatory gate: a licensed financial advisor approves every AI-generated recommendation before it reaches the customer. Adds two days to response time. Prevents the regulatory disaster that sank a competitor that year.
Feature 3: Meeting notes summariser (internal)
Value: Moderate — saves time but not customer-facing. Reliability: High — summarisation is a strong suit for LLMs. Quadrant: Q2 — Augment with review.
PM decision: Schedules it for a six-week build. Low stakes means low urgency.
Feature 4: Customer churn prediction
Value: High — could save millions in retention. Reliability: Unknown — no eval suite exists yet. Quadrant: ??? — Can't place it until reliability is measured.
PM decision: Blocks the roadmap item until the data team runs an accuracy study. Two months later, the study returns 71% precision — good enough to ship with a confidence threshold, which she writes into the spec.
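A confidence threshold like the one in Daniela's spec amounts to: only act on predictions the model is sure about, and route everything else to the existing process. A hypothetical sketch (the 0.8 cutoff and the customer data are made up for illustration):

```python
def actionable(predictions, threshold=0.8):
    """Keep only predictions above a confidence cutoff.
    Low-confidence cases fall through to the existing (human) process."""
    return [p for p in predictions if p["confidence"] >= threshold]

preds = [
    {"customer": "A", "churn": True,  "confidence": 0.93},
    {"customer": "B", "churn": True,  "confidence": 0.55},  # too uncertain: skip
    {"customer": "C", "churn": False, "confidence": 0.88},
]
print([p["customer"] for p in actionable(preds)])  # ['A', 'C']
```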
There Are No Dumb Questions
"What if reliability is unknown? How do I place it on the quadrant?"
You don't. That's the whole point. If you can't answer "how reliable is this for our specific use case?", you don't have enough information to make a product decision. Your action item: commission an eval (a set of test cases that measure accuracy). Place the feature on the quadrant only after you have data.
"Can a feature move between quadrants?"
Absolutely. Reliability improves with better prompts, RAG (retrieval-augmented generation — a system that fetches real documents and hands them to the model), and proper evals. A feature in Q4 today might move to Q1 in six months. That's why you revisit deferred features on a cadence, not close them permanently.
Five terms you'll hear in every AI meeting
You don't need deep technical knowledge, but you need to know these five terms well enough to ask smart questions and spot bad ideas.
1. Tokens
The pieces an LLM reads and writes. Not words — smaller than words. "Hello, world!" = 4 tokens. Costs are priced per token, not per word. Why you care: Token counts determine cost and speed. More tokens = more expensive and slower.
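A back-of-envelope cost check takes a few lines. The prices below are placeholders, not real rates (check your provider's current pricing), and the ~1.3 tokens-per-word ratio is a common rule of thumb for English text:

```python
def estimate_cost(words_in: int, words_out: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Rough per-call cost. English averages ~1.3 tokens per word."""
    tokens_in = words_in * 1.3
    tokens_out = words_out * 1.3
    return (tokens_in / 1000) * price_in_per_1k + (tokens_out / 1000) * price_out_per_1k

# Placeholder prices, NOT real rates: $0.01 / 1k input, $0.03 / 1k output.
cost = estimate_cost(words_in=2000, words_out=500,
                     price_in_per_1k=0.01, price_out_per_1k=0.03)
print(f"${cost:.4f} per call")  # $0.0455 per call
```

Multiply by expected call volume and you have a first-pass budget line before any engineering work starts.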
2. Context window
The model's short-term memory — the maximum tokens it can process in a single call. If your input + output exceeds the window, the call fails. Why you care: This limits how much data you can send in one go. A 128k-token window holds roughly 90,000–100,000 words.
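The "input + output must fit" constraint is easy to sanity-check before you commit to a feature design. A sketch using the same rough ~1.3 tokens-per-word estimate (real token counts come from the provider's tokenizer, not from word counts):

```python
def fits_context(words_in: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Rough check: does the prompt plus the reserved output budget fit?"""
    estimated_input_tokens = int(words_in * 1.3)
    return estimated_input_tokens + max_output_tokens <= context_window

print(fits_context(words_in=90_000, max_output_tokens=4_000))   # True
print(fits_context(words_in=150_000, max_output_tokens=4_000))  # False
```

If the answer is False, the feature needs chunking, summarise-then-summarise, or retrieval, which is a design decision, not an implementation detail.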
3. Temperature
A dial from 0 to 1+ that controls randomness. Low temperature = predictable, repetitive responses. High temperature = creative, unpredictable responses. Why you care: If users complain about inconsistent answers, temperature might be too high. If they complain about robotic answers, it might be too low.
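Under the hood, temperature rescales the model's scores before sampling. A toy demo (made-up scores for three candidate tokens, no real model) showing how low temperature sharpens the distribution toward the top choice and high temperature flattens it:

```python
import math

def apply_temperature(scores, temperature):
    """Softmax over scores / temperature.
    Low T -> top token dominates (predictable output);
    high T -> probabilities flatten (more surprising output)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                       # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up model scores for three candidate tokens
print(apply_temperature(scores, 0.2))  # top token gets ~99% of the probability
print(apply_temperature(scores, 1.5))  # probability spreads across all three
```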
4. Hallucination (confabulation)
When the model generates confident-sounding text that's factually wrong. It doesn't "know" it's wrong — it just predicted the most likely next tokens, and they happened to be false. Why you care: This is the #1 risk for high-stakes features. If your AI feature involves facts (legal, medical, financial), you need a plan for hallucination.
5. RAG (Retrieval-Augmented Generation)
A system that fetches relevant real documents and hands them to the model before it generates a response. Instead of relying on what the model "remembers" from training, you give it actual source material. Why you care: RAG is often the difference between a Q4 feature (unreliable) and a Q1 feature (reliable). Ask your engineers: "Are we using RAG, and what documents are we retrieving?"
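A minimal sketch of the RAG flow: retrieve, then generate. Everything here is a stand-in: the three documents, the keyword-overlap retriever (real systems use embeddings and vector search), and the final string is just the prompt that would be sent to the model:

```python
# Stand-in document store. In production this is a search index or vector DB.
DOCS = [
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Shipping: standard delivery takes 3-5 business days within the EU.",
    "Security: all data is encrypted at rest and in transit.",
]

def retrieve(question: str, docs=DOCS, k: int = 1):
    """Toy retriever: rank documents by word overlap with the question.
    Real systems use embeddings + vector similarity instead."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Hand the retrieved source material to the model alongside the question."""
    context = "\n".join(retrieve(question))
    return (f"Answer using ONLY the sources below. If they don't contain "
            f"the answer, say so.\n\nSources:\n{context}\n\nQuestion: {question}")

print(build_prompt("How many days do customers have to request a refund?"))
```

The "answer using ONLY the sources" instruction is the reliability lever: the model grounds its response in retrieved text instead of whatever it half-remembers from training.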
RAG vs. fine-tuning: You'll also hear "fine-tuning" proposed as an alternative — training the model further on your data so the knowledge becomes baked into its weights. Fine-tuning changes how the model behaves permanently; RAG gives it access to documents at the moment of each query. Fine-tuning costs vary enormously by approach: open-source fine-tuning with LoRA/PEFT methods can cost $500–$5K; fine-tuning frontier models through provider APIs varies widely — check provider pricing pages for current rates, as costs have dropped significantly; full custom enterprise ML engagements can still run $50K–$500K+. The process typically takes days to weeks. RAG can be built in one sprint and keeps knowledge updatable. For most product decisions, start with RAG. Revisit fine-tuning if RAG hits its ceiling.
The PM's checklist: before you greenlight any AI feature
Before you say "yes" to an AI feature, ask these five questions:
- Where does this sit on the quadrant? High value + low reliability = mandatory human review gate.
- What happens when it's wrong? If a wrong answer causes financial, legal, or health harm → Q4 safeguards, no exceptions.
- Do we have an eval? If you can't measure reliability, you can't make a product decision. Commission the eval first.
- Are we using the right model tier? Not every task needs the most expensive model. Ask your engineer about model routing.
- Can reliability improve? If yes, schedule a re-evaluation in 3-6 months. Don't permanently close Q4 features — they might graduate to Q1.
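The five questions above can double as a literal pre-launch gate. A sketch with illustrative field names (this is one possible encoding of the checklist, not a standard):

```python
def greenlight_blockers(feature: dict) -> list[str]:
    """Return the reasons a feature can't ship yet; empty list = clear to go."""
    blockers = []
    if feature.get("eval_done") is not True:
        blockers.append("No eval: commission one before deciding")
    if feature.get("wrong_answer_harm") in {"financial", "legal", "health"} \
            and not feature.get("human_review_gate"):
        blockers.append("High-stakes domain without a human review gate")
    if feature.get("reliability") == "low" and not feature.get("human_review_gate"):
        blockers.append("Low reliability without safeguards (Q4)")
    return blockers

advice = {"eval_done": True, "wrong_answer_harm": "financial",
          "reliability": "low", "human_review_gate": False}
print(greenlight_blockers(advice))  # two blockers: review gate missing
```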
Back to that fintech team. Three months of damage control and $300,000 later, they added a mandatory human review gate before any AI-generated recommendation reached a customer. The product that could have launched right the first time spent six months being un-broken. The reliability quadrant would have put this feature in Q4 on day one — and saved the whole disaster.
Key takeaways
- You own the reliability decision. The PM decides which problems to put in front of an LLM — and which to keep away from it.
- Use the quadrant as a go/no-go gate. High value + low reliability = human-in-the-loop before shipping. No exceptions.
- Same model, different reliability. A model that's reliable at meeting summaries may be unreliable at financial advice. Check reliability for each specific domain.
- Unknown reliability = stop. If you can't measure it, you can't ship it. Commission an eval first.
- Q4 features can graduate. Revisit them on a cadence — better prompts, RAG, and evals can move features up the reliability axis.
Knowledge Check
1. A foundation model is described as having a 128k-token context window. What does this mean for a document summarization feature you're planning?
2. What is the difference between a fine-tuned model and a RAG (retrieval-augmented generation) system? When would you choose one over the other?
3. Your engineer says the model's "temperature is set too high." What user-facing behavior would you expect to see, and why might that be a problem?
4. A PM is evaluating an AI feature that generates personalised investment recommendations. Business value is high. The engineering team says reliability in this domain is low — the model invents regulations and misreads risk profiles. According to the reliability quadrant in this module, what should the PM do?