Production AI — Latency, Cost, and Observability
Instrument your AI system for latency, cost, and quality, then use prompt caching and model routing to cut costs by 75%.
The $4,200 surprise nobody saw coming
Priya's AI support bot worked perfectly. Users loved it. Engineers moved on to the next feature. Then the January AWS bill arrived: $4,200. Up from $800 in October. Nobody noticed because the feature "worked fine."
What happened? Every request was sending the same 8,000-token system prompt — and traffic had quietly tripled since launch. The model was doing the same work over and over, and Priya's team was paying full price every time.
The lesson: "It works" and "it's ready for production" are two very different things. Production means watching three things: how fast (latency), how much (cost), and how good (quality). Miss any one and you're flying blind.
The three dashboards you need before you ship
Every production AI system needs three observability layers. Think of them like the dashboard gauges in a car — speedometer, fuel gauge, and engine temperature. Any one can look fine while another is about to explode.
Why you need all three:
- Quality drops while latency and cost stay flat → the model was silently updated by the provider
- Cost spikes while quality stays fine → traffic grew or someone changed the prompt length
- Latency jumps while cost stays flat → the provider is overloaded
The on-call rule: Before closing any AI incident, check ALL THREE dashboards. A quality regression hides behind stable latency. A cost spike hides behind stable quality.
Production AI System Architecture
There Are No Dumb Questions
"What's p50, p95, p99?"
These are percentiles. p50 = the median (half of requests are faster, half are slower). p95 = 95% of requests are faster than this number. p99 = 99% are faster. You care about p95 and p99 because those are the worst experiences your users actually have — the median looks fine even when 5% of users are waiting 15 seconds.
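The point about the median hiding slow requests is easy to see in code. A small sketch using the nearest-rank method (sample numbers are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value that p% of samples fall at or below."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Ten request latencies in ms -- nine fast, one user waiting 15 seconds.
latencies = [120, 150, 180, 200, 240, 300, 450, 800, 1200, 15000]

p50 = percentile(latencies, 50)   # 240 ms -- looks perfectly healthy
p95 = percentile(latencies, 95)   # 15000 ms -- the experience 5% of users get
```

The p50 here is a comfortable 240 ms while one in ten users waits 15 seconds, which is exactly why you alert on p95/p99, not the median.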
"What's 'observability'?"
It's just a fancy word for "being able to see what's happening inside your system." Logs, metrics, dashboards, alerts — anything that lets you answer "why is this broken?" without guessing.
Cutting costs in the right order
Not all optimisations are equal. Here's Priya's playbook, ranked by impact:
The rule: Tier 1 changes deliver 10-100× more savings than all Tier 3 changes combined. Always start at the top.
Tier 1a: Prompt caching — stop paying for the same prompt twice
If your system prompt is the same on every request (and it usually is), you're paying full price to process identical text thousands of times a day. Prompt caching tells the API: "Hey, you've seen this prompt before — reuse the processed version."
Requirements:
- Prompt must be at least 1,024 tokens (most system prompts are)
- Works on Anthropic (90% discount — cached tokens cost 10% of standard rate) and OpenAI (50% discount on cached tokens — verify current rates at each provider's pricing page)
Priya's system prompt was 8,000 tokens, repeated on every request. With caching, she hit a 90% cache rate. Savings: ~$1,890/month (based on early 2025 rates).
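You can reproduce that savings figure with a back-of-envelope calculator. The traffic figure (~3,240 requests/day) and the $3/M input rate are assumptions chosen to match Priya's numbers, and the 90% discount is the early-2025 Anthropic cached-read rate — verify current rates before relying on this.

```python
def monthly_prompt_cost(requests_per_day, prompt_tokens, price_per_m_input,
                        cache_hit_rate=0.0, cached_discount=0.90):
    """Monthly cost of processing the system prompt, with optional caching."""
    base = requests_per_day * 30 * prompt_tokens / 1e6 * price_per_m_input
    # Cache misses pay full price; hits pay (1 - discount) of the normal rate.
    return base * ((1 - cache_hit_rate) + cache_hit_rate * (1 - cached_discount))

# Assumed traffic and rate, picked to match Priya's scenario:
before = monthly_prompt_cost(3240, 8000, 3.00)                       # ~$2,333
after = monthly_prompt_cost(3240, 8000, 3.00, cache_hit_rate=0.90)   # ~$443
savings = round(before - after)                                      # ~$1,890
```

The useful takeaway from the formula: savings scale with `cache_hit_rate × cached_discount`, so a long prompt that rarely changes is the ideal caching candidate.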
Tier 1b: Model routing — stop paying specialist prices for basic tasks
You learned this in the LLM Fundamentals module, but here's how it plays out in production. Priya added a lightweight classifier that tags each incoming message:
| Category | % of traffic | Model | Cost per request |
|---|---|---|---|
| Simple (FAQ, status check) | 60% | Haiku | $0.003 (approximate — check Anthropic's pricing page for current rates) |
| Complex (troubleshooting) | 40% | Sonnet | $0.04 (approximate — check Anthropic's pricing page for current rates) |
Savings: $840/month.
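The routing step can be sketched in a few lines. The keyword classifier below is a toy stand-in — Priya's team used a lightweight LLM classifier, and the model names and keyword list here are illustrative placeholders, not a recommendation.

```python
import re

# Hypothetical keywords that mark a query as a simple FAQ/status request.
SIMPLE_KEYWORDS = {"refund", "status", "hours", "password", "pricing"}

def route(message: str) -> str:
    """Toy router: cheap model for simple queries, stronger model otherwise.

    A production system would replace the keyword check with a small
    classifier model; the returned names are placeholders.
    """
    words = set(re.findall(r"[a-z]+", message.lower()))
    if words & SIMPLE_KEYWORDS:
        return "claude-haiku"    # ~$0.003/request in Priya's table
    return "claude-sonnet"       # ~$0.04/request for troubleshooting

route("what are your support hours?")                       # simple -> haiku
route("my device keeps rebooting after the firmware update")  # complex -> sonnet
```

At Priya's 60/40 traffic split, the blended cost is 0.6 × $0.003 + 0.4 × $0.04 ≈ $0.018 per request, versus $0.04 for sending everything to Sonnet.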
Tier 2: Reduce RAG context
Priya's retrieval pipeline fetched 8 chunks per query. She ran an eval comparing 8 chunks vs. 4 chunks — quality stayed the same. She cut to 4. Half the context tokens, same quality.
Savings: $420/month.
The final bill
| Before | After | Savings |
|---|---|---|
| $4,200/month | $1,050/month | 75% reduction |
(Based on illustrative early 2025 pricing; actual savings depend on current rates)
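The line items above add up exactly; a quick arithmetic check:

```python
before = 4200
line_items = {            # monthly savings from each change, per the text
    "prompt caching": 1890,
    "model routing": 840,
    "RAG context cut": 420,
}
after = before - sum(line_items.values())   # 4200 - 3150 = 1050
reduction = 1 - after / before              # 0.75, i.e. a 75% cut
```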
And the hardest part? Convincing the team to cut RAG chunks before they had eval coverage to prove it was safe. (They built evals first — the Evals module paid for itself.)
Latency: why your users are waiting
AI API calls are slow — a typical response takes 1-8 seconds. Users notice. Here's where the time goes:
| Phase | What happens | Typical time |
|---|---|---|
| Network | Request travels to the API | 50-200 ms |
| Time to First Token (TTFT) | Model processes your input | 200-2,000 ms |
| Token generation | Model outputs tokens one by one | 1,000-6,000 ms |
| Post-processing | Your code parses and processes | 10-100 ms |
TTFT (Time to First Token) is the most important metric for user-facing features. It's the delay before the first character appears on screen. A 3-second TTFT feels like the app is frozen; a 200ms TTFT feels instant (even if the full response takes 5 seconds) because the user sees text appearing immediately.
How to reduce TTFT:
- Shorter input prompts (fewer tokens to process before generating)
- Prompt caching (cached prompts process faster)
- Model routing (smaller models have faster TTFT)
- Streaming (show tokens as they generate instead of waiting for the full response)
There Are No Dumb Questions
"What's streaming?"
Without streaming, you wait for the entire response to generate, then show it all at once. With streaming, each token appears on screen as soon as it's generated — like watching someone type. Same total wait time, but the perceived wait is much shorter because the user sees progress immediately.
"When does streaming NOT help?"
When your code needs the complete response before it can do anything — like parsing JSON (you need the closing brace before you can parse). Or when the response is very short (nothing meaningful to stream).
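The difference between perceived and total latency is easiest to see in code. This simulation uses a fake generator in place of a real streaming API (the function name and timings are made up); the loop structure, though, is the same shape you'd use with a real streaming client.

```python
import time

def fake_model_stream(tokens, ttft_s=0.2, per_token_s=0.05):
    """Stand-in for a streaming API: first token after ttft_s seconds,
    then one token every per_token_s seconds. Timings are illustrative."""
    time.sleep(ttft_s)            # model processing the input (this is TTFT)
    for tok in tokens:
        yield tok
        time.sleep(per_token_s)   # token-by-token generation

start = time.monotonic()
ttft = None
received = []
for tok in fake_model_stream("The reset should fix it .".split()):
    if ttft is None:
        ttft = time.monotonic() - start   # perceived wait: first text on screen
    received.append(tok)
total = time.monotonic() - start
# ttft is ~0.2 s; total is ~0.5 s -- the user sees progress 0.3 s early.
```

Without streaming, the user would stare at a blank screen for the full `total`; with it, they wait only `ttft` before text starts appearing.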
Quality monitoring: catching silent regressions
Here's a nightmare scenario: the AI provider silently updates the model. Your latency and cost dashboards look fine. But response quality has dropped 20% — and you don't find out until a customer complains two weeks later.
The fix: Run an automated quality eval on a sample of live traffic. Every day, score 50-100 responses with your LLM-as-judge (from the Evals module). If the score drops below a threshold, trigger an alert.
| Metric | How to measure | Alert threshold |
|---|---|---|
| Eval score | LLM-as-judge on 50+ daily samples | Drop > 10% from baseline |
| User satisfaction | Thumbs up/down on responses | Thumbs down > 15% |
| Hallucination rate | Automated fact-checking against source docs | Rate > 5% |
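The daily eval-score check from the table can be wired up as a small job. The `judge` stub below is a placeholder for a real LLM-as-judge call (from the Evals module), and the baseline score is an assumed rolling average — both are illustrative, not prescriptive.

```python
import random

BASELINE_SCORE = 4.2   # assumed rolling baseline (1-5 scale) from prior weeks
ALERT_DROP = 0.10      # alert if the daily mean drops >10% below baseline

def judge(response: str) -> float:
    """Placeholder for an LLM-as-judge call returning a 1-5 score.
    Here it returns a random healthy score so the sketch is runnable."""
    return 4.0 + random.random()

def daily_quality_check(sampled_responses):
    """Score a daily sample of live responses and compare to baseline."""
    scores = [judge(r) for r in sampled_responses]
    mean = sum(scores) / len(scores)
    if mean < BASELINE_SCORE * (1 - ALERT_DROP):
        return ("ALERT", mean)   # page whoever owns the AI feature
    return ("OK", mean)

# Run on 50 sampled responses, per the table above:
status, mean_score = daily_quality_check(["<response text>"] * 50)
```

This is the check that would have caught the silent model update: cost and latency stay flat, but the daily mean score crosses the 10%-drop threshold within a day instead of two weeks later.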
Back to Priya's $4,200 bill. After the post-mortem, the team made three changes: they enabled prompt caching on the 8,000-token system prompt, routed FAQ queries to Haiku, and cut RAG context from 8 chunks to 4. Month 2 bill: $1,050. That's a 75% reduction from configuration changes alone — no re-architecture, no new infrastructure. The three dashboards caught another problem (TTFT spiking when the cache missed) before users noticed. Month 3 was stable. Production AI is mostly instrumentation and configuration.
Key takeaways
- Instrument latency, cost, AND quality before you ship. Any one can look fine while another explodes.
- Optimise costs in order: Prompt caching first, then model routing, then everything else. Tier 1 delivers 10-100× more savings than Tier 3.
- TTFT matters more than total latency for user-facing features. Streaming + shorter prompts + caching reduce it.
- Monitor quality continuously. Silent model updates can degrade quality without touching latency or cost metrics.
Knowledge Check
1. Anthropic offers prompt caching. What is the minimum prompt length required to be eligible, and what is the discount on cached input token reads?
2. You serve 10,000 requests/day averaging 800 input tokens and 200 output tokens. Approximately how much would you save monthly by routing all traffic from GPT-4o ($2.50/M input, $10/M output) to GPT-4o-mini ($0.15/M input, $0.60/M output) — OpenAI pricing as of early 2025; verify current rates at openai.com/pricing?
3. What is time-to-first-token (TTFT), why does it matter more than total latency in streaming applications, and which choice most directly reduces it?
4. A production AI feature starts returning lower quality responses after a provider silently updates the underlying model. Which monitoring metric would have surfaced this regression earliest?