Production AI — Latency, Cost, and Observability
Instrument your AI system for latency, cost, and quality, then use prompt caching and model routing to cut costs by 75%.
The $4,200 surprise nobody saw coming
Priya's AI support bot worked perfectly. Users loved it. Engineers moved on to the next feature. Then the January AWS bill arrived: $4,200. Up from $800 in October. Nobody noticed because the feature "worked fine."
What happened? Every request was sending the same 8,000-token system prompt — and traffic had quietly tripled since launch. The model was doing the same work over and over, and Priya's team was paying full price every time.
The lesson: "It works" and "it's ready for production" are two very different things. Production means watching three things: how fast (latency), how much (cost), and how good (quality). Miss any one and you're flying blind.
The three dashboards you need before you ship
Every production AI system needs three observability layers. Think of them like the dashboard gauges in a car — speedometer, fuel gauge, and engine temperature. Any one can look fine while another is about to explode.
Why you need all three:
- Quality drops while latency and cost stay flat → the model was silently updated by the provider
- Cost spikes while quality stays fine → traffic grew or someone changed the prompt length
- Latency jumps while cost stays flat → the provider is overloaded
The on-call rule: Before closing any AI incident, check ALL THREE dashboards. A quality regression hides behind stable latency. A cost spike hides behind stable quality.
Production AI System Architecture
There Are No Dumb Questions
"What's p50, p95, p99?"
These are percentiles. p50 = the median (half of requests are faster, half are slower). p95 = 95% of requests are faster than this number. p99 = 99% are faster. You care about p95 and p99 because those are the worst experiences your users actually have — the median looks fine even when 5% of users are waiting 15 seconds.
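The point about the median hiding slow requests is easy to see in code. A small sketch using the nearest-rank method (sample numbers are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value that p% of samples fall at or below."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Ten request latencies in ms -- nine fast, one user waiting 15 seconds.
latencies = [120, 150, 180, 200, 240, 300, 450, 800, 1200, 15000]

p50 = percentile(latencies, 50)   # 240 ms -- looks perfectly healthy
p95 = percentile(latencies, 95)   # 15000 ms -- the experience 5% of users get
```

The p50 here is a comfortable 240 ms while one in ten users waits 15 seconds, which is exactly why you alert on p95/p99, not the median.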
"What's 'observability'?"
It's just a fancy word for "being able to see what's happening inside your system." Logs, metrics, dashboards, alerts — anything that lets you answer "why is this broken?" without guessing.
Cutting costs in the right order
Not all optimisations are equal. Here's Priya's playbook, ranked by impact:
The rule: Tier 1 changes deliver 10-100× more savings than all Tier 3 changes combined. Always start at the top.
Tier 1a: Prompt caching — stop paying for the same prompt twice
If your system prompt is the same on every request (and it usually is), you're paying full price to process identical text thousands of times a day. Prompt caching tells the API: "Hey, you've seen this prompt before — reuse the processed version."
Requirements:
- Prompt must be at least 1,024 tokens (most system prompts are)
- Works on Anthropic (90% discount — cached tokens cost 10% of standard rate) and OpenAI (50% discount on cached tokens — verify current rates at each provider's pricing page)
Priya's system prompt was 8,000 tokens, repeated on every request. With caching, she hit a 90% cache rate. Savings: ~$1,890/month (based on early 2025 rates).
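You can reproduce that savings figure with a back-of-envelope calculator. The traffic figure (~3,240 requests/day) and the $3/M input rate are assumptions chosen to match Priya's numbers, and the 90% discount is the early-2025 Anthropic cached-read rate — verify current rates before relying on this.

```python
def monthly_prompt_cost(requests_per_day, prompt_tokens, price_per_m_input,
                        cache_hit_rate=0.0, cached_discount=0.90):
    """Monthly cost of processing the system prompt, with optional caching."""
    base = requests_per_day * 30 * prompt_tokens / 1e6 * price_per_m_input
    # Cache misses pay full price; hits pay (1 - discount) of the normal rate.
    return base * ((1 - cache_hit_rate) + cache_hit_rate * (1 - cached_discount))

# Assumed traffic and rate, picked to match Priya's scenario:
before = monthly_prompt_cost(3240, 8000, 3.00)                       # ~$2,333
after = monthly_prompt_cost(3240, 8000, 3.00, cache_hit_rate=0.90)   # ~$443
savings = round(before - after)                                      # ~$1,890
```

The useful takeaway from the formula: savings scale with `cache_hit_rate × cached_discount`, so a long prompt that rarely changes is the ideal caching candidate.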
Tier 1b: Model routing — stop paying specialist prices for basic tasks
You learned this in the LLM Fundamentals module, but here's how it plays out in production. Priya added a lightweight classifier that tags each incoming message:
| Category | % of traffic | Model | Cost per request |
|---|---|---|---|
| Simple (FAQ, status check) | 60% | Haiku | $0.003 (approximate — check Anthropic's pricing page for current rates) |
| Complex (troubleshooting) | 40% | Sonnet | $0.04 (approximate — check Anthropic's pricing page for current rates) |
Savings: $840/month.
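The routing step can be sketched in a few lines. The keyword classifier below is a toy stand-in — Priya's team used a lightweight LLM classifier, and the model names and keyword list here are illustrative placeholders, not a recommendation.

```python
import re

# Hypothetical keywords that mark a query as a simple FAQ/status request.
SIMPLE_KEYWORDS = {"refund", "status", "hours", "password", "pricing"}

def route(message: str) -> str:
    """Toy router: cheap model for simple queries, stronger model otherwise.

    A production system would replace the keyword check with a small
    classifier model; the returned names are placeholders.
    """
    words = set(re.findall(r"[a-z]+", message.lower()))
    if words & SIMPLE_KEYWORDS:
        return "claude-haiku"    # ~$0.003/request in Priya's table
    return "claude-sonnet"       # ~$0.04/request for troubleshooting

route("what are your support hours?")                       # simple -> haiku
route("my device keeps rebooting after the firmware update")  # complex -> sonnet
```

At Priya's 60/40 traffic split, the blended cost is 0.6 × $0.003 + 0.4 × $0.04 ≈ $0.018 per request, versus $0.04 for sending everything to Sonnet.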
Tier 2: Reduce RAG context
Priya's retrieval pipeline fetched 8 chunks per query. She ran an eval comparing 8 chunks vs. 4 chunks — quality stayed the same. She cut to 4. Half the context tokens, same quality.
Savings: $420/month.
The final bill
| Before | After | Savings |
|---|---|---|
| $4,200/month | $1,050/month | 75% reduction |
(Based on illustrative early 2025 pricing; actual savings depend on current rates)
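The line items above add up exactly; a quick arithmetic check:

```python
before = 4200
line_items = {            # monthly savings from each change, per the text
    "prompt caching": 1890,
    "model routing": 840,
    "RAG context cut": 420,
}
after = before - sum(line_items.values())   # 4200 - 3150 = 1050
reduction = 1 - after / before              # 0.75, i.e. a 75% cut
```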
And the hardest part? Convincing the team to cut RAG chunks before they had eval coverage to prove it was safe. (They built evals first — the Evals module paid for itself.)
Latency: why your users are waiting
AI API calls are slow — a typical response takes 1-8 seconds. Users notice. Here's where the time goes:
| Phase | What happens | Typical time |
|---|---|---|
| Network | Request travels to the API | 50-200 ms |
| Time to First Token (TTFT) | Model processes your input | 200-2,000 ms |
| Token generation | Model outputs tokens one by one | 1,000-6,000 ms |
| Post-processing | Your code parses and processes | 10-100 ms |
TTFT (Time to First Token) is the most important metric for user-facing features. It's the delay before the first character appears on screen. A 3-second TTFT feels like the app is frozen; a 200ms TTFT feels instant (even if the full response takes 5 seconds) because the user sees text appearing immediately.
How to reduce TTFT:
- Shorter input prompts (fewer tokens to process before generating)
- Prompt caching (cached prompts process faster)
- Model routing (smaller models have faster TTFT)
- Streaming (show tokens as they generate instead of waiting for the full response)
There Are No Dumb Questions
"What's streaming?"
Without streaming, you wait for the entire response to generate, then show it all at once. With streaming, each token appears on screen as soon as it's generated — like watching someone type. Same total wait time, but the perceived wait is much shorter because the user sees progress immediately.
"When does streaming NOT help?"
When your code needs the complete response before it can do anything — like parsing JSON (you need the closing brace before you can parse). Or when the response is very short (nothing meaningful to stream).
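The difference between perceived and total latency is easiest to see in code. This simulation uses a fake generator in place of a real streaming API (the function name and timings are made up); the loop structure, though, is the same shape you'd use with a real streaming client.

```python
import time

def fake_model_stream(tokens, ttft_s=0.2, per_token_s=0.05):
    """Stand-in for a streaming API: first token after ttft_s seconds,
    then one token every per_token_s seconds. Timings are illustrative."""
    time.sleep(ttft_s)            # model processing the input (this is TTFT)
    for tok in tokens:
        yield tok
        time.sleep(per_token_s)   # token-by-token generation

start = time.monotonic()
ttft = None
received = []
for tok in fake_model_stream("The reset should fix it .".split()):
    if ttft is None:
        ttft = time.monotonic() - start   # perceived wait: first text on screen
    received.append(tok)
total = time.monotonic() - start
# ttft is ~0.2 s; total is ~0.5 s -- the user sees progress 0.3 s early.
```

Without streaming, the user would stare at a blank screen for the full `total`; with it, they wait only `ttft` before text starts appearing.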
Quality monitoring: catching silent regressions
Here's a nightmare scenario: the AI provider silently updates the model. Your latency and cost dashboards look fine. But response quality has dropped 20% — and you don't find out until a customer complains two weeks later.
The fix: Run an automated quality eval on a sample of live traffic. Every day, score 50-100 responses with your LLM-as-judge (from the Evals module). If the score drops below a threshold, trigger an alert.
| Metric | How to measure | Alert threshold |
|---|---|---|
| Eval score | LLM-as-judge on 50+ daily samples | Drop > 10% from baseline |
| User satisfaction | Thumbs up/down on responses | Thumbs down > 15% |
| Hallucination rate | Automated fact-checking against source docs | Rate > 5% |
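The daily eval-score check from the table can be wired up as a small job. The `judge` stub below is a placeholder for a real LLM-as-judge call (from the Evals module), and the baseline score is an assumed rolling average — both are illustrative, not prescriptive.

```python
import random

BASELINE_SCORE = 4.2   # assumed rolling baseline (1-5 scale) from prior weeks
ALERT_DROP = 0.10      # alert if the daily mean drops >10% below baseline

def judge(response: str) -> float:
    """Placeholder for an LLM-as-judge call returning a 1-5 score.
    Here it returns a random healthy score so the sketch is runnable."""
    return 4.0 + random.random()

def daily_quality_check(sampled_responses):
    """Score a daily sample of live responses and compare to baseline."""
    scores = [judge(r) for r in sampled_responses]
    mean = sum(scores) / len(scores)
    if mean < BASELINE_SCORE * (1 - ALERT_DROP):
        return ("ALERT", mean)   # page whoever owns the AI feature
    return ("OK", mean)

# Run on 50 sampled responses, per the table above:
status, mean_score = daily_quality_check(["<response text>"] * 50)
```

This is the check that would have caught the silent model update: cost and latency stay flat, but the daily mean score crosses the 10%-drop threshold within a day instead of two weeks later.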
Back to Priya's $4,200 bill. After the post-mortem, the team made three changes: they enabled prompt caching on the 8,000-token system prompt, routed FAQ queries to Haiku, and cut RAG context from 8 chunks to 4. Month 2 bill: $1,050. That's a 75% reduction from configuration changes alone — no re-architecture, no new infrastructure. The three dashboards caught another problem (TTFT spiking when the cache missed) before users noticed. Month 3 was stable. Production AI is mostly instrumentation and configuration.
Key takeaways
- Instrument latency, cost, AND quality before you ship. Any one can look fine while another explodes.
- Optimise costs in order: Prompt caching first, then model routing, then everything else. Tier 1 delivers 10-100× more savings than Tier 3.
- TTFT matters more than total latency for user-facing features. Streaming + shorter prompts + caching reduce it.
- Monitor quality continuously. Silent model updates can degrade quality without touching latency or cost metrics.
Knowledge Check
1. Anthropic offers prompt caching. What is the minimum prompt length required to be eligible, and what is the discount on cached input token reads?
2. You serve 10,000 requests/day averaging 800 input tokens and 200 output tokens. Approximately how much would you save monthly by routing all traffic from GPT-4o ($2.50/M input, $10/M output) to GPT-4o-mini ($0.15/M input, $0.60/M output) — OpenAI pricing as of early 2025; verify current rates at openai.com/pricing?
3. What is time-to-first-token (TTFT), why does it matter more than total latency in streaming applications, and which choice most directly reduces it?
4. A production AI feature starts returning lower quality responses after a provider silently updates the underlying model. Which monitoring metric would have surfaced this regression earliest?