LLMs Are Next-Token Predictors — Everything Follows From That
Understand the token prediction loop, estimate token costs with the ¾-word rule, and apply model routing to cut inference spend.
You already know how an LLM works
Open your phone. Start typing a text message. See those word suggestions above the keyboard? Tap one. Now another. Keep tapping suggested words and you get a sentence — maybe a weird one, but a sentence.
You just did what an LLM does. It picks the next word (well, token — we'll get to that), adds it to the sentence, then picks the next one. Over and over. Thousands of times per response.
That's it. That's the whole trick. Everything else — the cost, the speed, the mistakes, the magic — flows from that one loop.
The prediction loop, step by step
Here's what happens every time you send a message to an LLM:
Let's walk through each step:
Step 1 — The Shredder (Tokeniser). Your message gets chopped into small pieces called tokens. Think of a paper shredder — it doesn't care about your words, it just cuts at fixed points. "Hello, world!" becomes four pieces: Hello | , | world | !
Step 2 — The Number Translator. Each piece gets a number. "Hello" might be #9906. The computer only understands numbers, so every piece needs an ID — like how every student in school gets a student number.
Step 3 — The Meaning Map (Embedding). The model places each numbered piece on a giant map where similar meanings are close together. "Happy" and "joyful" are neighbours. "Happy" and "refrigerator" are far apart. This map has hundreds of dimensions — way more than the 2D maps you're used to.
Step 4 — The Brain (Transformer). This is where the magic lives. The transformer looks at all the pieces on the map and asks: "Given everything I've seen so far, what piece should come next?" It scores every possible next piece and picks one.
Step 5 — Loop. The chosen piece gets added to the input. The transformer runs again. And again. And again — until it decides it's done (by outputting a special "stop" token).
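The five steps above can be sketched as a toy loop. This is illustrative Python, not a real model — predict_next() is a stub standing in for the transformer, which in reality scores ~100,000 candidate tokens on every pass:

```python
def predict_next(tokens):
    # Stand-in for the transformer (step 4): a real model scores every
    # possible next token. This stub just replays a canned continuation.
    canned = ["Hello", ",", " world", "!", "<stop>"]
    return canned[len(tokens)] if len(tokens) < len(canned) else "<stop>"

def generate(prompt_tokens, max_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):            # step 5: the loop
        next_token = predict_next(tokens)  # one full pass per token
        if next_token == "<stop>":         # the special stop token ends it
            break
        tokens.append(next_token)          # output is fed back in as input
    return tokens

print(generate([]))  # → ['Hello', ',', ' world', '!']
```

Note that every iteration is a full pass: a 500-token answer means 500 trips through that loop, which is exactly why long responses cost more.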
There Are No Dumb Questions
"Wait — it loops? So a 500-word answer means the transformer runs 500+ times?"
Yep. Every single token requires a full pass through the transformer. A 500-token answer costs roughly 500x more compute than a 1-token answer. That's why long responses are expensive and slow.
"Does it plan ahead? Like, does it know how the sentence will end?"
Nope. Zero planning. It only ever picks the next token. It's like writing a story one word at a time without knowing where it's going. The fact that the output usually makes sense is what's remarkable — and why it sometimes doesn't.
Tokens are the currency — learn to count them
Every API call costs money. The price tag? Tokens. Not words — tokens. So you need to know how tokens and words relate.
Here's the cheat code:
The ¾-Word Rule: 1 word ≈ 1.33 tokens. Or flip it: 1 token ≈ ¾ of a word.
To estimate: token count = word count × 1.33
Caveat: This rule holds for typical English prose. Code, non-English text, and technical terms often tokenize less efficiently — sometimes 2–3× more tokens per word. For accurate billing estimates, validate with your model provider's tokenizer before building a production cost model.
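The ¾-word rule as code — a back-of-envelope sketch, not a tokenizer-accurate count (the 1.33 multiplier is the estimate above):

```python
def estimate_tokens(text, tokens_per_word=1.33):
    """Rough token count via the ¾-word rule: words × 1.33."""
    return round(len(text.split()) * tokens_per_word)

def estimate_cost_usd(text, price_per_million_tokens):
    """Rough input cost for one call at a given per-million-token price."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000

brief = "word " * 10_000               # a ~10,000-word document
print(estimate_tokens(brief))          # → 13300
print(estimate_cost_usd(brief, 3.00))  # ≈ $0.04 at $3/M tokens
```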
Let's see why. "Hello, world!" gets tokenised into four tokens: Hello | , | world | !
Two words became four tokens. The comma and exclamation mark each count separately, and the space before a word gets bundled with the word (" world" is one token, not two).
Why this matters for your wallet: verbose prompts with lots of punctuation and formatting burn more tokens than they appear to. Trimming filler from your prompts is free money.
The context window: your model's short-term memory
Every model has a context window — the maximum number of tokens it can hold in memory at once. Think of it like a desk: you can only spread out so many papers before things start falling off.
| Model | Context window | Roughly how many words |
|---|---|---|
| Claude Haiku | 200k tokens | ~150,000 words |
| Claude Sonnet | 200k tokens | ~150,000 words |
| GPT-4o | 128k tokens | ~96,000 words |
(figures for claude-haiku-3 and claude-sonnet-4 as of 2025 — verify model-specific limits at anthropic.com)
Pricing is illustrative and changes frequently — check provider documentation for current rates.
Here's the trap: the context window includes both your input AND the model's output. So if you stuff 190k tokens of input into a 200k-token window, the model can only generate a 10k-token response before it hits the wall.
And here's the worse trap: blow the context window and your API call fails at runtime, not at design time. No compiler error. No warning. Your app just crashes in production when a user sends a long enough input.
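One way to dodge the runtime trap is a pre-flight budget check before every call. A minimal sketch, assuming you already have a token count for the input — the window size mirrors the table above, but production code should use the provider's tokenizer and documented per-model limits:

```python
CONTEXT_WINDOW = 200_000  # e.g. a 200k-token model; verify per model

def output_budget(input_tokens, desired_output_tokens, window=CONTEXT_WINDOW):
    """Fail loudly BEFORE the API call, and report the real output budget.

    The window covers input AND output, so the space left for the
    response is whatever the input didn't consume.
    """
    available = window - input_tokens
    if available <= 0:
        raise ValueError(
            f"input alone ({input_tokens} tokens) exceeds the {window}-token window"
        )
    return min(desired_output_tokens, available)

# Stuff 190k tokens of input into a 200k window and ask for a 50k response:
print(output_budget(190_000, 50_000))  # → 10000 — only 10k left for output
```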
There Are No Dumb Questions
"If the context window is 200k tokens, can the model actually USE all 200k equally well?"
No! Research shows models are worst at finding information stuck in the middle of a long context. They're best at stuff near the beginning and the end. It's called the "lost in the middle" problem (Liu et al., 2023) — like how you remember the first and last items on a grocery list but forget the middle ones.
Model routing: stop paying luxury prices for basic tasks
Here's a real scenario. Priya, a backend engineer at a legal-tech startup, blew through her API budget in week one. She sent every document — no matter how simple — to Claude Sonnet. Monday morning: $39.90 in charges from a single overnight run. 1,000 legal briefs, all routed to Sonnet. At that rate, monthly spend would hit $1,197.
Then she ran the numbers:
| Workload | Sonnet ($3/M tokens) | Haiku ($0.25/M tokens) | Savings |
|---|---|---|---|
| 1 brief (10k words ≈ 13,300 tokens) | $0.04 | ~$0.0033 | ~92% |
| 1,000 briefs/day | $40/day | ~$3.33/day | ~92% |
| Monthly (30 days) | $1,200 | ~$100 | ~92% |
The fix? Model routing — sending each task to the cheapest model that can handle it well.
Think of it like a hospital triage system: routine cases go to the duty nurse, and only the complicated ones get escalated to the specialist.
(Pricing as of early 2025 — verify current rates at anthropic.com/pricing)
"Extract the party names from this contract" — that's a Haiku job. No reasoning needed. $0.003.
"Identify every clause that creates liability and rank them by risk" — that demands deeper reasoning. Sonnet earns its cost here.
(Diagram: how a prompt flows through an LLM.)
The starting point: For many pipelines, the majority of calls are simple extractions or classifications that smaller models handle well — reserve larger models for tasks that genuinely require complex reasoning. The right split depends on your specific task mix; measure accuracy on each tier with a sample eval before committing to a routing strategy in production.
Why LLMs make stuff up (and what to do about it)
Here's a conversation between Token (an LLM token) and User about hallucinations:
User: Why do you sometimes make up facts that sound totally real?
Token: Because I don't know facts. I predict what token comes next based on patterns I saw during training. If the most likely next token after "The capital of Australia is" is "Sydney" — because lots of text on the internet says that — I'll say Sydney. Even though it's wrong. (It's Canberra.)
User: That's terrifying. Can you at least tell me when you're not sure?
Token: Not really. My confidence score tells you how likely I think a token is compared to alternatives. It does NOT tell you whether the statement is true. I can be 99% confident about a completely false statement — because the pattern I learned was wrong, or the context is misleading.
User: So how do engineers deal with this?
Token: Three ways. RAG (Retrieval-Augmented Generation) — give me real documents to reference so I'm not relying only on my training data. Evals — test my answers against known-correct answers systematically. Human review gates — have a human check my work for high-stakes outputs.
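The evals idea fits in a few lines: run known questions through the model and compare against known-correct answers. Here fake_llm() is a hypothetical stand-in for a real API call, hard-coded to make the Canberra mistake from the dialogue above:

```python
def fake_llm(question):
    # Stand-in for a real model call — confidently wrong, like the
    # hallucination example above.
    answers = {"What is the capital of Australia?": "Sydney"}
    return answers.get(question, "unknown")

eval_set = [
    ("What is the capital of Australia?", "Canberra"),
]

def run_evals(llm, cases):
    """Return (pass count, list of (question, got, expected) failures)."""
    failures = [
        (q, llm(q), expected)
        for q, expected in cases
        if llm(q) != expected
    ]
    return len(cases) - len(failures), failures

passed, failures = run_evals(fake_llm, eval_set)
print(passed, failures)
# → 0 [('What is the capital of Australia?', 'Sydney', 'Canberra')]
```

Because the model can't flag its own falsehoods, a systematic check like this — run on every prompt change — is the only way to catch regressions before users do.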
There Are No Dumb Questions
"If LLMs just predict tokens, how do they seem to 'reason'?"
Great question. When you see an LLM work through a problem step by step, it's not "thinking" the way you do. It's predicting that the next most likely tokens form a reasoning chain — because it was trained on millions of examples of humans reasoning step by step. The output looks like reasoning because the training data contained reasoning. Whether it truly "understands" is still debated, but for engineering purposes: treat it as a sophisticated pattern matcher, not a thinker.
Temperature: the creativity dial
When the model scores all possible next tokens, temperature controls how it picks from that list.
| Temperature | What happens | Good for |
|---|---|---|
| 0 | Always picks the highest-scored token | Factual answers, code, extraction |
| 0.3–0.7 | Usually picks high-scored tokens but sometimes goes off-script | General conversation, analysis |
| 1.0+ | Lower-scored tokens get a real shot | Creative writing, brainstorming |
Think of it like a restaurant ordering system:
- Temperature 0: You always order the #1 most popular dish. Predictable. Safe. Boring.
- Temperature 0.5: You usually order a popular dish but sometimes try something new.
- Temperature 1.0: You might order anything on the menu. Adventurous. Sometimes amazing. Sometimes... squid ink ice cream.
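Under the hood, temperature divides the model's raw scores (logits) before they're turned into probabilities via softmax. A minimal sketch with made-up scores for three candidate "dishes":

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, reshaped by temperature."""
    if temperature == 0:
        # Greedy: all probability mass on the top-scored token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]   # low temp sharpens, high temp flattens
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

menu = ["pizza", "sushi", "squid-ink ice cream"]
logits = [3.0, 2.0, 0.5]  # made-up scores
for t in (0, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [f"{p:.2f}" for p in probs])
# At t=0 pizza gets probability 1.0; as t rises, the long shots
# (squid-ink ice cream) get a real chance.
```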
There Are No Dumb Questions
"If I set temperature to 0, will I get the exact same response every time?"
Surprisingly, no — not always. Even at temperature=0, floating-point math on different hardware can produce tiny rounding differences that occasionally change which token gets picked. It's mostly deterministic, but don't build systems that require identical outputs from identical inputs.
How models get trained (the 60-second version)
You don't need to train models — but you need to know why they behave the way they do. Three stages:
Stage 1 — Pre-training: Reading the internet. The model reads billions of web pages, books, and code. It learns patterns: grammar, facts, writing styles, reasoning patterns. This is expensive (millions of dollars) and produces a model that can complete text but isn't helpful yet — like a kid who's read every book in the library but has zero social skills.
Stage 2 — Fine-tuning: Learning to be helpful. Humans write example conversations: "If a user asks X, a good response looks like Y." The model trains on thousands of these examples to learn how to be an assistant, not just a text completer.
Stage 3 — RLHF: Learning from feedback. Humans rate the model's responses: "This answer was great, this one was terrible." The model adjusts to produce more highly-rated outputs. This is the stage that teaches models to refuse harmful requests, stay on topic, and be genuinely useful. This post-training alignment stage — using human feedback to reward helpful, safe responses — is central to the behaviour you see in frontier models. (Modern models like GPT-4o use a combination of RLHF, supervised fine-tuning, and additional alignment techniques; RLHF is one key component. Anthropic trains Claude via Constitutional AI — a related approach that uses AI-generated feedback to supplement human feedback, reducing (but not eliminating) reliance on human raters.)
Key takeaways
- The whole game is next-token prediction. Every LLM feature, cost, and failure mode traces back to this one loop.
- Tokens ≠ words. Use the ¾-word rule (words × 1.33) to estimate token counts and costs before you build.
- Route aggressively. Send simple tasks to cheap models. Reserve expensive models for hard reasoning. That single change can significantly cut pipeline costs.
- Temperature controls creativity vs. consistency. Low for facts, high for brainstorming.
- LLMs don't know things — they predict things. That's why they hallucinate, and why you need RAG, evals, and human review.
Knowledge Check
1. A pipeline processes 1,000 legal briefs per day, each approximately 10,000 words. Using the ¾-word rule and Claude Sonnet pricing of $3 per million input tokens (as of early 2025), what is the estimated daily input cost?
2. You stuff a 90k-token legal document into a Claude Sonnet prompt and ask a question whose answer appears on page 40 of 80. Based on the "lost in the middle" research, what should you expect?
3. Why doesn't setting temperature=0 guarantee that two identical API calls return identical output?
4. Which post-training approach is most directly responsible for teaching models like GPT-4o to decline harmful requests and follow instructions helpfully?