Large Language Models Demystified
The technology behind ChatGPT and Claude — explained so you actually understand it, not just use it.
The intern who read the entire internet
Imagine you hired an intern and said: "Before you start, read everything. Every book in every library. Every Wikipedia article. Every Reddit thread. Every cookbook, legal brief, and love letter ever posted online. All of it."
Two months later, the intern shows up. They can write poetry, debug code, explain quantum physics, and draft legal contracts. They've never practiced any of these skills — they just read so many examples that they absorbed the patterns.
That intern is a large language model. And the "reading everything" part? OpenAI CEO Sam Altman stated in 2023 that training GPT-4 cost more than $100 million — though OpenAI has not officially released figures. Companies are now spending billions on their next models.
The result: a system that's shockingly good at generating human-like text — and shockingly bad at things you'd consider simple, like counting the number of r's in "strawberry." Understanding why requires understanding how these models actually work.
What makes an LLM "large"
The "large" in Large Language Model refers to one thing: the number of parameters (the adjustable weights from our neural networks module).
| Model | Parameters | Analogy |
|---|---|---|
| A tiny neural network | 1,000 | A calculator |
| A medium network | 1 million | A smartphone |
| GPT-3 (2020) | 175 billion | A small city's worth of calculators |
| GPT-4 (2023) | ~1.8 trillion (unverified est.*) | Every calculator on Earth |
| Claude, Llama, Gemini | Billions to trillions | Same ballpark |
* OpenAI has not officially confirmed GPT-4's parameter count; the 1.8 trillion figure comes from unverified third-party analysis. What matters is the order of magnitude — modern frontier models operate in the hundreds of billions to potentially trillions of parameters.
More parameters = more "volume knobs" the model can adjust = more complex patterns it can learn. But there's a law of diminishing returns — going from 1 billion to 10 billion parameters makes a huge difference. Going from 100 billion to 200 billion? Smaller improvement.
The size also determines the cost. More parameters means:
- More GPUs to train (thousands of GPUs running 24/7 for months)
- More GPUs to run (every user query passes through all those parameters)
- More electricity (training a large model can use as much energy as a small town uses in a year)
There Are No Dumb Questions
"Is bigger always better?"
No. Smaller models that are well-trained on high-quality data can outperform larger models trained on mediocre data. Meta's Llama 3 8B (8 billion parameters) outperforms other models of similar or larger size on many benchmarks because it was trained on better data (per Meta's benchmark comparisons, 2024). The trend in 2024-2025 has been toward smaller, more efficient models — not just bigger ones.
"What's a parameter, again? I heard about them in the neural networks module."
Same thing — a weight in the neural network. When people say GPT-4 has 1.8 trillion parameters, they mean 1.8 trillion volume knobs that were adjusted during training. Each knob controls the strength of one connection between neurons.
The transformer: the architecture that changed everything
Every modern LLM is built on an architecture called the transformer, invented by Google researchers in 2017 (Vaswani et al., 2017). The key innovation: attention.
Before transformers, language models read text one word at a time, left to right — like reading through a straw. They struggled with long sentences because by the time they reached the end, they'd forgotten the beginning.
The transformer reads all words simultaneously and uses attention to decide which words matter most for each other word. Think of attention as a highlighter:
You're reading this sentence: "The cat sat on the mat because it was tired."
What does "it" refer to? The cat or the mat? You know instantly — because you pay attention to the relationship between "it" and "cat" (not "mat," because mats don't get tired).
The transformer does the same thing, mathematically. For every word, it calculates an "attention score" with every other word, figuring out which relationships matter.
This attention mechanism is why transformers are so good at language. They can track relationships across entire paragraphs, pages, even book-length inputs — something previous architectures couldn't do.
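The attention calculation can be sketched in a few lines. This is a toy version of scaled dot-product attention in plain NumPy — real transformers add learned projection matrices for Q, K, and V, plus multiple attention heads, but the core "score every word against every other word" step looks like this:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each row of Q scores every row of K,
    and the softmaxed scores weight the rows of V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # how much each word "looks at" every other word
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

# 3 "words", each represented as a 4-dimensional toy embedding
np.random.seed(0)
x = np.random.randn(3, 4)
out, w = attention(x, x, x)   # self-attention: Q, K, V all derived from the same words
print(w.round(2))             # each row = one word's attention spread over all three words
```

Each row of `w` is one word's "highlighter budget": it always sums to 1, split across every word in the input.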
How an LLM gets built: the training pipeline
Building an LLM happens in three stages. Each stage serves a different purpose, and each one changes the model's behavior in a specific way.
Stage 1: Pre-training — reading the internet
The model reads trillions of tokens of text — books, websites, code repositories, scientific papers. Its only job during this stage: predict the next word.
Given "The Eiffel Tower is located in ___," the model learns that "Paris" is the most likely next word. It repeats this prediction billions of times, across an enormous variety of text sequences.
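Here is the pre-training objective in miniature: a bigram counter over a tiny made-up corpus. Real models condition on thousands of tokens of context rather than one word, and learn with a neural network rather than a lookup table — but "predict the next word from observed patterns" is the same idea:

```python
from collections import Counter, defaultdict

# A made-up mini "internet" to learn from
corpus = ("the eiffel tower is located in paris . "
          "the eiffel tower is in france . "
          "the capital of france is paris .").split()

# Count which word follows which (one word of context;
# real LLMs condition on thousands of tokens)
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(word):
    """Return the most frequently observed next word after `word`."""
    return follows[word].most_common(1)[0][0]

print(predict("eiffel"))   # -> 'tower'
```

Notice there is no "fact" stored anywhere — only counts of which word tends to follow which. That distinction matters later when we talk about hallucinations.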
After pre-training, the model is like a very well-read person with no social skills. It can complete any text pattern it's seen, but if you ask it a question, it might respond with another question, or continue your text as if it's writing an essay, or do something completely unhelpful.
Stage 2: Fine-tuning — learning to be helpful
Humans write thousands of example conversations:
- User: What's the capital of France?
- Assistant: The capital of France is Paris.
The model trains on these examples to learn the format of being a helpful assistant — answer questions directly, be clear and concise, use a conversational tone. This stage is like teaching that well-read person how to actually have a conversation instead of just reciting encyclopedia entries.
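Fine-tuning data is commonly stored as structured conversations. Here is a sketch of a single training example, assuming a simple JSONL-style layout — the field names are illustrative, not any vendor's exact schema:

```python
import json

# One training example: a conversation the model learns to imitate.
# During fine-tuning, the model is trained to produce the assistant
# turn, given the turns that came before it.
example = {
    "messages": [
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# A fine-tuning dataset is thousands of lines like this (JSONL: one JSON object per line)
line = json.dumps(example)
print(line)
```

Multiply this by thousands of curated conversations and you have a fine-tuning dataset.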
Stage 3: RLHF — learning from human feedback
RLHF stands for Reinforcement Learning from Human Feedback. Here's how it works:
- The model generates two different responses to the same question
- A human reviewer picks which response is better
- The model adjusts to produce more responses like the winner
Think of it as a teacher grading essays. The teacher doesn't write the essay for the student — they just say "this one is better than that one, and here's why." Over thousands of comparisons, the student (model) learns what "good" looks like.
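The "this one is better" signal is typically converted into a number by a reward model trained on those comparisons. A minimal sketch of the preference loss behind that idea — a Bradley-Terry-style formulation, with made-up reward values:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry-style loss used to train a reward model:
    small when the human-preferred response scores higher than the
    rejected one, large when the reward model ranks them backwards."""
    margin = reward_chosen - reward_rejected
    sigmoid = 1 / (1 + math.exp(-margin))   # probability the ranking is "right"
    return -math.log(sigmoid)

# Reward model already rates the chosen answer higher -> low loss
print(round(preference_loss(2.0, -1.0), 3))   # -> 0.049
# Rates them backwards -> high loss, a strong push to adjust
print(round(preference_loss(-1.0, 2.0), 3))   # -> 3.049
```

Over thousands of comparisons, minimizing this loss is the mathematical version of the teacher saying "this essay is better than that one."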
RLHF is the stage that teaches models to:
- Refuse harmful requests ("I can't help you build a weapon")
- Admit uncertainty ("I'm not sure about that")
- Stay on topic instead of rambling
- Be genuinely helpful rather than just technically correct
There Are No Dumb Questions
"If the model learned everything from the internet, does it 'know' everything on the internet?"
No. It learned patterns from the internet, not facts. It knows that "Paris" often follows "capital of France" because that pattern appeared millions of times. But it doesn't have a database of facts it can look up. That's why it sometimes confidently states things that are wrong — the pattern-matching produced a plausible-sounding but incorrect result.
"Why do companies keep training new models instead of just updating the old ones?"
Pre-training is a one-shot deal — you can't easily add new knowledge to a pre-trained model. The model's knowledge is frozen at the time of training. To include knowledge about events after the training cutoff, companies either retrain from scratch (expensive) or use techniques like RAG (retrieval-augmented generation) to give the model access to current information at query time.
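The RAG idea fits in a few lines: fetch relevant text at query time and prepend it to the prompt, so the model can use information newer than its training cutoff. The word-overlap "retrieval" below is a deliberate toy stand-in for the embedding-based search real systems use:

```python
# Made-up documents representing post-cutoff information
documents = [
    "2025-06-01: The city council approved the new transit plan.",
    "2024-11-12: The library extended its weekend hours.",
]

def retrieve(query, docs):
    """Toy retrieval: rank documents by word overlap with the query.
    Real RAG systems use embedding similarity instead."""
    qwords = set(query.lower().split())
    return max(docs, key=lambda d: len(qwords & set(d.lower().split())))

query = "What happened with the transit plan?"
context = retrieve(query, documents)

# The retrieved text is stuffed into the prompt at query time --
# no retraining, no change to the model's frozen parameters
prompt = f"Use this context to answer.\nContext: {context}\nQuestion: {query}"
print(prompt)
```

The model's parameters never change; the fresh knowledge rides along in the prompt instead.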
Tokens, context windows, and temperature: the controls you need to know
These three concepts come up in every conversation about LLMs. Let's demystify each one.
Tokens: the units of language
LLMs don't read words — they read tokens. A token is a chunk of text, usually about three-quarters of a word.
| Text | Tokens | Count |
|---|---|---|
| "Hello" | ["Hello"] | 1 |
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "Artificial intelligence" | ["Art", "ificial", " intelligence"] | 3 |
| "ChatGPT" | ["Chat", "G", "PT"] | 3 |
The quick estimate: 1 word ≈ 1.33 tokens. Or: 100 words ≈ 133 tokens.
Why do you care? Because you pay per token. Every input token and every output token costs money. A verbose prompt that could be written in half the words literally costs twice as much.
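A back-of-the-envelope cost estimate using the 1 word ≈ 1.33 tokens rule. The price below is a made-up illustration, not any provider's actual rate:

```python
def estimate_tokens(text):
    """Rough rule of thumb: 1 word is about 1.33 tokens."""
    return round(len(text.split()) * 1.33)

# Hypothetical pricing for illustration only -- real per-token
# rates vary by model and provider
PRICE_PER_1K_INPUT_TOKENS = 0.005   # dollars

prompt = "Summarize the attached report in three bullet points."
tokens = estimate_tokens(prompt)
cost = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{tokens} tokens, ~${cost:.6f}")
```

The estimate is crude (real tokenizers split on subwords and punctuation), but it is good enough for budgeting, and it makes the "verbose prompts cost double" point concrete.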
Context window: the model's short-term memory
The context window is the maximum number of tokens the model can process in a single conversation — your input AND the model's output combined.
| Model | Context window | Roughly... |
|---|---|---|
| GPT-4o | 128k tokens | A 300-page book |
| Claude Sonnet | 200k tokens | A 500-page book |
| Gemini 1.5 Pro | 2M tokens | A 5,000-page book |

(Figures as of mid-2024; Gemini's context window expanded from 1M to 2M tokens that year, and model generations evolve rapidly, so verify current specs at ai.google.dev.)
If your conversation exceeds the context window, the request either fails with an error or the application starts "forgetting," silently dropping the earliest parts of the conversation. It's like a whiteboard with limited space — when you run out of room, you have to erase the oldest notes.
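The whiteboard strategy can be sketched as a simple truncation loop, assuming you track a token count per message (the numbers here are invented):

```python
CONTEXT_WINDOW = 128_000  # tokens, e.g. a 128k-token model

def fit_to_window(history_tokens, window=CONTEXT_WINDOW):
    """Whiteboard strategy: when the conversation exceeds the window,
    drop the oldest messages until everything fits again."""
    while sum(history_tokens) > window:
        history_tokens.pop(0)   # erase the oldest note first
    return history_tokens

# Each number = the token count of one message in the conversation
history = [50_000, 60_000, 40_000]   # 150k total: over budget
kept = fit_to_window(history)
print(kept, sum(kept))               # -> [60000, 40000] 100000
```

Production chat apps use smarter variants (summarizing old turns instead of deleting them), but oldest-first truncation is the baseline behavior you will observe.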
Temperature: the creativity dial
Temperature controls how "creative" or "random" the model's responses are.
| Temperature | Behavior | Use case |
|---|---|---|
| 0 | Always picks the most likely next token | Factual answers, data extraction, code |
| 0.3-0.7 | Mostly picks likely tokens, occasionally surprises | General conversation, analysis |
| 1.0+ | Everything's on the table | Creative writing, brainstorming |
Think of temperature like the "shuffle" setting on a playlist:
- Temperature 0: Plays the #1 most popular song every time. Predictable.
- Temperature 0.7: Usually plays popular songs but sometimes throws in a deep cut.
- Temperature 1.5: You might hear anything — album tracks, B-sides, experimental stuff.
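Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities: low values sharpen the distribution toward the top token, high values flatten it. A toy sampler, with made-up logits for three candidate next tokens:

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax:
    low T sharpens the distribution, high T flattens it.
    Temperature 0 is treated as greedy (always the top token)."""
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    m = max(scaled.values())                              # numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}   # softmax
    return random.choices(list(probs), weights=probs.values())[0]

# Toy next-token scores after "The capital of France is"
logits = {"Paris": 5.0, "beautiful": 3.0, "Lyon": 1.0}
print(sample_with_temperature(logits, 0))     # always 'Paris'
print(sample_with_temperature(logits, 1.5))   # occasionally a surprise
```

At temperature 0 the playlist always plays the #1 song; raise the temperature and the deep cuts start getting airtime.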
How LLMs connect to everything else
This module is the bridge between general AI knowledge and specialized skills.
No matter which track you follow from here, the concepts in this module — tokens, context windows, temperature, the training pipeline, attention — will keep coming back. You now have the vocabulary to understand them.
Back to the intern
That intern who read the entire internet? Now you understand why they're so strange.
Ask them to write a sonnet — flawless, because they read a million sonnets. Ask them what happened in the news this morning — blank stare, because their "reading" stopped at the training cutoff. Ask them to count the letters in "strawberry" — wrong, because they tokenize words, not letters. Ask them to invent a plausible-sounding court case citation — they'll do it confidently, because "plausible" is literally all they produce.
The intern is brilliant and limited in exactly the ways you'd expect from someone who learned entirely by pattern absorption and has no mechanism for saying "I don't know." Now that you understand the mechanism, you can work with it — and around it.
Key takeaways
- LLMs predict the next token, one at a time. Every capability and every failure traces back to this fundamental loop.
- "Large" means billions of parameters. More parameters = more patterns learned, but also more cost to train and run.
- Transformers use attention to track word relationships. This is why they handle language so much better than previous architectures — they can look at all words simultaneously.
- Three training stages, three purposes. Pre-training (knowledge from reading), fine-tuning (learning to be helpful), RLHF (learning to be safe and aligned with human preferences).
- Tokens are the currency. You pay per token, so knowing how to estimate token counts directly affects your costs.
- Temperature controls creativity vs. consistency. Low for facts and code, high for brainstorming and creative work.
- LLMs don't know facts — they predict patterns. That's why they hallucinate, and why you need verification for anything high-stakes.
Knowledge Check
1. What is the key innovation of the transformer architecture that made modern LLMs possible?
2. During RLHF (Reinforcement Learning from Human Feedback), what role do human reviewers play?
3. A model confidently tells a user that "the first person to walk on Mars was Neil Armstrong in 1999." What explains each part of this failure?
4. Your application sends a 150,000-token input to a model with a 128,000-token context window. What happens?