Large Language Models Demystified
The technology behind ChatGPT and Claude — explained so you actually understand it, not just use it.
The intern who read the entire internet
Imagine you hired an intern and said: "Before you start, read everything. Every book in every library. Every Wikipedia article. Every Reddit thread. Every cookbook, legal brief, and love letter ever posted online. All of it."
Two months later, the intern shows up. They can write poetry, debug code, explain quantum physics, and draft legal contracts. They've never practiced any of these skills — they just read so many examples that they absorbed the patterns.
That intern is a large language model. And the "reading everything" part? OpenAI CEO Sam Altman stated in 2023 that training GPT-4 cost more than $100 million — though OpenAI has not officially released figures. Companies are now spending billions on their next models.
The result: a system that's shockingly good at generating human-like text — and shockingly bad at things you'd consider simple, like counting the number of r's in "strawberry." Understanding why requires understanding how these models actually work.
What makes an LLM "large"
The "large" in Large Language Model refers to one thing: the number of parameters (the adjustable weights from our neural networks module).
| Model | Parameters | Analogy |
|---|---|---|
| A tiny neural network | 1,000 | A calculator |
| A medium network | 1 million | A smartphone |
| GPT-3 (2020) | 175 billion | A small city's worth of calculators |
| GPT-4 (2023) | ~1.8 trillion (unverified est.*) | Every calculator on Earth |
| Claude, Llama, Gemini | Billions to trillions | Same ballpark |
* OpenAI has not officially confirmed GPT-4's parameter count; the 1.8 trillion figure comes from unverified third-party analysis. What matters is the order of magnitude — modern frontier models operate in the hundreds of billions to potentially trillions of parameters.
More parameters = more "volume knobs" the model can adjust = more complex patterns it can learn. But there's a law of diminishing returns — going from 1 billion to 10 billion parameters makes a huge difference. Going from 100 billion to 200 billion? Smaller improvement.
The size also determines the cost. More parameters means:
- More GPUs to train (thousands of GPUs running 24/7 for months)
- More GPUs to run (every user query passes through all those parameters)
- More electricity (training a large model can use as much energy as a small town uses in a year)
There Are No Dumb Questions
"Is bigger always better?"
No. Smaller models that are well-trained on high-quality data can outperform larger models trained on mediocre data. Meta's Llama 3 8B (8 billion parameters) outperforms other models of similar or larger size on many benchmarks because it was trained on better data (per Meta's benchmark comparisons, 2024). The trend in 2024-2025 has been toward smaller, more efficient models — not just bigger ones.
"What's a parameter, again? I heard about them in the neural networks module."
Same thing — a weight in the neural network. When people say GPT-4 has 1.8 trillion parameters, they mean 1.8 trillion volume knobs that were adjusted during training. Each knob controls the strength of one connection between neurons.
The transformer: the architecture that changed everything
Every modern LLM is built on an architecture called the transformer, invented by Google researchers in 2017 (Vaswani et al., 2017). The key innovation: attention.
Before transformers, language models read text one word at a time, left to right — like reading through a straw. They struggled with long sentences because by the time they reached the end, they'd forgotten the beginning.
The transformer reads all words simultaneously and uses attention to decide which words matter most for each other word. Think of attention as a highlighter:
You're reading this sentence: "The cat sat on the mat because it was tired."
What does "it" refer to? The cat or the mat? You know instantly — because you pay attention to the relationship between "it" and "cat" (not "mat," because mats don't get tired).
The transformer does the same thing, mathematically. For every word, it calculates an "attention score" with every other word, figuring out which relationships matter.
This attention mechanism is why transformers are so good at language. They can track relationships across entire paragraphs, pages, even book-length inputs — something previous architectures couldn't do.
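The attention calculation can be sketched in a few lines. This is a toy version of scaled dot-product attention in plain NumPy — real transformers add learned projection matrices for Q, K, and V, plus multiple attention heads, but the core "score every word against every other word" step looks like this:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each row of Q scores every row of K,
    and the softmaxed scores weight the rows of V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # how much each word "looks at" every other word
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

# 3 "words", each represented as a 4-dimensional toy embedding
np.random.seed(0)
x = np.random.randn(3, 4)
out, w = attention(x, x, x)   # self-attention: Q, K, V all derived from the same words
print(w.round(2))             # each row = one word's attention spread over all three words
```

Each row of `w` is one word's "highlighter budget": it always sums to 1, split across every word in the input.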
How an LLM gets built: the training pipeline
Building an LLM happens in three stages. Each stage serves a different purpose, and each one changes the model's behavior in a specific way.
Stage 1: Pre-training — reading the internet
The model reads trillions of tokens of text — books, websites, code repositories, scientific papers. Its only job during this stage: predict the next word.
Given "The Eiffel Tower is located in ___," the model learns that "Paris" is the most likely next word. It repeats this prediction billions of times, across an enormous variety of text sequences.
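Here is the pre-training objective in miniature: a bigram counter over a tiny made-up corpus. Real models condition on thousands of tokens of context rather than one word, and learn with a neural network rather than a lookup table — but "predict the next word from observed patterns" is the same idea:

```python
from collections import Counter, defaultdict

# A made-up mini "internet" to learn from
corpus = ("the eiffel tower is located in paris . "
          "the eiffel tower is in france . "
          "the capital of france is paris .").split()

# Count which word follows which (one word of context;
# real LLMs condition on thousands of tokens)
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(word):
    """Return the most frequently observed next word after `word`."""
    return follows[word].most_common(1)[0][0]

print(predict("eiffel"))   # -> 'tower'
```

Notice there is no "fact" stored anywhere — only counts of which word tends to follow which. That distinction matters later when we talk about hallucinations.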
After pre-training, the model is like a very well-read person with no social skills. It can complete any text pattern it's seen, but if you ask it a question, it might respond with another question, or continue your text as if it's writing an essay, or do something completely unhelpful.
Stage 2: Fine-tuning — learning to be helpful
Humans write thousands of example conversations:
- User: What's the capital of France?
- Assistant: The capital of France is Paris.
The model trains on these examples to learn the format of being a helpful assistant — answer questions directly, be clear and concise, use a conversational tone. This stage is like teaching that well-read person how to actually have a conversation instead of just reciting encyclopedia entries.
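Fine-tuning data is commonly stored as structured conversations. Here is a sketch of a single training example, assuming a simple JSONL-style layout — the field names are illustrative, not any vendor's exact schema:

```python
import json

# One training example: a conversation the model learns to imitate.
# During fine-tuning, the model is trained to produce the assistant
# turn, given the turns that came before it.
example = {
    "messages": [
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# A fine-tuning dataset is thousands of lines like this (JSONL: one JSON object per line)
line = json.dumps(example)
print(line)
```

Multiply this by thousands of curated conversations and you have a fine-tuning dataset.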
Stage 3: RLHF — learning from human feedback
RLHF stands for Reinforcement Learning from Human Feedback. Here's how it works:
- The model generates two different responses to the same question
- A human reviewer picks which response is better
- The model adjusts to produce more responses like the winner
Think of it as a teacher grading essays. The teacher doesn't write the essay for the student — they just say "this one is better than that one, and here's why." Over thousands of comparisons, the student (model) learns what "good" looks like.
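The "this one is better" signal is typically converted into a number by a reward model trained on those comparisons. A minimal sketch of the preference loss behind that idea — a Bradley-Terry-style formulation, with made-up reward values:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry-style loss used to train a reward model:
    small when the human-preferred response scores higher than the
    rejected one, large when the reward model ranks them backwards."""
    margin = reward_chosen - reward_rejected
    sigmoid = 1 / (1 + math.exp(-margin))   # probability the ranking is "right"
    return -math.log(sigmoid)

# Reward model already rates the chosen answer higher -> low loss
print(round(preference_loss(2.0, -1.0), 3))   # -> 0.049
# Rates them backwards -> high loss, a strong push to adjust
print(round(preference_loss(-1.0, 2.0), 3))   # -> 3.049
```

Over thousands of comparisons, minimizing this loss is the mathematical version of the teacher saying "this essay is better than that one."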
RLHF is the stage that teaches models to:
- Refuse harmful requests ("I can't help you build a weapon")
- Admit uncertainty ("I'm not sure about that")
- Stay on topic instead of rambling
- Be genuinely helpful rather than just technically correct
There Are No Dumb Questions
"If the model learned everything from the internet, does it 'know' everything on the internet?"
No. It learned patterns from the internet, not facts. It knows that "Paris" often follows "capital of France" because that pattern appeared millions of times. But it doesn't have a database of facts it can look up. That's why it sometimes confidently states things that are wrong — the pattern-matching produced a plausible-sounding but incorrect result.
"Why do companies keep training new models instead of just updating the old ones?"
Pre-training is a one-shot deal — you can't easily add new knowledge to a pre-trained model. The model's knowledge is frozen at the time of training. To include knowledge about events after the training cutoff, companies either retrain from scratch (expensive) or use techniques like RAG (retrieval-augmented generation) to give the model access to current information at query time.
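The RAG idea fits in a few lines: fetch relevant text at query time and prepend it to the prompt, so the model can use information newer than its training cutoff. The word-overlap "retrieval" below is a deliberate toy stand-in for the embedding-based search real systems use:

```python
# Made-up documents representing post-cutoff information
documents = [
    "2025-06-01: The city council approved the new transit plan.",
    "2024-11-12: The library extended its weekend hours.",
]

def retrieve(query, docs):
    """Toy retrieval: rank documents by word overlap with the query.
    Real RAG systems use embedding similarity instead."""
    qwords = set(query.lower().split())
    return max(docs, key=lambda d: len(qwords & set(d.lower().split())))

query = "What happened with the transit plan?"
context = retrieve(query, documents)

# The retrieved text is stuffed into the prompt at query time --
# no retraining, no change to the model's frozen parameters
prompt = f"Use this context to answer.\nContext: {context}\nQuestion: {query}"
print(prompt)
```

The model's parameters never change; the fresh knowledge rides along in the prompt instead.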
Tokens, context windows, and temperature: the controls you need to know
These three concepts come up in every conversation about LLMs. Let's demystify each one.
Tokens: the units of language
LLMs don't read words — they read tokens. A token is a chunk of text, usually about three-quarters of a word.
| Text | Tokens | Count |
|---|---|---|
| "Hello" | ["Hello"] | 1 |
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "Artificial intelligence" | ["Art", "ificial", " intelligence"] | 3 |
| "ChatGPT" | ["Chat", "G", "PT"] | 3 |
The quick estimate: 1 word ≈ 1.33 tokens. Or: 100 words ≈ 133 tokens.
Why do you care? Because you pay per token. Every input token and every output token costs money. A verbose prompt that could be written in half the words literally costs twice as much.
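A back-of-the-envelope cost estimate using the 1 word ≈ 1.33 tokens rule. The price below is a made-up illustration, not any provider's actual rate:

```python
def estimate_tokens(text):
    """Rough rule of thumb: 1 word is about 1.33 tokens."""
    return round(len(text.split()) * 1.33)

# Hypothetical pricing for illustration only -- real per-token
# rates vary by model and provider
PRICE_PER_1K_INPUT_TOKENS = 0.005   # dollars

prompt = "Summarize the attached report in three bullet points."
tokens = estimate_tokens(prompt)
cost = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{tokens} tokens, ~${cost:.6f}")
```

The estimate is crude (real tokenizers split on subwords and punctuation), but it is good enough for budgeting, and it makes the "verbose prompts cost double" point concrete.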
Context window: the model's short-term memory
The context window is the maximum number of tokens the model can process in a single conversation — your input AND the model's output combined.
| Model | Context window | Roughly... |
|---|---|---|
| GPT-4o | 128k tokens | A 300-page book |
| Claude Sonnet | 200k tokens | A 500-page book |
| Gemini 1.5 Pro | 2M tokens | A 5,000-page book |

(Figures as of mid-2024; Gemini's context window expanded from 1M to 2M tokens that year, and model generations evolve rapidly, so verify current specs at ai.google.dev.)
If your conversation exceeds the context window, the request either fails with an error or the application starts "forgetting," silently dropping the earliest parts of the conversation. It's like a whiteboard with limited space — when you run out of room, you have to erase the oldest notes.
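The whiteboard strategy can be sketched as a simple truncation loop, assuming you track a token count per message (the numbers here are invented):

```python
CONTEXT_WINDOW = 128_000  # tokens, e.g. a 128k-token model

def fit_to_window(history_tokens, window=CONTEXT_WINDOW):
    """Whiteboard strategy: when the conversation exceeds the window,
    drop the oldest messages until everything fits again."""
    while sum(history_tokens) > window:
        history_tokens.pop(0)   # erase the oldest note first
    return history_tokens

# Each number = the token count of one message in the conversation
history = [50_000, 60_000, 40_000]   # 150k total: over budget
kept = fit_to_window(history)
print(kept, sum(kept))               # -> [60000, 40000] 100000
```

Production chat apps use smarter variants (summarizing old turns instead of deleting them), but oldest-first truncation is the baseline behavior you will observe.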
Temperature: the creativity dial
Temperature controls how "creative" or "random" the model's responses are.
| Temperature | Behavior | Use case |
|---|---|---|
| 0 | Always picks the most likely next token | Factual answers, data extraction, code |
| 0.3-0.7 | Mostly picks likely tokens, occasionally surprises | General conversation, analysis |
| 1.0+ | Everything's on the table | Creative writing, brainstorming |
Think of temperature like the "shuffle" setting on a playlist:
- Temperature 0: Plays the #1 most popular song every time. Predictable.
- Temperature 0.7: Usually plays popular songs but sometimes throws in a deep cut.
- Temperature 1.5: You might hear anything — album tracks, B-sides, experimental stuff.
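Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities: low values sharpen the distribution toward the top token, high values flatten it. A toy sampler, with made-up logits for three candidate next tokens:

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax:
    low T sharpens the distribution, high T flattens it.
    Temperature 0 is treated as greedy (always the top token)."""
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    m = max(scaled.values())                              # numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}   # softmax
    return random.choices(list(probs), weights=probs.values())[0]

# Toy next-token scores after "The capital of France is"
logits = {"Paris": 5.0, "beautiful": 3.0, "Lyon": 1.0}
print(sample_with_temperature(logits, 0))     # always 'Paris'
print(sample_with_temperature(logits, 1.5))   # occasionally a surprise
```

At temperature 0 the playlist always plays the #1 song; raise the temperature and the deep cuts start getting airtime.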
How LLMs connect to everything else
This module is the bridge between general AI knowledge and specialized skills.
No matter which track you follow from here, the concepts in this module — tokens, context windows, temperature, the training pipeline, attention — will keep coming back. You now have the vocabulary to understand them.
Back to the intern
That intern who read the entire internet? Now you understand why they're so strange.
Ask them to write a sonnet — flawless, because they read a million sonnets. Ask them what happened in the news this morning — blank stare, because their "reading" stopped at the training cutoff. Ask them to count the letters in "strawberry" — wrong, because they tokenize words, not letters. Ask them to invent a plausible-sounding court case citation — they'll do it confidently, because "plausible" is literally all they produce.
The intern is brilliant and limited in exactly the ways you'd expect from someone who learned entirely by pattern absorption and has no mechanism for saying "I don't know." Now that you understand the mechanism, you can work with it — and around it.
Key takeaways
- LLMs predict the next token, one at a time. Every capability and every failure traces back to this fundamental loop.
- "Large" means billions of parameters. More parameters = more patterns learned, but also more cost to train and run.
- Transformers use attention to track word relationships. This is why they handle language so much better than previous architectures — they can look at all words simultaneously.
- Three training stages, three purposes. Pre-training (knowledge from reading), fine-tuning (learning to be helpful), RLHF (learning to be safe and aligned with human preferences).
- Tokens are the currency. You pay per token, so knowing how to estimate token counts directly affects your costs.
- Temperature controls creativity vs. consistency. Low for facts and code, high for brainstorming and creative work.
- LLMs don't know facts — they predict patterns. That's why they hallucinate, and why you need verification for anything high-stakes.
Knowledge Check
1. What is the key innovation of the transformer architecture that made modern LLMs possible?
2. During RLHF (Reinforcement Learning from Human Feedback), what role do human reviewers play?
3. A model confidently tells a user that "the first person to walk on Mars was Neil Armstrong in 1999." What explains each part of this failure?
4. Your application sends a 150,000-token input to a model with a 128,000-token context window. What happens?