Context Engineering
Learn how to decide what information to put in the prompt, where to retrieve it from, and how to arrange it so the model can use it.
The model only knows what you tell it
Here's a conversation that happens every day:
Engineer: "Why is the bot making up pricing info?"
Also the engineer: has never once put the pricing page in the prompt.
An LLM doesn't have a database. It doesn't browse the web during your API call. The only information it can use is what's sitting inside the prompt right now. If the pricing changed last week and you didn't give it the updated document, it will cheerfully invent a number — and sound 100% confident doing it.
Context engineering is the discipline of deciding: What information goes into the prompt? Where do you get it? How do you arrange it so the model actually uses it?
Get this right, and your 30%-error-rate bot becomes a 4%-error-rate bot. Get it wrong, and no amount of prompt engineering will save you.
RAG: Give the model real documents to read
Retrieval-Augmented Generation (RAG) is the most common solution. Instead of hoping the model "remembers" the right answer from training, you look up the relevant information and inject it into the prompt before the model generates a response.
Think of it like an open-book exam vs. a closed-book exam:
- Without RAG = closed-book. The model relies on what it memorised during training. Hope it studied your pricing page.
- With RAG = open-book. You hand the model the exact pages it needs. Much harder to get wrong.
How RAG works (two phases)
Phase 1: Index time — preparing your documents
This happens once (and again whenever documents change):
1. Chunk your documents. Split each document into pieces of ~500 tokens. Why? Because you won't send the whole document — just the relevant pieces.

2. Overlap the chunks. Each chunk should overlap the previous one by about 10% (~50 tokens). This prevents important sentences from being sliced in half at a boundary.

3. Embed each chunk. An embedding model converts each text chunk into a list of numbers (a "vector") that captures its meaning. Similar meanings = similar numbers.

4. Store in a vector database. Save each vector alongside the original text and metadata (source URL, page number, etc.).
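The chunking-with-overlap step above can be sketched in a few lines. This is a minimal illustration, not a production chunker: tokens are approximated by whitespace-separated words (a real pipeline would count with the model's tokenizer), and the function name is ours.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into ~chunk_size-token chunks, each overlapping the
    previous one by `overlap` tokens. Tokens approximated by words."""
    words = text.split()
    step = chunk_size - overlap          # advance 450 words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                        # last chunk reached the end of the doc
    return chunks
```

With the defaults, a 1,200-word document yields three chunks, and the last 50 words of each chunk reappear at the start of the next — which is exactly what protects sentences at the boundaries.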
Phase 2: Query time — answering a question
This happens on every user request:
1. Embed the question. Use the same embedding model to convert the user's question into a vector.

2. Search for similar chunks. Find the 3-5 chunks in your database whose vectors are closest to the question's vector. "Closest" means most similar in meaning.

3. Build the prompt. Stuff those chunks into the system prompt inside a clearly labelled "context" block.

4. Generate. The LLM reads the context and the question, then generates an answer grounded in real documents.
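Here is a toy version of that query-time loop, with hand-made 3-number vectors standing in for a real embedding model and a plain list standing in for a vector database. All names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction (same meaning), 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec, store, k=4):
    """Return the text of the k chunks whose vectors are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

def build_prompt(question, chunks):
    """Stuff retrieved chunks into a clearly labelled context block."""
    context = "\n\n".join(chunks)
    return (
        "Answer ONLY based on the context below.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )
```

A real vector database (pgvector, for instance) does the same cosine-style comparison, just with approximate-nearest-neighbour indexes so it stays fast over millions of chunks.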
There Are No Dumb Questions
"Why not just send ALL the documents every time?"
Cost and accuracy. Sending 500 articles = ~1,000,000 tokens per query = ~$3.00 per question. With RAG, you send 4 chunks × 500 tokens = 2,000 tokens = ~$0.006. That's 500× cheaper. (based on early 2025 Sonnet input pricing — verify at anthropic.com/pricing) Plus, models are more accurate when they get only relevant information — dumping everything in creates noise that dilutes the signal.
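The arithmetic behind that claim, spelled out. The $3-per-million-input-tokens rate is the same assumed early-2025 figure as above — verify current pricing before relying on it.

```python
price_per_input_token = 3.00 / 1_000_000        # assumed $3 / 1M input tokens

full_dump_cost = 1_000_000 * price_per_input_token  # all 500 articles, every query
rag_cost = 4 * 500 * price_per_input_token          # 4 chunks x 500 tokens each

savings = full_dump_cost / rag_cost                 # ~500x cheaper per query
```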
"Why do I need a special 'embedding model'? Can't I just search by keywords?"
Keyword search finds exact word matches. Embedding search finds meaning matches. A user asks "How do I change my password?" — keyword search might miss your article titled "Account Security Settings" because it doesn't contain the word "password." Embedding search finds it because the meanings are similar.
Real example: Maria's support bot
Maria, a backend engineer, shipped a support bot covering 500 help articles. Before RAG: the bot hallucinated 30% of pricing answers because the model's training data was months out of date.
After adding RAG:
| Setting | Value | Why |
|---|---|---|
| Chunk size | 500 tokens | Small enough to be specific, large enough to have context |
| Overlap | 50 tokens (10%) | Prevents facts from being cut at chunk boundaries |
| Embedding model | text-embedding-3-small | Good quality, low cost |
| Top-k | 4 chunks retrieved | Enough context without adding noise |
| Storage | pgvector (PostgreSQL extension) | Uses existing Postgres infra |
Result: Error rate dropped from 30% to under 4%. Each query costs $0.006 instead of $3.00 (based on early 2025 Sonnet input pricing — verify at anthropic.com/pricing).
The instruction that made it work: In the system prompt, Maria added:
Answer ONLY based on the context provided below.
If the context doesn't contain the answer, say "I don't have that information."
Do NOT use your training knowledge for factual claims.
That one instruction — "answer only from context" — is what stops the hallucination. Without it, the model will happily mix real retrieved facts with made-up ones.
The three knobs you need to tune
RAG has three settings that control quality. Turn them wrong and your bot gives bad answers. Turn them right and it's nearly as good as a human.
Knob 1: Chunk size
| Too small (100 tokens) | Just right (300-500 tokens) | Too big (2,000 tokens) |
|---|---|---|
| Each chunk has so little context it's useless on its own | Each chunk is a self-contained idea with enough detail | Chunks include lots of irrelevant text alongside the relevant bits |
Knob 2: Top-k (how many chunks to retrieve)
| Too low (k=1) | Just right (k=3-5) | Too high (k=20) |
|---|---|---|
| Misses relevant information that's spread across multiple chunks | Gets enough context without overwhelming the model | Adds noise, increases cost, and can confuse the model |
Knob 3: Overlap
| No overlap (0%) | Just right (10-15%) | Too much overlap (50%) |
|---|---|---|
| Important sentences get sliced at chunk boundaries and lost | Edge sentences appear in both chunks, nothing gets lost | Wastes storage and retrieves duplicate information |
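Gathered in one place, the "just right" column from each table looks like this — a hypothetical config object for your own pipeline, not any library's API:

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Sensible starting defaults; tune from here against your own eval set."""
    chunk_size: int = 500   # tokens per chunk (300-500 works for most corpora)
    overlap: int = 50       # ~10% of chunk_size
    top_k: int = 4          # chunks retrieved per query (3-5 is typical)
```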
Beyond basic RAG: when simple retrieval isn't enough
Sometimes a question can't be answered from a single retrieval pass. Two patterns handle this:
Multi-hop retrieval: The answer spans multiple documents. The system breaks the question into sub-questions, retrieves chunks for each, and combines them. Example: "Compare our refund policy with our competitor's" requires retrieving from two different sources.
Re-ranking: The initial retrieval returns 20 candidates. A second, more accurate model (a "cross-encoder") re-scores them and keeps only the best 5. This is slower but more precise — useful when retrieval quality matters more than speed.
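The retrieve-then-rerank shape can be sketched like this. The cheap word-overlap scorer below is only a stand-in for a real cross-encoder, which would be a model call (e.g. via the sentence-transformers library); the function names are ours.

```python
def overlap_score(question, chunk):
    """Toy stand-in for a cross-encoder: count words shared with the question."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def rerank(question, candidates, score_fn=overlap_score, keep=5):
    """Re-score candidates (e.g. 20 chunks from fast vector search) with a
    slower, more accurate scorer and keep only the best `keep`."""
    ranked = sorted(candidates, key=lambda c: score_fn(question, c), reverse=True)
    return ranked[:keep]
```

The design point is the two-stage funnel: a fast but rough retriever narrows millions of chunks to ~20, then an accurate but slow scorer narrows 20 to 5.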
Back to the pricing hallucination
The engineer from the opening finally put the pricing page in the prompt.
It took 20 minutes to add a RAG step that fetched the current pricing document on every request. Hallucination on pricing questions dropped from 31% to 2%, and the bot's overall error rate fell from 30% to 4%.
"Why was it making stuff up?" became "Why wasn't I doing this from the start?"
Key takeaways
- The prompt is the model's entire world. If the information isn't in there, the model will make it up. RAG puts real documents in the prompt.
- Chunk size of 300-500 tokens, overlap of 10-15%, top-k of 3-5 — these defaults work for most RAG systems. Tune from there.
- "Answer only from context" — this one system prompt instruction is the difference between a bot that hallucinates and one that doesn't.
- RAG is 500× cheaper than stuffing all documents into every prompt. It's not just more accurate — it's dramatically cheaper.
Knowledge Check
1. A user asks a question whose answer spans three separate document chunks retrieved independently. What architectural pattern addresses this problem?
2. What is the practical trade-off between text-embedding-3-small and text-embedding-3-large?
3. A model has a 128k token context window. Why is retrieval still often preferable to stuffing all documents into the context?
4. What does re-ranking mean in a RAG pipeline, and which step does it replace or augment?