
© 2026 Octo

Building AI-Powered Products
1. LLMs Are Next-Token Predictors — Everything Follows From That
2. Prompt Design
3. Context Engineering
4. API Integration: Retries, Backoff, and Graceful Fallbacks
5. Evals: Measure Before You Improve
6. Agent Architecture
7. Safety & Guardrails: Defense in Depth
8. Production AI — Latency, Cost, and Observability
Module 3 · ~20 min

Context Engineering

Learn how to decide what information to put in the prompt, where to retrieve it from, and how to arrange it so the model can use it.

The model only knows what you tell it

Here's a conversation that happens every day:

Engineer: "Why is the bot making up pricing info?"
Also the engineer: never put the pricing page in the prompt.

An LLM doesn't have a database. It doesn't browse the web during your API call. The only information it can use is what's sitting inside the prompt right now. If the pricing changed last week and you didn't give it the updated document, it will cheerfully invent a number — and sound 100% confident doing it.

Context engineering is the discipline of deciding: What information goes into the prompt? Where do you get it? How do you arrange it so the model actually uses it?

Get this right, and your 30%-error-rate bot becomes a 4%-error-rate bot. Get it wrong, and no amount of prompt engineering will save you.

[Interactive widget: context-window size slider, 512 to 128,000 tokens. The standard setting of 4,096 tokens handles most documents and conversations.]

RAG: Give the model real documents to read

Retrieval-Augmented Generation (RAG) is the most common solution. Instead of hoping the model "remembers" the right answer from training, you look up the relevant information and inject it into the prompt before the model generates a response.

Think of it like an open-book exam vs. a closed-book exam:

  • Without RAG = closed-book. The model relies on what it memorised during training. Hope it studied your pricing page.
  • With RAG = open-book. You hand the model the exact pages it needs. Much harder to get wrong.

How RAG works (two phases)

Phase 1: Index time — preparing your documents

This happens once (and again whenever documents change):

  1. Chunk your documents. Split each document into pieces of ~500 tokens each. Why? Because you won't send the whole document — just the relevant pieces.

  2. Overlap the chunks. Each chunk should overlap the previous one by about 10% (~50 tokens). This prevents important sentences from being sliced in half at a boundary.

  3. Embed each chunk. An embedding model converts each text chunk into a list of numbers (a "vector") that captures its meaning. Similar meanings = similar numbers.

  4. Store in a vector database. Save each vector alongside the original text and metadata (source URL, page number, etc.).
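A minimal Python sketch of the index-time steps, assuming the document is already tokenised into a list. The embedding and storage calls (steps 3 and 4) appear only as comments because `embedding_model` and `vector_db` are hypothetical placeholders for your provider and database:

```python
def chunk_text(tokens, chunk_size=500, overlap=50):
    """Steps 1 and 2: split a token list into overlapping chunks."""
    chunks = []
    step = chunk_size - overlap          # advance by chunk_size minus the overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # last chunk already reaches the end
    return chunks

# Steps 3 and 4 (sketch only -- depends on your provider and database):
#   for chunk in chunk_text(doc_tokens):
#       vector = embedding_model.embed(chunk)           # hypothetical call
#       vector_db.insert(vector, text=chunk, meta=...)  # hypothetical call
```

With the defaults above, a 1,200-token document yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next.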

Phase 2: Query time — answering a question

This happens on every user request:

  1. Embed the question. Use the same embedding model to convert the user's question into a vector.

  2. Search for similar chunks. Find the 3-5 chunks in your database whose vectors are closest to the question's vector. "Closest" means most similar in meaning.

  3. Build the prompt. Stuff those chunks into the system prompt inside a clearly labelled "context" block.

  4. Generate. The LLM reads the context and the question, then generates an answer grounded in real documents.
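Steps 1 and 2 of query time can be sketched with plain cosine similarity over an in-memory list of (vector, text) pairs; a real vector database performs the same nearest-neighbour search at scale:

```python
import math

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more similar in meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=4):
    """index is a list of (vector, chunk_text) pairs; return the k closest chunks."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _vec, text in ranked[:k]]
```

The returned chunks then go into the prompt's context block (step 3) before generation (step 4).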

💭 There Are No Dumb Questions

"Why not just send ALL the documents every time?"

Cost and accuracy. Sending all 500 articles (~2,000 tokens each) ≈ 1,000,000 tokens per query ≈ $3.00 per question. With RAG, you send 4 chunks × 500 tokens = 2,000 tokens ≈ $0.006. That's 500× cheaper (based on early-2025 Sonnet input pricing — verify at anthropic.com/pricing). Plus, models are more accurate when they get only relevant information — dumping everything in creates noise that dilutes the signal.
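The arithmetic above reduces to a single pricing formula, sketched here with the early-2025 rate as a default (verify the current price before relying on it):

```python
def query_cost_usd(input_tokens, price_per_million=3.00):
    """Input-token cost at a flat per-million-token rate.

    The default rate reflects early-2025 Sonnet input pricing;
    check the current price list before using this in earnest.
    """
    return input_tokens / 1_000_000 * price_per_million
```

For example, `query_cost_usd(4 * 500)` gives the ~$0.006 RAG cost, and `query_cost_usd(500 * 2000)` gives the ~$3.00 everything-in-the-prompt cost.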

"Why do I need a special 'embedding model'? Can't I just search by keywords?"

Keyword search finds exact word matches. Embedding search finds meaning matches. A user asks "How do I change my password?" — keyword search might miss your article titled "Account Security Settings" because it doesn't contain the word "password." Embedding search finds it because the meanings are similar.

⚠️More context isn't always better
Stuffing 100,000 tokens of documents into a context window doesn't guarantee the model reads all of it well. Research shows models recall information at the beginning and end of the context better than information in the middle — the "lost in the middle" problem (Liu et al., 2023). Retrieve the most relevant 3–10 documents rather than everything remotely related.

Real example: Maria's support bot

Maria, a backend engineer, shipped a support bot covering 500 help articles. Before RAG: the bot hallucinated 30% of pricing answers because the model's training data was months out of date.

After adding RAG:

| Setting | Value | Why |
| --- | --- | --- |
| Chunk size | 500 tokens | Small enough to be specific, large enough to have context |
| Overlap | 50 tokens (10%) | Prevents facts from being cut at chunk boundaries |
| Embedding model | text-embedding-3-small | Good quality, low cost |
| Top-k | 4 chunks retrieved | Enough context without adding noise |
| Storage | pgvector (PostgreSQL extension) | Uses existing Postgres infra |

Result: Error rate dropped from 30% to under 4%. Each query costs $0.006 instead of $3.00 (based on early 2025 Sonnet input pricing — verify at anthropic.com/pricing).

The instruction that made it work: In the system prompt, Maria added:

Answer ONLY based on the context provided below.
If the context doesn't contain the answer, say "I don't have that information."
Do NOT use your training knowledge for factual claims.

That one instruction — "answer only from context" — is what stops the hallucination. Without it, the model will happily mix real retrieved facts with made-up ones.
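A sketch of how that grounded system prompt might be assembled from retrieved chunks. The `<context>` tags and the `---` separator are illustrative conventions, not a required format:

```python
GROUNDING_INSTRUCTION = (
    "Answer ONLY based on the context provided below.\n"
    "If the context doesn't contain the answer, "
    'say "I don\'t have that information."\n'
    "Do NOT use your training knowledge for factual claims."
)

def build_system_prompt(chunks):
    """Join retrieved chunks into a labelled context block under the grounding rule."""
    context = "\n\n---\n\n".join(chunks)
    return f"{GROUNDING_INSTRUCTION}\n\n<context>\n{context}\n</context>"
```

The instruction comes first so the model reads the grounding rule before it reads any facts.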

⚡ RAG Math (25 XP)

Maria's support bot uses the settings above. Calculate the costs:

1. **Per query with RAG:** 4 chunks × 500 tokens = ? tokens. At $3/million tokens (Sonnet), cost = ?
2. **Per query WITHOUT RAG** (sending all 500 articles, avg 2,000 tokens each): 500 × 2,000 = ? tokens. Cost = ?
3. **Ratio:** How many times more expensive is the no-RAG approach?

_Hint: Work through #1 step by step: chunks × tokens per chunk = total tokens. Then apply the pricing formula: (total tokens ÷ 1,000,000) × price per million. Use that same formula for #2 — the numbers will be very different._

The three knobs you need to tune

RAG has three settings that control quality. Turn them wrong and your bot gives bad answers. Turn them right and it's nearly as good as a human.

Knob 1: Chunk size

  • Too small (100 tokens): each chunk has so little context it's useless on its own.
  • Just right (300–500 tokens): each chunk is a self-contained idea with enough detail.
  • Too big (2,000 tokens): chunks include lots of irrelevant text alongside the relevant bits.

Knob 2: Top-k (how many chunks to retrieve)

  • Too low (k=1): misses relevant information that's spread across multiple chunks.
  • Just right (k=3–5): gets enough context without overwhelming the model.
  • Too high (k=20): adds noise, increases cost, and can confuse the model.

Knob 3: Overlap

  • No overlap (0%): important sentences get sliced at chunk boundaries and lost.
  • Just right (10–15%): edge sentences appear in both chunks, so nothing gets lost.
  • Too much overlap (50%): wastes storage and retrieves duplicate information.
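Since the three knobs travel together, it can help to keep them in one configuration object. A minimal sketch using the defaults suggested above:

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """The three RAG tuning knobs, with the defaults this module recommends."""
    chunk_size: int = 400   # tokens per chunk; 300-500 is the sweet spot
    overlap: int = 50       # roughly 10-15% of chunk_size
    top_k: int = 4          # chunks retrieved per query; 3-5 is typical
```

Threading one config object through chunking, indexing, and retrieval keeps the knobs consistent and makes A/B-testing a setting a one-line change.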

⚡ RAG Doctor (50 XP)

Your RAG bot is sick. Diagnose the problem and prescribe the fix.

**Symptom 1:** A user asks "How do I reset my password?" and the answer blends password-reset info AND billing history. Three possible treatments:

  • (a) Shrink chunk size from 1,500 tokens to 300 tokens
  • (b) Raise top-k from 1 to 5
  • (c) Add a metadata filter so only chunks tagged "authentication" are retrieved for password questions

**Which treatment fixes this symptom?** Write the letter and explain why.

**Symptom 2:** A user asks about a policy that spans three paragraphs across two pages. The bot only gives a partial answer. Possible treatments:

  • (a) Increase top-k from 2 to 5
  • (b) Set chunk overlap to 0%
  • (c) Reduce chunk size to 50 tokens

**Which treatment fixes this symptom?** Write the letter and explain why.

_Hint: For each symptom, diagnose the direction of the problem first — is the system pulling too much irrelevant content, or too little relevant content? Then look at which knob controls the quantity vs. the quality of what gets retrieved._

Beyond basic RAG: when simple retrieval isn't enough

Sometimes a question can't be answered from a single retrieval pass. Two patterns handle this:

Multi-hop retrieval: The answer spans multiple documents. The system breaks the question into sub-questions, retrieves chunks for each, and combines them. Example: "Compare our refund policy with our competitor's" requires retrieving from two different sources.
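A skeleton of the multi-hop pattern. Here `decompose`, `retrieve`, and `combine` stand in for LLM-backed or search-backed functions you would supply; only the control flow is shown:

```python
def multi_hop_answer(question, decompose, retrieve, combine):
    """Break a question into sub-questions, retrieve evidence for each, merge.

    decompose(question) -> list of sub-question strings (hypothetical, e.g. LLM-backed)
    retrieve(sub_q)     -> list of relevant chunks (hypothetical vector search)
    combine(q, chunks)  -> final answer built from all gathered evidence
    """
    evidence = []
    for sub_question in decompose(question):
        evidence.extend(retrieve(sub_question))
    return combine(question, evidence)
```

For the refund-policy comparison, `decompose` would emit one sub-question per source, so each retrieval pass searches for one policy instead of muddling both in a single query.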

Re-ranking: The initial retrieval returns 20 candidates. A second, more accurate model (a "cross-encoder") re-scores them and keeps only the best 5. This is slower but more precise — useful when retrieval quality matters more than speed.
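The re-ranking step itself reduces to one sorted pass. Here `score` stands in for a hypothetical cross-encoder that rates each (question, chunk) pair; the expensive part in practice is that model call, not the sort:

```python
def rerank(question, candidates, score, keep=5):
    """Re-score candidate chunks with a more accurate model, keep only the best.

    score(question, chunk) -> float relevance score (hypothetical cross-encoder)
    """
    ranked = sorted(candidates, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:keep]
```

The usual shape is cheap vector search for the first 20 candidates, then this precise second pass over that small set.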

Back to the pricing hallucination

The engineer from the opening finally put the pricing page in the prompt.

It took 20 minutes to add a RAG step that fetched the current pricing document on every request. Hallucination rate on pricing questions dropped from 31% to 2%. The bot's error rate fell from 30% to 4%, matching the module's opening promise.

"Why was it making stuff up?" became "Why wasn't I doing this from the start?"


Key takeaways

  • The prompt is the model's entire world. If the information isn't in there, the model will make it up. RAG puts real documents in the prompt.
  • Chunk size of 300-500 tokens, overlap of 10-15%, top-k of 3-5 — these defaults work for most RAG systems. Tune from there.
  • "Answer only from context" — this one system prompt instruction is the difference between a bot that hallucinates and one that doesn't.
  • RAG is 500× cheaper than stuffing all documents into every prompt. It's not just more accurate — it's dramatically cheaper.

Knowledge Check

1. A user asks a question whose answer spans three separate document chunks retrieved independently. What architectural pattern addresses this problem?

2. What is the practical trade-off between text-embedding-3-small and text-embedding-3-large?

3. A model has a 128k token context window. Why is retrieval still often preferable to stuffing all documents into the context?

4. What does re-ranking mean in a RAG pipeline, and which step does it replace or augment?

Previous: Prompt Design
Next: API Integration: Retries, Backoff, and Graceful Fallbacks