API Integration: Retries, Backoff, and Graceful Fallbacks
Learn how to handle timeouts, rate limits, and server errors in production AI API calls using exponential backoff and graceful fallbacks.
The day the chatbot crashed (and it wasn't the model's fault)
Priya's team shipped a customer-support chatbot. Worked great at 200 requests/day. Then marketing sent an email blast and traffic jumped to 3,000 requests/hour.
The chatbot didn't just slow down — it died. Every single user saw an error page for 40 minutes. The CEO was not happy.
What happened? The API returned a "429 — Too Many Requests" error. Priya's code had no idea what to do with that error, so it crashed. Not a model problem. Not a prompt problem. A plumbing problem.
Most production AI outages are plumbing problems. The model works fine — but the code that calls the model falls apart when things go wrong.
The three types of API errors (and what to do about each)
Every API error belongs to one of three categories. Memorise this and you'll never be confused by an HTTP status code again:
| Category | Status codes | Who caused it? | What to do |
|---|---|---|---|
| Your fault | 400, 401, 403, 422 | You sent a bad request | Fix it. Don't retry — it'll fail the same way every time. |
| Server overloaded | 429, 529 | Too many requests | Wait and retry. The server is fine, it's just busy. |
| Server broken | 500, 502, 503 | Something crashed on their end | Retry with caution. Might work on the next try, might not. |
Think of it like calling a restaurant:
- 400 (your fault): "Sorry, we don't serve breakfast." Calling back won't change the menu.
- 429 (overloaded): "All tables are full right now." Call back in 20 minutes and you'll get a table.
- 500 (broken): "Our kitchen is on fire." Maybe call back later — or try a different restaurant.
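The triage rule is small enough to put straight into code. A minimal sketch (the function name and the category labels are illustrative, not from any SDK):

```python
def triage(status_code: int) -> str:
    """Classify an HTTP status code into a retry decision."""
    if status_code == 200:
        return "ok"                # parse the response and move on
    if status_code in (400, 401, 403, 422):
        return "your-fault"        # fix the request; retrying won't help
    if status_code in (429, 529):
        return "overloaded"        # wait and retry with backoff
    if status_code in (500, 502, 503):
        return "broken"            # retry cautiously, then fall back
    return "unknown"
```

`triage(429)` returns `"overloaded"`, so the caller knows to back off and retry rather than crash.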
There Are No Dumb Questions
"What's the difference between a 429 and a 529?"
Both mean "slow down." A 429 means YOU specifically are sending too many requests (you hit your rate limit). A 529 means the WHOLE service is overloaded (everyone is affected). The fix is the same: wait and retry.
"What about 200?"
That's the happy one — "everything worked!" Parse the response and move on.
Exponential backoff: the polite way to retry
When you get a 429 or 500, you need to retry — but HOW you retry matters enormously.
Bad approach — hammer the server:
Attempt 1: wait 1 second, retry
Attempt 2: wait 1 second, retry
Attempt 3: wait 1 second, retry
If the server is overloaded, retrying every second makes it worse. You're banging on the door while they're trying to clean up.
Good approach — exponential backoff:
Attempt 1: wait 2 seconds, retry
Attempt 2: wait 4 seconds, retry
Attempt 3: wait 8 seconds, retry
Each wait doubles. This gives the server more and more breathing room to recover.
Best approach — exponential backoff + jitter:
Attempt 1: wait 2 + random(0-2) seconds, retry
Attempt 2: wait 4 + random(0-2) seconds, retry
Attempt 3: wait 8 + random(0-2) seconds, retry
Why the random extra delay? Imagine 1,000 clients all get a 429 at the same moment. Without jitter, all 1,000 retry at exactly the same time — creating an even bigger traffic spike. Jitter spreads their retries across time, like staggering when cars merge onto a highway.
This is called the thundering herd problem — and jitter solves it.
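The backoff-plus-jitter schedule is a one-liner to compute. A sketch using the 2-second base and 0-2 second jitter from the examples above (tune both for your API's rate limits):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0,
                  jitter: float = 2.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter: 2s, 4s, 8s... plus a random extra."""
    delay = min(base * (2 ** (attempt - 1)), cap)  # attempt 1 -> 2s, 2 -> 4s, 3 -> 8s
    return delay + random.uniform(0, jitter)       # spread clients out in time
```

Successive attempts get 2, 4, 8... seconds plus up to 2 seconds of jitter, capped at 30 seconds so a long outage doesn't produce hour-long waits.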
The naive code vs. the production code
Here's what Priya's code looked like before and after the crash:
Before: the crash-on-error version
```python
import anthropic

client = anthropic.Anthropic()

def ask(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Any 429 or 500? Crash. No retry. No fallback. The whole service goes down.
After: the production version
```python
import anthropic
from tenacity import (
    retry, stop_after_attempt,
    wait_exponential, wait_random, retry_if_exception
)

client = anthropic.Anthropic()

FALLBACK = "I'm temporarily unavailable, please try again."

def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, anthropic.RateLimitError):
        return True
    if isinstance(exc, anthropic.APIStatusError):
        return exc.status_code in (429, 500, 502, 503, 529)
    return False

@retry(
    retry=retry_if_exception(is_retryable),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30)
         + wait_random(0, 2),
    reraise=False,
)
def _call_api(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def ask(prompt: str) -> str:
    try:
        return _call_api(prompt)
    except Exception:
        return FALLBACK
```
Let's break down what each piece does:
| Code | What it does | Why it matters |
|---|---|---|
| `is_retryable()` | Only retries 429, 500, 502, 503, 529 | Doesn't waste time retrying 400s (your fault) |
| `stop_after_attempt(3)` | Gives up after 3 tries | Prevents infinite retry loops |
| `wait_exponential(min=2, max=30)` | Waits 2s, then 4s, then 8s... | Gives the server time to recover |
| `wait_random(0, 2)` | Adds 0-2 seconds of jitter | Prevents thundering herd |
| `FALLBACK` | Shows a friendly message | Users never see a raw error |
Result: The same traffic spike that crashed the old version caused zero user-visible failures with the new version. Response times went up (retries take time), but every request eventually resolved.
There Are No Dumb Questions
"Why only 3 retries? Why not 10?"
With stop_after_attempt(3) you get three attempts, which means two exponential waits (2s + 4s, about 6 seconds of waiting). If the server is still down after that, more retries won't help — the problem is bigger than a temporary overload. At that point, show the fallback and alert your on-call team.
"What if even the fallback is wrong for my use case?"
The fallback is a design decision, not a technical one. For a chatbot, "try again later" is fine. For a payment processing system, you might queue the request instead of dropping it. Match the fallback to the stakes.
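If you'd rather not take on the tenacity dependency, the same pattern (capped attempts, exponential waits with jitter, fallback on exhaustion) fits in a short standard-library loop. In this hand-rolled sketch, `call` and `is_retryable` are stand-ins for your own API call and error check, and the `base` parameter is an addition that scales both the wait and the jitter:

```python
import random
import time

FALLBACK = "I'm temporarily unavailable, please try again."

def ask_with_retries(call, is_retryable, max_attempts: int = 3,
                     base: float = 2.0) -> str:
    """Run call(); retry retryable failures with exponential backoff + jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts:
                return FALLBACK  # not worth retrying, or out of attempts
            wait = base * (2 ** (attempt - 1)) + random.uniform(0, base)
            time.sleep(wait)     # 2s, 4s, 8s... plus jitter at the default base
    return FALLBACK
```

With the defaults this behaves like the tenacity version: up to three attempts, waits of roughly 2s then 4s in between, fallback afterwards.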
Streaming: showing progress instead of waiting
There's another kind of "slow" that frustrates users: the long wait for a response. Even if the model takes 5 seconds to generate 500 tokens, you don't have to show a blank screen for 5 seconds.
Streaming sends tokens to the user as they're generated — like watching someone type in real time instead of waiting for the full email.
When streaming helps: Chat interfaces, long-form generation, any UI where users watch the output appear.
When streaming doesn't help: Background processing, JSON parsing (you need the complete JSON before you can parse it), very short responses (nothing to stream).
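The consumption pattern is the same whatever the SDK: handle each token as it arrives, then assemble the full text at the end. A sketch fed by a plain iterator; with the Anthropic SDK, the tokens would come from its streaming interface (e.g. `client.messages.stream(...)`) instead:

```python
from typing import Callable, Iterable

def render_stream(tokens: Iterable[str],
                  on_token: Callable[[str], None] = lambda t: None) -> str:
    """Display each token as it arrives; return the full assembled text."""
    parts = []
    for token in tokens:
        on_token(token)       # show progress immediately, token by token
        parts.append(token)
    return "".join(parts)     # complete text, e.g. for logging or caching
```

In a CLI you'd pass `lambda t: print(t, end="", flush=True)` as `on_token`; a web UI would push each token over a websocket or SSE connection instead.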
✗ Without streaming
- User sees nothing for 5-30 seconds, then the full response appears at once
- Feels slow and unresponsive
- Simpler to implement, but that's its only point in favour

✓ With streaming
- First token appears in under 1 second
- User reads while the model generates
- Feels fast and alive
- Required for good UX on long responses
One more gotcha: truncated responses
The API returns a finish_reason (or stop_reason) field that tells you WHY the model stopped:
| `finish_reason` | What it means | What to do |
|---|---|---|
| `stop` | Model finished naturally | All good |
| `max_tokens` (or `length`) | Output was cut off because it hit the token limit | Your response is incomplete. Either increase `max_tokens` or handle truncation in your code. |
The trap: A truncated JSON response parses as invalid JSON. A truncated code block has missing closing brackets. If you're not checking finish_reason, you're shipping broken outputs to users without knowing.
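A small guard catches truncation before it ships. This is a sketch: the `"max_tokens"` value follows Anthropic's `stop_reason` naming, and the JSON check only applies when you asked for JSON output:

```python
import json

def check_response(text: str, stop_reason: str, expect_json: bool = False):
    """Return (ok, text); ok is False if the output is truncated or unparseable."""
    if stop_reason == "max_tokens":
        return False, text           # cut off: raise the limit or retry
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False, text       # e.g. a JSON object severed mid-string
    return True, text
```

If `ok` comes back `False`, either bump `max_tokens` and retry, or show the fallback rather than shipping half a JSON object to users.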
Back to Priya's chatbot. After the marketing email blast took it down, Priya's team spent a day adding exponential backoff, a user-facing fallback message, and a per-user rate limit. The next spike — from a TechCrunch article two months later, three times the traffic — the chatbot handled without a hiccup. Users saw a "We're experiencing high demand — please wait a moment" message for about 8 seconds, then got their responses. The CEO called it a "resilience win." Priya knew: it was just plumbing.
Key takeaways
- Classify every error: your fault, overloaded, or broken. Never retry "your fault" errors (400). Always retry "overloaded" errors (429) with backoff.
- Exponential backoff + jitter prevents the thundering herd problem. Double the wait each time, add a random delay.
- Always have a fallback. Users should never see a raw error. Show a friendly message, queue the request, or try a different model.
- Check `finish_reason`. If it says `max_tokens`, your response is truncated — handle it or increase the limit.
Knowledge Check
1. An API call returns a 429 status code. What does this mean, and what is the correct client-side response strategy?
2. What is the difference between streaming (stream=True) and waiting for a complete response, and when does streaming NOT improve perceived latency?
3. A model is asked to return JSON but occasionally returns a markdown code block containing JSON. Which parsing strategy correctly handles both cases?
4. What does finish_reason: "length" (or stop_reason: "max_tokens") indicate about the generation you received?