API Integration: Retries, Backoff, and Graceful Fallbacks
Learn how to handle timeouts, rate limits, and server errors in production AI API calls using exponential backoff and graceful fallbacks.
The day the chatbot crashed (and it wasn't the model's fault)
Priya's team shipped a customer-support chatbot. Worked great at 200 requests/day. Then marketing sent an email blast and traffic jumped to 3,000 requests/hour.
The chatbot didn't just slow down — it died. Every single user saw an error page for 40 minutes. The CEO was not happy.
What happened? The API returned a "429 — Too Many Requests" error. Priya's code had no idea what to do with that error, so it crashed. Not a model problem. Not a prompt problem. A plumbing problem.
Most production AI outages are plumbing problems. The model works fine — but the code that calls the model falls apart when things go wrong.
The three types of API errors (and what to do about each)
Every API error belongs to one of three categories. Memorise this and you'll never be confused by an HTTP status code again:
| Category | Status codes | Who caused it? | What to do |
|---|---|---|---|
| Your fault | 400, 401, 403, 422 | You sent a bad request | Fix it. Don't retry — it'll fail the same way every time. |
| Server overloaded | 429, 529 | Too many requests | Wait and retry. The server is fine, it's just busy. |
| Server broken | 500, 502, 503 | Something crashed on their end | Retry with caution. Might work on the next try, might not. |
Think of it like calling a restaurant:
- 400 (your fault): "Sorry, we don't serve breakfast." Calling back won't change the menu.
- 429 (overloaded): "All tables are full right now." Call back in 20 minutes and you'll get a table.
- 500 (broken): "Our kitchen is on fire." Maybe call back later — or try a different restaurant.
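The triage rule is small enough to put straight into code. A minimal sketch (the function name and the category labels are illustrative, not from any SDK):

```python
def triage(status_code: int) -> str:
    """Classify an HTTP status code into a retry decision."""
    if status_code == 200:
        return "ok"                # parse the response and move on
    if status_code in (400, 401, 403, 422):
        return "your-fault"        # fix the request; retrying won't help
    if status_code in (429, 529):
        return "overloaded"        # wait and retry with backoff
    if status_code in (500, 502, 503):
        return "broken"            # retry cautiously, then fall back
    return "unknown"
```

`triage(429)` returns `"overloaded"`, so the caller knows to back off and retry rather than crash.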
There Are No Dumb Questions
"What's the difference between a 429 and a 529?"
Both mean "slow down." A 429 means YOU specifically are sending too many requests (you hit your rate limit). A 529 means the WHOLE service is overloaded (everyone is affected). The fix is the same: wait and retry.
"What about 200?"
That's the happy one — "everything worked!" Parse the response and move on.
Exponential backoff: the polite way to retry
When you get a 429 or 500, you need to retry — but HOW you retry matters enormously.
Bad approach — hammer the server:
Attempt 1: wait 1 second, retry
Attempt 2: wait 1 second, retry
Attempt 3: wait 1 second, retry
If the server is overloaded, retrying every second makes it worse. You're banging on the door while they're trying to clean up.
Good approach — exponential backoff:
Attempt 1: wait 2 seconds, retry
Attempt 2: wait 4 seconds, retry
Attempt 3: wait 8 seconds, retry
Each wait doubles. This gives the server more and more breathing room to recover.
Best approach — exponential backoff + jitter:
Attempt 1: wait 2 + random(0-2) seconds, retry
Attempt 2: wait 4 + random(0-2) seconds, retry
Attempt 3: wait 8 + random(0-2) seconds, retry
Why the random extra delay? Imagine 1,000 clients all get a 429 at the same moment. Without jitter, all 1,000 retry at exactly the same time — creating an even bigger traffic spike. Jitter spreads their retries across time, like staggering when cars merge onto a highway.
This is called the thundering herd problem — and jitter solves it.
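The backoff-plus-jitter schedule is a one-liner to compute. A sketch using the 2-second base and 0-2 second jitter from the examples above (tune both for your API's rate limits):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0,
                  jitter: float = 2.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter: 2s, 4s, 8s... plus a random extra."""
    delay = min(base * (2 ** (attempt - 1)), cap)  # attempt 1 -> 2s, 2 -> 4s, 3 -> 8s
    return delay + random.uniform(0, jitter)       # spread clients out in time
```

Successive attempts get 2, 4, 8... seconds plus up to 2 seconds of jitter, capped at 30 seconds so a long outage doesn't produce hour-long waits.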
The naive code vs. the production code
Here's what Priya's code looked like before and after the crash:
Before: the crash-on-error version
```python
import anthropic

client = anthropic.Anthropic()

def ask(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Any 429 or 500? Crash. No retry. No fallback. The whole service goes down.
After: the production version
```python
import anthropic
from tenacity import (
    retry, stop_after_attempt,
    wait_exponential, wait_random, retry_if_exception
)

client = anthropic.Anthropic()

FALLBACK = "I'm temporarily unavailable, please try again."

def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, anthropic.RateLimitError):
        return True
    if isinstance(exc, anthropic.APIStatusError):
        return exc.status_code in (429, 500, 502, 503, 529)
    return False

@retry(
    retry=retry_if_exception(is_retryable),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30)
         + wait_random(0, 2),
    reraise=False,
)
def _call_api(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def ask(prompt: str) -> str:
    try:
        return _call_api(prompt)
    except Exception:
        return FALLBACK
```
Let's break down what each piece does:
| Code | What it does | Why it matters |
|---|---|---|
| `is_retryable()` | Only retries 429, 500, 502, 503, 529 | Doesn't waste time retrying 400s (your fault) |
| `stop_after_attempt(3)` | Gives up after 3 tries | Prevents infinite retry loops |
| `wait_exponential(min=2, max=30)` | Waits 2s, then 4s, then 8s... | Gives the server time to recover |
| `wait_random(0, 2)` | Adds 0-2 seconds of jitter | Prevents thundering herd |
| `FALLBACK` | Shows a friendly message | Users never see a raw error |
Result: The same traffic spike that crashed the old version caused zero user-visible failures with the new version. Response times went up (retries take time), but every request eventually resolved.
There Are No Dumb Questions
"Why only 3 retries? Why not 10?"
With stop_after_attempt(3) you get three attempts, which means two exponential waits (2s + 4s, about 6 seconds of waiting). If the server is still down after that, more retries won't help — the problem is bigger than a temporary overload. At that point, show the fallback and alert your on-call team.
"What if even the fallback is wrong for my use case?"
The fallback is a design decision, not a technical one. For a chatbot, "try again later" is fine. For a payment processing system, you might queue the request instead of dropping it. Match the fallback to the stakes.
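If you'd rather not take on the tenacity dependency, the same pattern (capped attempts, exponential waits with jitter, fallback on exhaustion) fits in a short standard-library loop. In this hand-rolled sketch, `call` and `is_retryable` are stand-ins for your own API call and error check, and the `base` parameter is an addition that scales both the wait and the jitter:

```python
import random
import time

FALLBACK = "I'm temporarily unavailable, please try again."

def ask_with_retries(call, is_retryable, max_attempts: int = 3,
                     base: float = 2.0) -> str:
    """Run call(); retry retryable failures with exponential backoff + jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts:
                return FALLBACK  # not worth retrying, or out of attempts
            wait = base * (2 ** (attempt - 1)) + random.uniform(0, base)
            time.sleep(wait)     # 2s, 4s, 8s... plus jitter at the default base
    return FALLBACK
```

With the defaults this behaves like the tenacity version: up to three attempts, waits of roughly 2s then 4s in between, fallback afterwards.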
Streaming: showing progress instead of waiting
There's another kind of "slow" that frustrates users: the long wait for a response. Even if the model takes 5 seconds to generate 500 tokens, you don't have to show a blank screen for 5 seconds.
Streaming sends tokens to the user as they're generated — like watching someone type in real time instead of waiting for the full email.
When streaming helps: Chat interfaces, long-form generation, any UI where users watch the output appear.
When streaming doesn't help: Background processing, JSON parsing (you need the complete JSON before you can parse it), very short responses (nothing to stream).
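The consumption pattern is the same whatever the SDK: handle each token as it arrives, then assemble the full text at the end. A sketch fed by a plain iterator; with the Anthropic SDK, the tokens would come from its streaming interface (e.g. `client.messages.stream(...)`) instead:

```python
from typing import Callable, Iterable

def render_stream(tokens: Iterable[str],
                  on_token: Callable[[str], None] = lambda t: None) -> str:
    """Display each token as it arrives; return the full assembled text."""
    parts = []
    for token in tokens:
        on_token(token)       # show progress immediately, token by token
        parts.append(token)
    return "".join(parts)     # complete text, e.g. for logging or caching
```

In a CLI you'd pass `lambda t: print(t, end="", flush=True)` as `on_token`; a web UI would push each token over a websocket or SSE connection instead.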
✗ Without streaming
- User sees nothing for 5-30 seconds, then the full response appears at once
- Feels slow and unresponsive
- Simpler to implement, but that's its only point in favour

✓ With streaming
- First token appears in under 1 second
- User reads while the model generates
- Feels fast and alive
- Required for good UX on long responses
One more gotcha: truncated responses
The API returns a finish_reason (or stop_reason) field that tells you WHY the model stopped:
| `finish_reason` | What it means | What to do |
|---|---|---|
| `stop` | Model finished naturally | All good |
| `max_tokens` (or `length`) | Output was cut off because it hit the token limit | Your response is incomplete. Either increase `max_tokens` or handle truncation in your code. |
The trap: A truncated JSON response parses as invalid JSON. A truncated code block has missing closing brackets. If you're not checking finish_reason, you're shipping broken outputs to users without knowing.
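A small guard catches truncation before it ships. This is a sketch: the `"max_tokens"` value follows Anthropic's `stop_reason` naming, and the JSON check only applies when you asked for JSON output:

```python
import json

def check_response(text: str, stop_reason: str, expect_json: bool = False):
    """Return (ok, text); ok is False if the output is truncated or unparseable."""
    if stop_reason == "max_tokens":
        return False, text           # cut off: raise the limit or retry
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False, text       # e.g. a JSON object severed mid-string
    return True, text
```

If `ok` comes back `False`, either bump `max_tokens` and retry, or show the fallback rather than shipping half a JSON object to users.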
Back to Priya's chatbot. After the marketing email blast took it down, Priya's team spent a day adding exponential backoff, a user-facing fallback message, and a per-user rate limit. The next spike — from a TechCrunch article two months later, three times the traffic — the chatbot handled without a hiccup. Users saw a "We're experiencing high demand — please wait a moment" message for about 8 seconds, then got their responses. The CEO called it a "resilience win." Priya knew: it was just plumbing.
Key takeaways
- Classify every error: your fault, overloaded, or broken. Never retry "your fault" errors (400). Always retry "overloaded" errors (429) with backoff.
- Exponential backoff + jitter prevents the thundering herd problem. Double the wait each time, add a random delay.
- Always have a fallback. Users should never see a raw error. Show a friendly message, queue the request, or try a different model.
- Check `finish_reason`. If it says `max_tokens`, your response is truncated — handle it or increase the limit.
Knowledge Check
1. An API call returns a 429 status code. What does this mean, and what is the correct client-side response strategy?
2. What is the difference between streaming (stream=True) and waiting for a complete response, and when does streaming NOT improve perceived latency?
3. A model is asked to return JSON but occasionally returns a markdown code block containing JSON. Which parsing strategy correctly handles both cases?
4. What does finish_reason: "length" (or stop_reason: "max_tokens") indicate about the generation you received?