ADK Error Handling and Retry Patterns

Author: Venkata Sudhakar

Production ADK agents encounter errors constantly: external APIs time out, databases return unexpected results, and third-party services go down temporarily. An agent that crashes or returns a confusing error message on the first failure is not production-ready. Resilient agents handle errors gracefully at every layer: tools catch exceptions and return structured error responses so the agent can explain the issue conversationally, retry logic handles transient failures transparently, and fallback tools provide degraded-but-functional responses when primary sources are unavailable.

The key design principle is that tools should never raise unhandled exceptions to the agent. Instead, tools return structured dicts that include an error key when something goes wrong. The agent reads the error, reasons about it, and responds appropriately: apologising, suggesting alternatives, or asking the user to try again. For transient errors like network timeouts, wrap tool calls in retry logic with exponential backoff. For persistent failures, provide a fallback tool that returns cached or static data so the agent can still give a useful response.
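
The "tools never raise" rule can also be enforced in one place instead of being repeated inside every tool. The following is a minimal sketch in plain Python, not part of the ADK API: the decorator name tool_error_boundary and the get_stock_level tool are hypothetical, introduced only for illustration, and the wording of the error message is an assumption.

```python
import functools


def tool_error_boundary(func):
    """Hypothetical helper: turn any exception raised by a tool into a
    structured error dict, so nothing propagates up to the agent."""
    @functools.wraps(func)  # keep the tool's name and docstring intact
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # A customer-safe message plus the exception type for diagnostics.
            return {
                "error": "The " + func.__name__ + " service is temporarily unavailable.",
                "error_type": type(exc).__name__,
            }
    return wrapper


@tool_error_boundary
def get_stock_level(sku: str) -> dict:
    """Hypothetical tool: look up the stock level for a SKU."""
    inventory = {"SKU-1001": 12}  # illustrative data only
    return {"sku": sku, "in_stock": inventory[sku]}  # KeyError for unknown SKUs


print(get_stock_level("SKU-1001"))
print(get_stock_level("SKU-9999"))
```

Catching the broad Exception class is deliberate here because the tool is the error boundary. A tool that needs different messages for different failures would still catch those specific exceptions itself and leave the decorator as a safety net.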

The example below shows a resilient order tracking agent with error handling at the tool level, automatic retry with exponential backoff for network errors, and a fallback to cached data when the primary order API is unavailable.

import time, random
from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types as genai_types

# Simulated order API that sometimes fails (30% failure rate)
def _call_order_api(order_id: str) -> dict:
    if random.random() < 0.3:  # simulate 30% transient failure
        raise ConnectionError("Order API timeout - service temporarily unavailable")
    orders = {
        "ORD-88421": {"status": "Out for delivery", "eta": "Today 7pm",
                      "courier": "Delhivery", "tracking": "DL789456123"},
        "ORD-55987": {"status": "Delivered", "date": "30 March 2025"}
    }
    return orders.get(order_id.upper(), {"not_found": True})

# Cached fallback data (stale but better than nothing)
ORDER_CACHE = {
    "ORD-88421": {"status": "In transit (cached)", "note": "Live status temporarily unavailable"},
    "ORD-55987": {"status": "Delivered (cached)", "note": "Live status temporarily unavailable"}
}

def get_order_status(order_id: str) -> dict:
    # Retry up to 3 times with exponential backoff
    last_error = ""
    for attempt in range(3):
        try:
            result = _call_order_api(order_id)
            if result.get("not_found"):
                return {"error": "Order " + order_id + " not found. Please verify the order ID."}
            return result  # success
        except ConnectionError as e:
            last_error = str(e)
            if attempt < 2:
                wait = 2 ** attempt  # 1s, 2s backoff
                print("Attempt " + str(attempt+1) + " failed. Retrying in " + str(wait) + "s...")
                time.sleep(wait)

    # All retries failed - try cache
    cached = ORDER_CACHE.get(order_id.upper())
    if cached:
        return {"cached": True, **cached}

    # No cache either - return graceful error
    return {
        "error": "Order tracking is temporarily unavailable. "
                 "Please try again in a few minutes or call 1800-SHOPMAX.",
        "last_error": last_error
    }

agent = Agent(
    model="gemini-2.0-flash",
    name="resilient_order_agent",
    instruction=(
        "You are a ShopMax India support agent. "
        "When a tool returns an error key, explain the issue warmly and suggest alternatives. "
        "When data is cached (cached=True), mention it may not be fully up to date. "
        "Never expose raw error messages to the customer."
    ),
    tools=[get_order_status]
)

Testing the agent under normal and failure conditions gives the following output, showing graceful handling at each failure level:

Q: Where is order ORD-88421?
Attempt 1 failed. Retrying in 1s...
Attempt 2 failed. Retrying in 2s...
[Cache fallback used]
Your order ORD-88421 shows as in transit based on our last update. Our live
tracking is temporarily experiencing delays, so this may not reflect the
very latest status. For the most current information, please call 1800-SHOPMAX
or check the Delhivery app with your tracking number.

Q: Track order ORD-99999 (invalid ID)
I was not able to find order ORD-99999 in our system. Could you double-check
the order ID from your confirmation email? It should start with ORD- followed
by 5 digits. I am happy to try again once you have the correct number!

# Retry logic: 3 attempts with 1s/2s backoff before fallback
# Cache fallback: stale data is better than a crash or empty response
# Agent never exposed the raw ConnectionError to the customer
# Invalid order: structured error returned, agent responds conversationally
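
Because the simulated API fails at random, the retry and fallback paths summarised above are hard to reproduce on demand. The sketch below factors the same logic into a standalone helper so each path can be exercised deterministically without calling the model. The helper name call_with_retry, its parameters, and the injectable sleep argument are assumptions made for this illustration. They are not part of ADK.

```python
import time
from typing import Callable, Optional


def call_with_retry(primary: Callable[[], dict],
                    fallback: Optional[Callable[[], Optional[dict]]] = None,
                    attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Hypothetical helper: retry primary() on ConnectionError with
    exponential backoff, then try fallback(), then return an error dict."""
    last_error = ""
    for attempt in range(attempts):
        try:
            return primary()
        except ConnectionError as exc:
            last_error = str(exc)
            if attempt < attempts - 1:
                # Waits of base_delay, 2 * base_delay, ... between attempts.
                sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        cached = fallback()
        if cached:
            return {"cached": True, **cached}
    return {"error": "Service temporarily unavailable.", "last_error": last_error}


def always_down() -> dict:
    """Stand-in for an order API that is completely unavailable."""
    raise ConnectionError("simulated timeout")


# Record the requested waits instead of sleeping, so the check runs instantly.
waits = []
result = call_with_retry(always_down,
                         fallback=lambda: {"status": "In transit (cached)"},
                         sleep=waits.append)
print(waits)
print(result)
```

Passing sleep as a parameter is a testing convenience: production code keeps the default time.sleep, while a test substitutes a recorder and asserts on the backoff schedule.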

Error handling architecture for production: tools are the error boundary. All exceptions are caught inside the tool function and returned as structured error dicts, so the agent layer never sees raw Python exceptions. Use three attempts with exponential backoff for all network-dependent tools; with three attempts there are two waits, 1s and 2s, as in the example above (a fourth attempt would add a 4s wait). Always provide a cache or static fallback for critical tools like order status, product availability, and account balance, because users tolerate slightly stale data far better than a complete failure. Log every tool error with the session_id and error details using the ADK callback system (Tutorial 325) so your operations team can monitor tool failure rates and fix root causes proactively.
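
As a rough illustration of the logging recommendation, the sketch below records each tool error together with a session ID and keeps a per-tool failure count. It uses only the Python standard library and is a stand-in for the logic such a callback might run. It does not show the ADK callback API itself, and the function name record_tool_result, the logger name, and the way session_id is passed in are all assumptions.

```python
import logging
from collections import Counter

logger = logging.getLogger("tool_errors")  # hypothetical logger name
failure_counts = Counter()                 # failures per tool, for monitoring


def record_tool_result(session_id: str, tool_name: str, result: dict) -> dict:
    """Hypothetical helper: log structured tool errors and count them.
    The result is returned unchanged so the agent still sees it."""
    if "error" in result:
        failure_counts[tool_name] += 1
        logger.warning(
            "tool_error session_id=%s tool=%s error=%s detail=%s",
            session_id, tool_name, result["error"], result.get("last_error", ""),
        )
    return result


# Example: one failed and one successful call in the same session.
record_tool_result("session-001", "get_order_status",
                   {"error": "Order tracking is temporarily unavailable.",
                    "last_error": "Order API timeout"})
record_tool_result("session-001", "get_order_status",
                   {"status": "Delivered", "date": "30 March 2025"})
print(failure_counts["get_order_status"])
```

Keeping the raw last_error in the log while the agent instruction forbids showing it to the customer gives operations the detail they need without leaking it into the conversation.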
