In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > Google Gemini API > ADK Self-healing Agent

ADK Self-healing Agent

Author: Venkata Sudhakar

Production AI agents encounter tool failures - APIs go down, databases timeout, external services return unexpected errors. A self-healing agent detects these failures, diagnoses what went wrong, and autonomously selects an alternative strategy or fallback path to complete the task - rather than crashing or returning an unhelpful error to the user.

In this tutorial, we build a ShopMax India order lookup agent that tries the primary database first, detects failures, falls back to a cache layer, and finally to a manual lookup path - all transparently, maintaining a repair log of what was tried and why.

The below example shows the self-healing pattern with three fallback tiers and an auto-repair log.

import os
import time
from google.adk.agents import LlmAgent
from google.adk.tools import FunctionTool
from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai import types

# Simulated tools with failure modes
def lookup_order_primary_db(order_id: str) -> dict:
    """Query the primary ShopMax India order database."""
    # Simulating DB timeout
    raise TimeoutError(f"Primary DB connection timeout after 5s for order {order_id}")

def lookup_order_cache(order_id: str) -> dict:
    """Query the Redis cache layer for recent orders."""
    cache = {
        "ORD-9901": {"status": "Shipped", "eta": "Tomorrow", "source": "cache"},
        "ORD-9902": {"status": "Delivered", "date": "Yesterday", "source": "cache"}
    }
    if order_id in cache:
        return cache[order_id]
    raise KeyError(f"Order {order_id} not in cache")

def lookup_order_manual(order_id: str) -> dict:
    """Fallback: create a manual lookup ticket for the support team."""
    ticket_id = f"MANUAL-{order_id[-4:]}"
    return {
        "order_id": order_id,
        "status": "Lookup in progress",
        "ticket_id": ticket_id,
        "expected_response": "15 minutes",
        "source": "manual_fallback"
    }

def get_repair_log() -> dict:
    """Return the current tool failure and recovery log."""
    return {"log_entries": [], "message": "Repair log ready - agent will populate this"}

healing_agent = LlmAgent(
    name="self_healing_order_agent",
    model="gemini-2.0-flash",
    instruction="""You are the ShopMax India resilient order lookup agent.
To look up any order, follow this priority order:
1. Try lookup_order_primary_db first.
2. If it raises a TimeoutError or any exception, log the failure and try lookup_order_cache.
3. If the cache also fails, fall back to lookup_order_manual.
Always tell the customer which source provided the result and note any fallbacks used.
Be transparent about service degradation but reassuring in tone.""",
    tools=[
        FunctionTool(lookup_order_primary_db),
        FunctionTool(lookup_order_cache),
        FunctionTool(lookup_order_manual),
        FunctionTool(get_repair_log)
    ]
)

Now run the agent with an order that triggers the self-healing chain,

It gives the following output,

Looking up: ORD-9901
I attempted to retrieve your order from our primary database but encountered a connection
timeout (our systems are experiencing some load right now). I automatically switched to
our fast cache layer and found your order:

Order ORD-9901: SHIPPED - Expected delivery tomorrow.
(Note: Retrieved from cache. Real-time tracking may show more detail.)

Looking up: ORD-0000
I tried our primary database (timeout) and the cache layer (not found) for order ORD-0000.
I have automatically raised a manual lookup ticket:

Ticket ID: MANUAL-0000
Our support team will update you within 15 minutes.
We apologise for the inconvenience - our systems are recovering and this order requires
a manual check. You will receive a notification shortly.

The self-healing pattern applies beyond order lookup - use it for any agent that calls external APIs. Define a clear tier hierarchy: primary (live DB) - secondary (cache) - tertiary (queue for async processing) - final (human escalation). Log every failure and recovery to a monitoring system like Cloud Monitoring or Datadog to track which tier is being hit most often as an early signal of infrastructure issues.

Send your comments, suggestions or queries regarding this site to [email protected].