ADK Production Incident Response

Author: Venkata Sudhakar

Even well-tested ADK agents encounter production incidents: model API outages, tool failures, runaway token costs, and sudden traffic spikes. ShopMax India maintains a structured incident response process so that on-call engineers can diagnose and mitigate agent issues in minutes rather than hours. The process covers detection via alerts, diagnosis using logs and metrics, mitigation through rollback or circuit breaking, and a post-mortem to prevent recurrence.

The incident response runbook integrates Cloud Monitoring alerts, Cloud Logging queries, and gcloud commands that can be run from any terminal. Key diagnostic steps include checking agent error rates, inspecting recent log entries for exception traces, reviewing token usage for cost spikes, and verifying downstream tool API health. Mitigation options range from rolling back to a previous revision to disabling a faulty tool and redeploying.
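To make the "reviewing token usage for cost spikes" step concrete, here is a minimal sketch of a spike check. It assumes per-minute token totals are already available as a plain list (for example, exported from usage metrics). The function name `token_spike`, the 5-minute window, and the 3x threshold are illustrative assumptions, not part of the ShopMax runbook.

```python
from statistics import mean

def token_spike(tokens_per_minute: list[int], window: int = 5, factor: float = 3.0) -> bool:
    """Return True if the latest minute exceeds `factor` times the mean of the preceding window."""
    if len(tokens_per_minute) < window + 1:
        return False  # not enough history to form a baseline
    # Baseline is the mean of the `window` minutes just before the latest one.
    baseline = mean(tokens_per_minute[-(window + 1):-1])
    return baseline > 0 and tokens_per_minute[-1] > factor * baseline

# Hypothetical per-minute token totals (illustrative numbers only).
normal = [1200, 1100, 1300, 1250, 1150, 1400]
spiking = [1200, 1100, 1300, 1250, 1150, 9800]

print(token_spike(normal))   # False: 1400 is below 3 x 1200
print(token_spike(spiking))  # True: 9800 is above 3 x 1200
```

A relative threshold like this adapts to each agent's normal load, whereas a fixed token limit would need retuning whenever traffic grows.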

The example below shows the diagnostic script, which also prints the key gcloud commands used in the ShopMax incident runbook.

# incident_diag.py - Run this first during any ADK agent incident
from google.cloud import logging as cloud_logging
from datetime import datetime, timedelta, timezone
import os

PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]
SERVICE    = "shopmax-support-agent"
REGION     = "asia-south1"

def recent_errors(minutes: int = 10) -> list:
    """Fetch recent ERROR and CRITICAL log entries from the agent."""
    client = cloud_logging.Client(project=PROJECT_ID)
    # Use a timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
    since = (datetime.now(timezone.utc) - timedelta(minutes=minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Space-separated filter terms are combined with an implicit AND.
    filter_str = (
        f'resource.type="cloud_run_revision" '
        f'resource.labels.service_name="{SERVICE}" '
        f'severity>="ERROR" '
        f'timestamp>="{since}"'
    )
    entries = list(client.list_entries(filter_=filter_str, max_results=20))
    # Truncate each payload so the summary stays readable in a terminal.
    return [{"ts": e.timestamp.isoformat(), "msg": str(e.payload)[:120]} for e in entries]

def print_incident_summary():
    print(f"=== INCIDENT DIAGNOSTICS: {SERVICE} ===")
    print(f"Time: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S')} UTC")
    print(f"Region: {REGION}")
    print()
    errors = recent_errors(minutes=10)
    if errors:
        print(f"RECENT ERRORS ({len(errors)} in last 10 min):")
        for e in errors[:5]:
            print(f"  [{e['ts'][:19]}] {e['msg']}")
    else:
        print("No ERROR/CRITICAL logs in last 10 minutes.")
    print()
    # In step 2, PREV is a placeholder: replace it with the last known-good revision name.
    print("NEXT STEPS:")
    print("  1. Check Cloud Run revision traffic split: gcloud run services describe " + SERVICE + " --region=" + REGION)
    print("  2. Rollback if needed: gcloud run services update-traffic " + SERVICE + " --to-revisions=PREV=100 --region=" + REGION)
    print("  3. Check Gemini API status: https://status.cloud.google.com")
    print("  4. Scale up min-instances if traffic spike: gcloud run services update " + SERVICE + " --min-instances=10 --region=" + REGION)

if __name__ == "__main__":
    print_incident_summary()

Running the script produces output like the following (the gcloud commands are abbreviated here):

=== INCIDENT DIAGNOSTICS: shopmax-support-agent ===
Time: 2026-04-06T14:32:10 UTC
Region: asia-south1

RECENT ERRORS (3 in last 10 min):
  [2026-04-06T14:31:44] ConnectionError: Inventory API timeout after 30s
  [2026-04-06T14:31:55] ConnectionError: Inventory API timeout after 30s
  [2026-04-06T14:32:08] ConnectionError: Inventory API timeout after 30s

NEXT STEPS:
  1. Check Cloud Run revision traffic split: gcloud run services describe ...
  2. Rollback if needed: gcloud run services update-traffic ...
  3. Check Gemini API status: https://status.cloud.google.com
  4. Scale up min-instances if traffic spike: gcloud run services update ...

Checking the traffic split and then rolling back, as in steps 1 and 2, gives output along these lines:

REVISION NAME                          TRAFFIC
shopmax-support-agent-00043-canary     100%

Traffic updated:
  shopmax-support-agent-00041-stable: 100%
  shopmax-support-agent-00043-canary:   0%

Rollback complete. Incident mitigated.
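The rollback in step 2 uses a `PREV` placeholder that must be replaced with a concrete revision name. The sketch below shows one way to assemble that command from its parts. The helper `rollback_command` is hypothetical, and the revision name is taken from the sample output above. It only builds the argument list, so running it does not touch any service.

```python
import shlex

def rollback_command(service: str, region: str, revision: str) -> list[str]:
    """Build the gcloud argv that routes 100% of traffic to `revision`."""
    return [
        "gcloud", "run", "services", "update-traffic", service,
        f"--to-revisions={revision}=100",
        f"--region={region}",
    ]

cmd = rollback_command(
    "shopmax-support-agent",
    "asia-south1",
    "shopmax-support-agent-00041-stable",  # last known-good revision from the sample output
)
# Print a copy-pasteable command line for the on-call engineer to review.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string makes it safe to pass to `subprocess.run` later without shell quoting issues, should the team choose to automate the step.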

After every incident, hold a 30-minute blameless post-mortem within 48 hours. Document the timeline, root cause, customer impact, and action items in a shared Google Doc. Common action items for ShopMax agents include adding circuit breakers for flaky tools, improving probe thresholds, adding runbook steps for new failure modes, and setting up proactive budget alerts. Store all post-mortems in a Drive folder linked from the incident management sheet so the team builds institutional knowledge over time.
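Since circuit breakers for flaky tools are a common action item, a minimal sketch of the pattern may help. It assumes a tool is an ordinary Python callable. The `CircuitBreaker` class, its thresholds, and the `flaky_inventory_lookup` stand-in are illustrative assumptions, not ADK APIs or ShopMax code.

```python
import time

class CircuitBreaker:
    """Stop calling a tool after repeated failures, then retry after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: tool temporarily disabled")
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success resets the failure count
        return result

def flaky_inventory_lookup(sku: str):
    # Hypothetical tool that always times out, mirroring the sample error log.
    raise ConnectionError("Inventory API timeout after 30s")

breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
outcomes = []
for _ in range(4):
    try:
        breaker.call(flaky_inventory_lookup, "SKU-1")
    except ConnectionError:
        outcomes.append("tool error")
    except RuntimeError:
        outcomes.append("short-circuited")

print(outcomes)  # ['tool error', 'tool error', 'tool error', 'short-circuited']
```

After the third failure the breaker rejects calls immediately, so the agent can return a fallback answer instead of waiting on another 30-second timeout.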
