
ADK Production Incident Response

Author: Venkata Sudhakar

Even well-tested ADK agents encounter production incidents: model API outages, tool failures, runaway token costs, or sudden traffic spikes. ShopMax India maintains a structured incident response process so that on-call engineers can diagnose and mitigate agent issues in minutes rather than hours. The process covers detection via alerts, diagnosis using logs and metrics, mitigation through rollback or circuit breaking, and a post-mortem to prevent recurrence.

The incident response runbook integrates Cloud Monitoring alerts, Cloud Logging queries, and gcloud commands that can be run from any terminal. Key diagnostic steps include checking agent error rates, inspecting recent log entries for exception traces, reviewing token usage for cost spikes, and verifying downstream tool API health. Mitigation options range from rolling back to a previous revision to disabling a faulty tool and redeploying.
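
As a rough illustration of the "inspect recent log entries" step, the sketch below builds a `gcloud logging read` command for recent error-level entries and counts the messages so that repeated failures stand out. It assumes the agent runs as a Cloud Run service named shopmax-support-agent and that structured log entries carry their text under `jsonPayload.message`; the function names are hypothetical and not part of the ShopMax runbook.

```python
import json
import subprocess
from collections import Counter

SERVICE = "shopmax-support-agent"  # assumed Cloud Run service name


def build_error_query(service: str, freshness: str = "10m", limit: int = 50) -> list:
    """Build a `gcloud logging read` command for recent ERROR-level entries."""
    log_filter = (
        'resource.type="cloud_run_revision" '
        f'AND resource.labels.service_name="{service}" '
        "AND severity>=ERROR"
    )
    return [
        "gcloud", "logging", "read", log_filter,
        f"--freshness={freshness}",
        f"--limit={limit}",
        "--format=json",
    ]


def summarize_errors(entries: list) -> Counter:
    """Count log entries by message text."""
    messages = []
    for entry in entries:
        # Assumption: text lives in textPayload or in jsonPayload.message.
        text = entry.get("textPayload") or entry.get("jsonPayload", {}).get("message") or ""
        messages.append(str(text).strip() or "<no message>")
    return Counter(messages)


def fetch_recent_errors(service: str = SERVICE) -> Counter:
    """Run the query (needs an authenticated gcloud CLI) and summarize it."""
    result = subprocess.run(
        build_error_query(service), capture_output=True, text=True, check=True
    )
    return summarize_errors(json.loads(result.stdout or "[]"))
```

Keeping the command builder and the summarizer separate from the subprocess call means both can be checked without access to a live project.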

The ShopMax incident runbook pairs a diagnostic script with a few key gcloud commands. The diagnostic script gives the following output:

=== INCIDENT DIAGNOSTICS: shopmax-support-agent ===
Time: 2026-04-06T14:32:10 UTC
Region: asia-south1

RECENT ERRORS (3 in last 10 min):
  [2026-04-06T14:31:44] ConnectionError: Inventory API timeout after 30s
  [2026-04-06T14:31:55] ConnectionError: Inventory API timeout after 30s
  [2026-04-06T14:32:08] ConnectionError: Inventory API timeout after 30s

NEXT STEPS:
  1. Check Cloud Run revision traffic split: gcloud run services describe ...
  2. Rollback if needed: gcloud run services update-traffic ...
  3. Check Gemini API status: https://status.cloud.google.com
  4. Scale up min-instances if traffic spike: gcloud run services update ...

The traffic check and rollback commands give the following output:

REVISION NAME                          TRAFFIC
shopmax-support-agent-00043-canary     100%

Traffic updated:
  shopmax-support-agent-00041-stable: 100%
  shopmax-support-agent-00043-canary:   0%

Rollback complete. Incident mitigated.
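
A rollback like the one above can be wrapped in a small helper so the on-call engineer does not have to assemble flags under pressure. The sketch below is one possible shape, assuming the service name, region, and stable revision name seen in the output; the helper names and the dry-run default are illustrative choices, not the runbook's actual commands.

```python
import subprocess

SERVICE = "shopmax-support-agent"  # taken from the output above
REGION = "asia-south1"             # taken from the output above


def describe_cmd(service: str = SERVICE, region: str = REGION) -> list:
    """Command to inspect the service, including its current traffic split."""
    return [
        "gcloud", "run", "services", "describe", service,
        f"--region={region}",
        "--format=json",
    ]


def rollback_cmd(stable_revision: str, service: str = SERVICE, region: str = REGION) -> list:
    """Command that routes 100% of traffic to the given revision."""
    return [
        "gcloud", "run", "services", "update-traffic", service,
        f"--region={region}",
        f"--to-revisions={stable_revision}=100",
    ]


def rollback(stable_revision: str, dry_run: bool = True) -> list:
    """Print the rollback command, and run it only when dry_run is False."""
    cmd = rollback_cmd(stable_revision)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `rollback("shopmax-support-agent-00041-stable")` only prints the command; passing `dry_run=False` executes it and requires an authenticated gcloud CLI.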

After every incident, hold a 30-minute blameless post-mortem within 48 hours. Document the timeline, root cause, customer impact, and action items in a shared Google Doc. Common action items for ShopMax agents include adding circuit breakers for flaky tools, improving probe thresholds, adding runbook steps for new failure modes, and setting up proactive budget alerts. Store all post-mortems in a Drive folder linked from the incident management sheet so the team builds institutional knowledge over time.
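
To make the circuit breaker action item concrete, here is a minimal sketch of a breaker that could wrap a flaky tool call such as an inventory lookup. The class name, thresholds, and cool-down value are assumptions chosen for illustration; this is a generic pattern, not ShopMax's implementation or an ADK API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the tool call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("tool temporarily disabled")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure will reopen the circuit.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0           # any success resets the failure count
        return result
```

With a breaker like this around the tool, three consecutive timeouts stop further calls for the cool-down period, so the agent can fail fast or fall back instead of waiting on every request.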


 
  


  