ADK Production Incident Response
Author: Venkata Sudhakar
Even well-tested ADK agents encounter production incidents: model API outages, tool failures, runaway token costs, or sudden traffic spikes. ShopMax India maintains a structured incident response process so that on-call engineers can diagnose and mitigate agent issues in minutes rather than hours. The process covers detection via alerts, diagnosis using logs and metrics, mitigation through rollback or circuit breaking, and a post-mortem to prevent recurrence.
The incident response runbook integrates Cloud Monitoring alerts, Cloud Logging queries, and gcloud commands that can be run from any terminal. Key diagnostic steps include checking agent error rates, inspecting recent log entries for exception traces, reviewing token usage for cost spikes, and verifying downstream tool API health. Mitigation options range from rolling back to a previous revision to disabling a faulty tool and redeploying.
The example below shows the diagnostic script and key gcloud commands used in the ShopMax incident runbook.
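The original script is not reproduced here, so the following is only a minimal sketch of what such a diagnostic script might look like. It assumes the agent runs as a Cloud Run service and that errors are logged as plain-text entries; the function names (`build_log_query_cmd`, `format_report`, `fetch_errors`) are hypothetical, and the service name, region, and next-step hints are copied from the sample output. The log query uses `gcloud logging read` with its `--freshness`, `--limit`, and `--format` flags.

```python
import json
import subprocess
from datetime import datetime, timezone

SERVICE = "shopmax-support-agent"  # assumed: taken from the sample output
REGION = "asia-south1"             # assumed: taken from the sample output

# Hints copied verbatim from the runbook output; the elided commands stay elided.
NEXT_STEPS = [
    "1. Check Cloud Run revision traffic split: gcloud run services describe ...",
    "2. Rollback if needed: gcloud run services update-traffic ...",
    "3. Check Gemini API status: https://status.cloud.google.com",
    "4. Scale up min-instances if traffic spike: gcloud run services update ...",
]


def build_log_query_cmd(service, minutes=10, limit=20):
    """Build a `gcloud logging read` command for recent Cloud Run errors."""
    log_filter = (
        'resource.type="cloud_run_revision" '
        f'AND resource.labels.service_name="{service}" '
        "AND severity>=ERROR"
    )
    return [
        "gcloud", "logging", "read", log_filter,
        f"--freshness={minutes}m",
        f"--limit={limit}",
        "--format=json",
    ]


def format_report(service, region, errors, now, minutes=10):
    """Render a plain-text report; `errors` is a list of (timestamp, message)."""
    lines = [
        f"=== INCIDENT DIAGNOSTICS: {service} ===",
        f"Time: {now.strftime('%Y-%m-%dT%H:%M:%S')} UTC",
        f"Region: {region}",
        f"RECENT ERRORS ({len(errors)} in last {minutes} min):",
    ]
    lines += [f"[{ts}] {msg}" for ts, msg in errors]
    lines.append("NEXT STEPS:")
    lines += NEXT_STEPS
    return "\n".join(lines)


def fetch_errors(service, minutes=10):
    """Run the log query (requires an authenticated gcloud CLI)."""
    out = subprocess.run(
        build_log_query_cmd(service, minutes),
        capture_output=True, text=True, check=True,
    ).stdout
    entries = json.loads(out or "[]")
    # Entries without a text payload are shown with a placeholder message.
    return [
        (e.get("timestamp", "?"), e.get("textPayload", "<structured payload>"))
        for e in entries
    ]
```

With gcloud installed and authenticated, a report could be printed with `print(format_report(SERVICE, REGION, fetch_errors(SERVICE), datetime.now(timezone.utc)))`. Keeping the command construction and the formatting separate from the subprocess call makes both easy to check without touching a live project.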
It gives the following output:
=== INCIDENT DIAGNOSTICS: shopmax-support-agent ===
Time: 2026-04-06T14:32:10 UTC
Region: asia-south1
RECENT ERRORS (3 in last 10 min):
[2026-04-06T14:31:44] ConnectionError: Inventory API timeout after 30s
[2026-04-06T14:31:55] ConnectionError: Inventory API timeout after 30s
[2026-04-06T14:32:08] ConnectionError: Inventory API timeout after 30s
NEXT STEPS:
1. Check Cloud Run revision traffic split: gcloud run services describe ...
2. Rollback if needed: gcloud run services update-traffic ...
3. Check Gemini API status: https://status.cloud.google.com
4. Scale up min-instances if traffic spike: gcloud run services update ...
The rollback step gives the following output:
REVISION NAME                         TRAFFIC
shopmax-support-agent-00043-canary    100%
Traffic updated:
shopmax-support-agent-00041-stable: 100%
shopmax-support-agent-00043-canary: 0%
Rollback complete. Incident mitigated.
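The rollback shown above could be scripted roughly as follows. This is a sketch under assumptions, not the runbook's actual commands: the helper names are hypothetical, the revision names come from the sample output, and the `dry_run` default is a design choice added here so that the command can be reviewed before traffic moves. It relies on `gcloud run services describe` and `gcloud run services update-traffic` with the `--region` and `--to-revisions` flags.

```python
import shlex
import subprocess


def describe_traffic_cmd(service, region):
    """Command that shows the current revision traffic split."""
    return [
        "gcloud", "run", "services", "describe", service,
        f"--region={region}",
        "--format=yaml(status.traffic)",
    ]


def rollback_cmd(service, region, stable_revision):
    """Command that routes 100% of traffic to a known-good revision."""
    return [
        "gcloud", "run", "services", "update-traffic", service,
        f"--region={region}",
        f"--to-revisions={stable_revision}=100",
    ]


def rollback(service, region, stable_revision, dry_run=True):
    """Return the command as a shell string; execute it only if dry_run is False."""
    cmd = rollback_cmd(service, region, stable_revision)
    if not dry_run:
        # Requires an authenticated gcloud CLI with permission to update the service.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)
```

For example, `rollback("shopmax-support-agent", "asia-south1", "shopmax-support-agent-00041-stable")` returns the command string without running it; passing `dry_run=False` executes it.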
After every incident, hold a 30-minute blameless post-mortem within 48 hours. Document the timeline, root cause, customer impact, and action items in a shared Google Doc. Common action items for ShopMax agents include adding circuit breakers for flaky tools, improving probe thresholds, adding runbook steps for new failure modes, and setting up proactive budget alerts. Store all post-mortems in a Drive folder linked from the incident management sheet so the team builds institutional knowledge over time.
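One of the action items above, a circuit breaker for a flaky tool, can be sketched in a few lines. This is an illustrative implementation, not ShopMax's own: the `CircuitBreaker` and `CircuitOpenError` names, the failure threshold, and the cool-down period are all assumptions, and the wrapped function stands in for any tool call such as the Inventory API lookup that timed out in the sample output.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable clock, useful for tests
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping tool call")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        return result
```

A tool function would then be invoked as `breaker.call(check_inventory, sku)` (with `check_inventory` being a hypothetical tool), and the agent can catch `CircuitOpenError` to return a fallback message instead of waiting on repeated 30-second timeouts.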