ADK with Cloud Run Auto-Scaling
Author: Venkata Sudhakar
ADK agents deployed on Cloud Run scale automatically based on incoming request volume. ShopMax India sees traffic spikes during sale events: the support agent may receive 10x its normal volume on festival days. Without correct auto-scaling configuration, requests queue or time out; with it, new instances spin up in seconds and scale back down when traffic drops, saving cost.
The key Cloud Run settings for ADK agents are max-instances (a hard cap on parallel instances), concurrency (requests served per instance), min-instances (keep-warm instances that avoid cold starts), and CPU/memory allocation. ADK agents that call the Gemini API are I/O-bound (each request spends most of its time waiting for the model response), so a concurrency of 10-20 per instance works well without CPU contention.
The example below shows the deployment command and a simple load test used to verify scaling behaviour.
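A deployment command along these lines would produce the scaling annotations shown in the output below. The service name, project, region, and scaling values match that output; the container image path and the CPU/memory sizes are illustrative assumptions, so substitute your own.

```shell
# Deploy the ADK agent container to Cloud Run with explicit scaling settings.
# --min-instances / --max-instances / --concurrency correspond to the
# minScale / maxScale / target annotations in the deployment output.
# The image path, CPU, and memory values below are assumptions.
gcloud run deploy shopmax-support-agent \
  --image asia-south1-docker.pkg.dev/shopmax-prod/agents/support-agent:latest \
  --project shopmax-prod \
  --region asia-south1 \
  --min-instances 2 \
  --max-instances 50 \
  --concurrency 15 \
  --cpu 1 \
  --memory 512Mi \
  --allow-unauthenticated
```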
Deploying gives output like the following:
Deploying container to Cloud Run service [shopmax-support-agent] in project [shopmax-prod]...
OK Deploying new service... Done.
OK Creating Revision...
OK Routing traffic...
Service [shopmax-support-agent] deployed to region [asia-south1].
Scaling annotations:
autoscaling.knative.dev/minScale: 2
autoscaling.knative.dev/maxScale: 50
autoscaling.knative.dev/target: 15
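A minimal load test can be sketched with curl and awk; the request count (100) and parallelism (20) are assumptions chosen to push past one instance's concurrency target. The script fetches the deployed service URL with gcloud, fires the requests, and summarises success count and latency.

```shell
# Hypothetical load test: fetch the service URL, send 100 requests
# 20 at a time, and report success count plus avg/max latency in ms.
SERVICE_URL="$(gcloud run services describe shopmax-support-agent \
  --region asia-south1 --format 'value(status.url)')"

seq 100 \
  | xargs -P 20 -I{} curl -s -o /dev/null \
      -w '%{http_code} %{time_total}\n' "$SERVICE_URL" \
  | awk '{ if ($1 == 200) ok++; ms = $2 * 1000; sum += ms; if (ms > max) max = ms }
     END { printf "Success: %d/%d\nAvg latency: %.0f ms\nMax latency: %.0f ms\n",
                  ok, NR, sum/NR, max }'
```

Watching the Cloud Run metrics page while this runs shows the instance count climbing as concurrent requests exceed the per-instance target.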
Running the load test gives output like the following:
Success: 100/100
Avg latency: 1842 ms
Max latency: 3210 ms
# Cloud Run console shows instances scaled from 2 to 9 during the test
For festival sale events where ShopMax expects sudden traffic spikes, pre-warm instances by temporarily raising min-instances to 10 one hour before the event, then lowering it back after the peak. Use Cloud Run traffic management to keep a warm revision active, and use Cloud Monitoring to create an alerting policy that fires when the instance count approaches max-instances, giving the team time to raise the limit if needed.
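The pre-warm and scale-back steps above are one-liners against the service deployed earlier (the alerting policy is easiest to create in the Cloud Monitoring console, so it is not shown here):

```shell
# One hour before the sale event: raise the keep-warm floor so instances
# are already running when the spike hits (value illustrative).
gcloud run services update shopmax-support-agent \
  --region asia-south1 --min-instances 10

# After the peak: drop back to the normal floor to stop paying
# for idle instances.
gcloud run services update shopmax-support-agent \
  --region asia-south1 --min-instances 2
```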