
ADK with Cloud Run Auto-Scaling

Author: Venkata Sudhakar

ADK agents deployed on Cloud Run scale automatically based on incoming request volume. ShopMax India sees traffic spikes during sale events - the support agent may receive 10x normal volume on festival days. Without correct auto-scaling configuration, requests queue or time out; with it, new instances spin up in seconds and scale back down when traffic drops to save cost.

The key Cloud Run settings for ADK agents are: max-instances (a hard cap on parallel instances), concurrency (the number of requests each instance serves simultaneously), min-instances (keep-warm instances that avoid cold starts), and CPU/memory allocation. ADK agents that call the Gemini API are I/O bound - each request spends most of its time waiting for the model response - so a concurrency of 10-20 per instance works well without CPU contention.

The example below shows the deployment command and a simple load test to verify scaling behaviour.
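The deploy command itself did not survive in this capture; the following is a sketch consistent with the scaling annotations shown in the output below. The min-instances, max-instances, concurrency, region, and project values mirror those annotations; the container image path and CPU/memory values are assumptions for illustration.

```shell
# Deploy the ADK support agent with explicit auto-scaling settings.
# Image path and resource sizes are illustrative - adjust for your project.
gcloud run deploy shopmax-support-agent \
  --image asia-south1-docker.pkg.dev/shopmax-prod/agents/support-agent:latest \
  --project shopmax-prod \
  --region asia-south1 \
  --min-instances 2 \
  --max-instances 50 \
  --concurrency 15 \
  --cpu 1 \
  --memory 512Mi
```

Setting --concurrency to 15 tells the Knative autoscaler to target about 15 in-flight requests per instance, which is why the `autoscaling.knative.dev/target` annotation in the output reads 15.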


It gives the following output,

Deploying container to Cloud Run service [shopmax-support-agent] in project [shopmax-prod]...
OK Deploying new service... Done.
  OK Creating Revision...
  OK Routing traffic...
Service [shopmax-support-agent] deployed to region [asia-south1].

Scaling annotations:
  autoscaling.knative.dev/minScale: 2
  autoscaling.knative.dev/maxScale: 50
  autoscaling.knative.dev/target: 15

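The load-test script is also missing from the capture; the following is a minimal Python sketch that produces output in the same format as the results below. The service URL, request count, and worker count are assumptions for illustration.

```python
import concurrent.futures
import time
import urllib.request


def fire_request(url: str, timeout: float = 30.0):
    """Send one GET request; return (success, latency in ms)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False
    return ok, (time.perf_counter() - start) * 1000


def run_load_test(url: str, total: int = 100, concurrency: int = 20):
    """Fire `total` requests with `concurrency` parallel workers."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: fire_request(url), range(total)))
    successes = sum(1 for ok, _ in results if ok)
    latencies = [ms for ok, ms in results if ok]
    print(f"Success: {successes}/{total}")
    if latencies:
        print(f"Avg latency: {sum(latencies) / len(latencies):.0f} ms")
        print(f"Max latency: {max(latencies):.0f} ms")
    return successes, latencies


# Example call (hypothetical Cloud Run URL - replace with your service's):
# run_load_test("https://shopmax-support-agent-xxxx.a.run.app/")
```

Because 20 workers exceed one instance's concurrency target of 15, a burst like this forces the autoscaler to add instances, which is the behaviour the test is meant to confirm.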
It gives the following output,

Success: 100/100
Avg latency: 1842 ms
Max latency: 3210 ms

# Cloud Run console shows instances scaled from 2 to 9 during the test

For festival sale events where ShopMax expects sudden traffic spikes, pre-warm instances by temporarily raising min-instances to 10 about an hour before the event, then lower it again after the peak. Use Cloud Run traffic management to keep a warm revision active, and create a Cloud Monitoring alerting policy that fires when the instance count approaches max-instances, giving the team time to raise the limit if needed.
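The pre-warm step described above can be sketched as two `gcloud run services update` calls; the service name, region, and instance counts follow this article's example.

```shell
# About an hour before the sale event: raise the keep-warm floor.
gcloud run services update shopmax-support-agent \
  --region asia-south1 \
  --min-instances 10

# After the peak has passed: restore the normal floor to save cost.
gcloud run services update shopmax-support-agent \
  --region asia-south1 \
  --min-instances 2
```

Updating min-instances creates a new revision but does not restart in-flight requests, so this can be done safely while the service is live.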


 
  


  