
ADK with GKE for Large-Scale Agent Deployments

Author: Venkata Sudhakar

When ADK agents need to serve thousands of concurrent users, a single Cloud Run instance is insufficient. Google Kubernetes Engine (GKE) provides the infrastructure layer for horizontally scaling agent services, managing multiple agent replicas, and implementing advanced deployment patterns like canary releases and blue-green deployments.

ShopMax India deploys its customer service agent fleet on GKE with autoscaling enabled. During Diwali sales, traffic spikes 15x. With GKE, the agent pods scale automatically from 3 replicas to 45 within minutes, handling the load without any manual intervention. After the sale, pods scale back down to save costs.

The example below shows a complete GKE deployment configuration for an ADK agent service with horizontal pod autoscaling.
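A manifest consistent with the resource names in the output that follows might look like the sketch below. The names, namespace, replica range (3 to 45), and the 60% CPU target come from this article; the container image path, port, and resource requests are illustrative assumptions.

```yaml
# deployment.yaml -- apply with: kubectl apply -f deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shopmax-returns-agent
  namespace: agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shopmax-returns-agent
  template:
    metadata:
      labels:
        app: shopmax-returns-agent
    spec:
      containers:
      - name: agent
        image: gcr.io/shopmax/returns-agent:v1   # assumed image path
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health    # the FastAPI health endpoint shown later
            port: 8080
        resources:
          requests:
            cpu: "500m"      # HPA utilisation is computed against this request
            memory: 512Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: returns-agent-hpa
  namespace: agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shopmax-returns-agent
  minReplicas: 3
  maxReplicas: 45
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```

The readiness probe points at the service's `/health` route, so GKE only routes traffic to pods whose agent has finished initialising.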


It gives the following output:

deployment.apps/shopmax-returns-agent created
horizontalpodautoscaler.autoscaling/returns-agent-hpa created

$ kubectl get pods -n agents
NAME                                   READY   STATUS    RESTARTS
shopmax-returns-agent-6d8f9c4-2xkqp   1/1     Running   0
shopmax-returns-agent-6d8f9c4-8mnvr   1/1     Running   0
shopmax-returns-agent-6d8f9c4-p9wzt   1/1     Running   0

The example below shows the FastAPI health endpoint and agent handler that run inside each GKE pod.


It gives the following output:

GET /health
{"status": "ok", "agent": "returns_agent"}

POST /chat {"user_id": "CUST-4421", "message": "Return order ORD-7823"}
{"response": "Return initiated for ORD-7823. Refund of Rs 12,999 credited in 5 days.",
 "user_id": "CUST-4421"}

GKE gives ShopMax India enterprise-grade reliability for its agent fleet. The agent is initialised once per pod at startup rather than per request, which eliminates the cold-start overhead of repeated model client creation. With the HPA targeting 60% CPU utilisation, the cluster scales out before agents become overloaded, maintaining consistent response times across peak and off-peak periods.
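The HPA's scaling decisions follow the standard Kubernetes formula, desired = ceil(currentReplicas × currentMetric / targetMetric), re-evaluated on each reconcile loop. A quick sketch of the arithmetic (the CPU figures are illustrative, not from the article):

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float = 60) -> int:
    """Core HPA formula: desired = ceil(current * currentMetric / target)."""
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)


# A traffic spike pushing the 3-replica baseline to ~90% average CPU:
print(desired_replicas(3, 90))    # scale out: ceil(3 * 90 / 60) = 5

# After the sale, 45 replicas idling at ~20% CPU scale back in:
print(desired_replicas(45, 20))   # scale in: ceil(45 * 20 / 60) = 15
```

In practice the controller repeats this calculation every sync period, so a sustained 15x spike walks the Deployment up toward the 45-replica ceiling over several reconciles rather than in one jump.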
