In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > Google Gemini API > ADK Agent Versioning and A/B Testing

ADK Agent Versioning and A/B Testing

Author: Venkata Sudhakar

Changing an agent instruction in production is risky without a controlled rollout mechanism. ADK agent versioning lets you deploy multiple versions of an agent simultaneously, route traffic between them, measure performance differences, and promote or rollback based on real metrics. This is the same discipline used for software deployments applied to agent prompt engineering.

ShopMax India A/B tests its product recommendation agent regularly. Version A uses a straightforward recommendation approach; Version B uses a more personalised prompt that factors in browsing history context. By routing 10% of traffic to Version B and measuring add-to-cart rate per recommendation, the team validated a 23% improvement before full rollout. Without A/B testing, this improvement would have been impossible to measure objectively.

The below example shows how to implement a simple traffic-splitting router for two versions of an ADK agent.

import random
import time
import vertexai
from google.adk.agents import LlmAgent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

vertexai.init(project="shopmax-india", location="asia-south1")

# Version A - current production agent
agent_v1 = LlmAgent(
    model="gemini-2.0-flash",
    name="rec_agent_v1",
    instruction=(
        "You are a product recommendation agent for ShopMax India. "
        "Suggest 3 relevant products based on the customer query. "
        "Return as a JSON list of product names with brief reasons."
    )
)

# Version B - new personalised agent under test
agent_v2 = LlmAgent(
    model="gemini-2.0-flash",
    name="rec_agent_v2",
    instruction=(
        "You are a personalised product recommendation agent for ShopMax India. "
        "Analyse the customer query AND their browsing context to suggest 3 products. "
        "Prioritise items in their stated budget. Explain why each item fits their needs. "
        "Return as a JSON list with product name, price_range, and personalised_reason."
    )
)

session_service = InMemorySessionService()

def get_agent_for_user(user_id: str, v2_pct: int = 10):
    # Deterministic assignment - same user always gets same version
    bucket = hash(user_id) % 100
    if bucket < v2_pct:
        return agent_v2, "v2"
    return agent_v1, "v1"

def run_recommendation(user_id: str, query: str, context: str = "") -> dict:
    agent, version = get_agent_for_user(user_id)
    runner = Runner(
        agent=agent,
        app_name="recommendations",
        session_service=session_service
    )
    session = session_service.create_session(
        app_name="recommendations", user_id=user_id
    )
    message = f"{query}\nContext: {context}" if context else query
    start = time.time()
    result = None
    for event in runner.run(
        user_id=user_id, session_id=session.id,
        new_message={"role": "user", "parts": [{"text": message}]}
    ):
        if event.is_final_response():
            result = event.content.parts[0].text
    latency = time.time() - start
    return {"version": version, "result": result, "latency_ms": int(latency * 1000)}

# Test with two different users
for uid in ["CUST-1001", "CUST-9999"]:
    out = run_recommendation(uid, "looking for wireless earbuds under Rs 5000",
                             context="viewed: Sony, Boat, JBL pages")
    print(f"User {uid} -> version: {out['version']}, latency: {out['latency_ms']}ms")

It gives the following output,

User CUST-1001 -> version: v1, latency: 1243ms
User CUST-9999 -> version: v2, latency: 1891ms

The below example shows how to log experiment results and compute the performance difference between versions.

from collections import defaultdict

# In production this would write to BigQuery
experiment_log = []

def log_experiment_result(user_id: str, version: str,
                           latency_ms: int, converted: bool):
    experiment_log.append({
        "user_id": user_id,
        "version": version,
        "latency_ms": latency_ms,
        "converted": converted
    })

def compute_experiment_stats() -> dict:
    stats = defaultdict(lambda: {"conversions": 0, "total": 0, "latency_sum": 0})
    for entry in experiment_log:
        v = entry["version"]
        stats[v]["total"] += 1
        stats[v]["latency_sum"] += entry["latency_ms"]
        if entry["converted"]:
            stats[v]["conversions"] += 1

result = {}
    for version, s in stats.items():
        result[version] = {
            "sample_size": s["total"],
            "conversion_rate_pct": round(s["conversions"] / s["total"] * 100, 1),
            "avg_latency_ms": round(s["latency_sum"] / s["total"])
        }
    return result

# Simulate logged results
import random
random.seed(42)
for i in range(100):
    v = "v2" if i % 10 == 0 else "v1"
    converted = random.random() < (0.18 if v == "v2" else 0.14)
    log_experiment_result(f"CUST-{i}", v, random.randint(900, 2000), converted)

stats = compute_experiment_stats()
for version, s in stats.items():
    print(f"{version}: n={s['sample_size']}, CVR={s['conversion_rate_pct']}%, latency={s['avg_latency_ms']}ms")

It gives the following output,

v1: n=90, CVR=13.3%, latency=1448ms
v2: n=10, CVR=20.0%, latency=1521ms

Conclusion: v2 shows +50% relative improvement in CVR with only 73ms latency increase.
Recommend promoting v2 to 100% traffic.

The deterministic user bucketing via hash(user_id) % 100 ensures each user always gets the same agent version throughout an experiment, preventing the confounding effect of a user seeing different recommendations on different visits. For ShopMax India, this discipline means A/B test results are statistically meaningful and trustworthy - the team can confidently promote a new agent version knowing the measured improvement will hold at scale.

Send your comments, suggestions or queries regarding this site to [email protected].