ADK A/B Testing Agents
Author: Venkata Sudhakar
A/B testing lets you compare two versions of an ADK agent before committing to one in production. For ShopMax India, you might want to test whether a more detailed system instruction produces better customer responses, or whether a different model (gemini-2.0-flash vs gemini-1.5-pro) gives higher quality answers at acceptable latency.
The pattern routes each request to either Agent A or Agent B based on a hash of the user ID, so each user is consistently assigned to the same variant. Metrics are collected per variant so the two can be compared after the test period.
A run of the test over six users (u001 to u006) gives the following output:
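The routing and metric collection described above can be sketched roughly as follows. This is a minimal illustration, not a definitive ADK implementation: `call_agent` is a hypothetical stub standing in for the real agent invocation, the two instruction strings are invented for the example, and both variants are assumed to use `gemini-2.0-flash`. Because the agent call is stubbed, the latencies it records are not comparable to real model latencies.

```python
import hashlib
import time
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Variant:
    name: str
    model: str
    instruction: str


# Hypothetical variant definitions; the instruction texts are illustrative only.
VARIANTS = {
    "A": Variant("A", "gemini-2.0-flash", "Answer product questions concisely."),
    "B": Variant(
        "B",
        "gemini-2.0-flash",
        "Answer product questions in a friendly tone and offer help with ordering.",
    ),
}


def assign_variant(user_id: str) -> str:
    """Map a user ID to 'A' or 'B' using a stable hash.

    hashlib is used rather than the built-in hash(), whose value for strings
    changes between Python processes unless PYTHONHASHSEED is fixed.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def call_agent(variant: Variant, user_id: str, query: str) -> str:
    """Stub for the real agent call (an assumption, not an ADK API)."""
    return f"[{variant.model}] reply to {query!r}"


# Latency samples in milliseconds, keyed by variant name.
latencies_ms = defaultdict(list)


def handle_request(user_id: str, query: str):
    """Route one request to its variant and record how long the call took."""
    key = assign_variant(user_id)
    start = time.perf_counter()
    response = call_agent(VARIANTS[key], user_id, query)
    latencies_ms[key].append((time.perf_counter() - start) * 1000)
    return key, response


def summarize() -> dict:
    """Return request count and average latency for each variant seen so far."""
    return {
        key: {"requests": len(samples), "avg_latency_ms": mean(samples)}
        for key, samples in latencies_ms.items()
    }


if __name__ == "__main__":
    for uid in ["u001", "u002", "u003", "u004", "u005", "u006"]:
        variant_key, reply = handle_request(uid, "Is the Samsung Galaxy S24 in stock?")
        print(f"[Variant {variant_key}] User {uid}: {reply}")
    print(summarize())
```

Hashing the user ID rather than picking a variant at random per request keeps the assignment deterministic without storing any state, but the split is only approximately even for a small number of users.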
[Variant A] User u001: 1842ms
Response: Samsung Galaxy S24 is available at Rs 74,999 with 45 units in stock...
[Variant B] User u002: 1956ms
Response: Great news! The Samsung Galaxy S24 is priced at Rs 74,999 and we have
45 units in stock. Would you like help placing an order?...
[Variant A] User u003: 1798ms
Response: Samsung Galaxy S24 is available at Rs 74,999 with 45 units in stock...
[Variant B] User u004: 2011ms
Response: Great news! The Samsung Galaxy S24 is priced at Rs 74,999...
[Variant A] User u005: 1821ms
Response: Samsung Galaxy S24 is available at Rs 74,999...
[Variant B] User u006: 1978ms
Response: Great news! The Samsung Galaxy S24 is priced at Rs 74,999...
--- A/B Test Summary ---
Variant A: 3 requests, avg latency 1820 ms
Variant B: 3 requests, avg latency 1982 ms
Variant B produces friendlier, more sales-oriented responses but costs an extra 162 ms on average, most likely because the longer instruction increases the number of prompt tokens the model has to process. With only three requests per variant, treat this gap as indicative rather than conclusive. For ShopMax India, this is an acceptable trade-off if the conversion rate improves. Track click-through-to-purchase rates alongside latency to make the final decision.
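The averages in the summary and the 162 ms gap can be checked directly from the six latencies in the output above. The short calculation below uses only those reported numbers; the variable names are chosen for the example.

```python
from statistics import mean

# Latencies (ms) copied from the sample output, grouped by variant.
variant_a_ms = [1842, 1798, 1821]
variant_b_ms = [1956, 2011, 1978]

avg_a = mean(variant_a_ms)
avg_b = mean(variant_b_ms)

# The summary reports averages rounded to whole milliseconds.
rounded_a = round(avg_a)
rounded_b = round(avg_b)
rounded_gap = rounded_b - rounded_a

# Gap computed from the unrounded averages, for comparison.
exact_gap = avg_b - avg_a

print(f"Variant A: {len(variant_a_ms)} requests, avg latency {rounded_a} ms")
print(f"Variant B: {len(variant_b_ms)} requests, avg latency {rounded_b} ms")
print(f"Gap between rounded averages: {rounded_gap} ms")
print(f"Gap between unrounded averages: {exact_gap:.1f} ms")
```

The 162 ms figure is the difference between the two rounded averages; computed from the unrounded averages, the gap is about 161.3 ms.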
In production, replace InMemorySessionService with Firestore-backed sessions so that the same user always gets the same variant across browser sessions. Store variant assignments in a separate collection for analysis, and use Cloud Monitoring dashboards to visualise the latency and quality differences over time.
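One way to keep assignments both sticky and available for analysis is to write the variant to a store the first time a user is seen and read it back on later requests. The sketch below is a hedged illustration: a plain dictionary stands in for the separate assignments collection, and `get_or_assign`, `hash_variant` and the record fields are hypothetical names rather than part of ADK or Firestore.

```python
import hashlib
from datetime import datetime, timezone
from typing import MutableMapping


def hash_variant(user_id: str) -> str:
    """Pick 'A' or 'B' from a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def get_or_assign(user_id: str, store: MutableMapping[str, dict]) -> str:
    """Return the stored variant for a user, creating the record if missing.

    `store` is any mapping keyed by user ID; in production it would be
    replaced by reads and writes against a persistent collection.
    """
    record = store.get(user_id)
    if record is None:
        record = {
            "variant": hash_variant(user_id),
            "assigned_at": datetime.now(timezone.utc).isoformat(),
        }
        store[user_id] = record
    return record["variant"]


# In-memory stand-in for the assignments collection.
assignments = {}
first = get_or_assign("u001", assignments)
second = get_or_assign("u001", assignments)
print(first == second, assignments["u001"]["variant"])
```

Persisting the assignment, instead of recomputing the hash on every request, means a user stays on the same variant even if the split rule is changed part-way through the test.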