
ADK Shadow Mode Evaluation

Author: Venkata Sudhakar

Shadow mode evaluation lets you run a candidate agent in parallel with your production agent, capturing its outputs for analysis without exposing users to potentially worse responses. For ShopMax India, this is the safest way to test a new model version or updated instructions before a full rollout.

The shadow runner receives the same input as production and generates its own response, but that response is never returned to the user. Both responses are logged for offline comparison using quality metrics or human review.


A sample run over three order-status queries gives the following output:

[User u101] -> Your order ORD-441 has been dispatched and is expected to arrive in 2 days in Bangalore.
[User u102] -> Your order ORD-552 has been dispatched and is expected to arrive in 2 days in Bangalore.
[User u103] -> Your order ORD-663 has been dispatched and is expected to arrive in 2 days in Bangalore.

--- Shadow Log (for review) ---
Query: What is the status of my order ORD-441?
  PROD:   Your order ORD-441 has been dispatched and is expected to arrive in 2 days in Bangalore.
  SHADOW: Status: Dispatched. Your order ORD-441 will arrive in Bangalore in 2 days.
          Need help? Contact ShopMax India support at any time.

Query: What is the status of my order ORD-552?
  PROD:   Your order ORD-552 has been dispatched...
  SHADOW: Status: Dispatched. Your order ORD-552 will arrive in Bangalore in 2 days...

The shadow log shows both responses side by side. The shadow agent adds a support contact offer as instructed, making responses slightly longer but more helpful. After reviewing 1,000 shadow comparisons, ShopMax India can decide whether to promote the shadow agent to production based on human reviewer ratings or automated quality scoring.
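The pattern described above (same input to both agents, only the production reply returned, both replies logged) can be sketched roughly as follows. This is a minimal illustration rather than ADK code: `ShadowRunner`, `ShadowRecord`, `prod_stub`, and `shadow_stub` are hypothetical names, and the two agents are plain Python callables standing in for real agent calls. The sketch also assumes that a failure in the shadow agent should never affect the user, so shadow errors are swallowed and recorded as `None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumption: an "agent" is any callable that maps a query string to a reply.
Agent = Callable[[str], str]


@dataclass
class ShadowRecord:
    user_id: str
    query: str
    prod_response: str
    shadow_response: Optional[str]  # None if the shadow agent raised


class ShadowRunner:
    """Hypothetical wrapper: serves production replies, logs shadow replies."""

    def __init__(self, prod_agent: Agent, shadow_agent: Agent) -> None:
        self.prod_agent = prod_agent
        self.shadow_agent = shadow_agent
        self.log: List[ShadowRecord] = []  # in-memory log for offline review

    def handle(self, user_id: str, query: str) -> str:
        # The production reply is the only one the user ever sees.
        prod_response = self.prod_agent(query)

        # The shadow agent gets the same input; its errors must not reach the user.
        try:
            shadow_response: Optional[str] = self.shadow_agent(query)
        except Exception:
            shadow_response = None

        self.log.append(
            ShadowRecord(user_id, query, prod_response, shadow_response)
        )
        return prod_response


# Stand-in agents, used only to exercise the runner.
def prod_stub(query: str) -> str:
    return "PROD: " + query


def shadow_stub(query: str) -> str:
    return "SHADOW: " + query


runner = ShadowRunner(prod_stub, shadow_stub)
reply = runner.handle("u101", "What is the status of my order ORD-441?")
```

Here `reply` holds only the production response, while `runner.log` keeps both responses for later review. In a real deployment the shadow call would more likely run in the background so it adds no latency to the user-facing path; the sketch runs it inline for simplicity.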

In production, write shadow logs to BigQuery instead of an in-memory list. Use a BigQuery table with columns for timestamp, user_id, query, prod_response, and shadow_response. Then run SQL queries to compare response lengths, keyword presence, and sentiment scores across thousands of real customer interactions before making the promotion decision.
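As a rough illustration of the comparison step, the sketch below builds rows with the five columns named above and computes two of the suggested metrics, average response length and keyword presence, in plain Python. It is a local stand-in for the SQL you would run in BigQuery, not BigQuery client code; `make_row` and `summarize` are hypothetical helper names, and sentiment scoring is left out because it needs an external model.

```python
from datetime import datetime, timezone
from typing import Dict, List


def make_row(user_id: str, query: str,
             prod_response: str, shadow_response: str) -> Dict[str, str]:
    """Build one log row using the column names suggested in the text."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "prod_response": prod_response,
        "shadow_response": shadow_response,
    }


def summarize(rows: List[Dict[str, str]], keyword: str) -> Dict[str, float]:
    """Compare average length and keyword presence for prod vs. shadow."""
    n = len(rows)
    if n == 0:
        return {
            "avg_prod_len": 0.0,
            "avg_shadow_len": 0.0,
            "prod_keyword_rate": 0.0,
            "shadow_keyword_rate": 0.0,
        }

    kw = keyword.lower()  # case-insensitive keyword match
    return {
        "avg_prod_len": sum(len(r["prod_response"]) for r in rows) / n,
        "avg_shadow_len": sum(len(r["shadow_response"]) for r in rows) / n,
        "prod_keyword_rate":
            sum(kw in r["prod_response"].lower() for r in rows) / n,
        "shadow_keyword_rate":
            sum(kw in r["shadow_response"].lower() for r in rows) / n,
    }
```

Calling `summarize(rows, "support")` on a batch of logged rows would show, for example, how often each agent mentions support and how much longer the shadow replies are on average.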


 
  


  