ADK Evaluation Pipelines
Author: Venkata Sudhakar
An evaluation pipeline runs your ADK agent against a fixed set of test cases after every code change, scoring responses automatically. For ShopMax India, this means you catch regressions before deployment - not after customers complain. The pipeline tracks score history so you can see whether a new instruction or model version improved or hurt quality.
The example below defines a test dataset of realistic ShopMax India queries, runs them through the agent, scores each response on correctness and completeness, and fails the pipeline if the overall score drops below a threshold.
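A minimal, self-contained sketch of such a pipeline is shown here. The agent call, test cases, and keyword-based scoring are illustrative stand-ins, not the actual ShopMax India script: a production pipeline would invoke a real ADK agent and could use a richer scorer (for example, an LLM-as-judge for correctness and completeness).

```python
import sys

# Hypothetical stand-in for a real ADK agent. In production this function
# would call your deployed ADK agent instead of returning canned answers.
def run_agent(query: str) -> str:
    canned = {
        "How many Samsung Galaxy S24 are in stock?":
            "We have 120 units of Samsung Galaxy S24 (SKU-7821) in stock "
            "at the Mumbai warehouse.",
        "What is the price of SKU-7821?":
            "The price of SKU-7821 is Rs 74,999.",
    }
    return canned.get(query, "I could not find that product.")

# Each test case lists the facts a correct, complete answer must contain.
TEST_CASES = [
    {"id": "TC-001",
     "query": "How many Samsung Galaxy S24 are in stock?",
     "expected_facts": ["120", "SKU-7821", "Mumbai"]},
    {"id": "TC-002",
     "query": "What is the price of SKU-7821?",
     "expected_facts": ["74,999", "SKU-7821"]},
]

THRESHOLD = 0.75

def score_response(response: str, expected_facts: list) -> float:
    """Score a response as the fraction of expected facts it contains."""
    hits = sum(1 for fact in expected_facts if fact in response)
    return hits / len(expected_facts)

def run_pipeline() -> int:
    """Run all test cases; return 0 on pass, 1 on fail (for CI exit code)."""
    scores = []
    for case in TEST_CASES:
        response = run_agent(case["query"])
        score = score_response(response, case["expected_facts"])
        scores.append(score)
        status = "PASS" if score >= THRESHOLD else "FAIL"
        print(f"[{status}] {case['id']} score={score:.3f}: {response}")
    overall = sum(scores) / len(scores)
    print(f"Overall score: {overall:.3f} (threshold={THRESHOLD})")
    if overall < THRESHOLD:
        print("PIPELINE FAILED - do not deploy")
        return 1
    print("PIPELINE PASSED - safe to deploy")
    return 0

# In CI, wire the return value into the process exit code:
#     sys.exit(run_pipeline())
```

The return value maps directly to a process exit code, which is what lets Cloud Build or GitHub Actions fail the build on a regression.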
Running the pipeline produces output like the following:
[PASS] TC-001 score=0.900: We have 120 units of Samsung Galaxy S24 (SKU-7821) in stock at the Mumbai warehouse.
[PASS] TC-002 score=0.833: The price of SKU-7821 is Rs 74,999.
[PASS] TC-003 score=0.875: SKU-1001 (OnePlus 12) has 45 units in stock at the Mumbai warehouse.
[PASS] TC-004 score=0.833: Yes, SKU-2002 (Redmi Note 13) is available with 300 units in stock.
Overall score: 0.860 (threshold=0.75)
PIPELINE PASSED - safe to deploy
The pipeline exits with code 1 on failure, which integrates directly with Cloud Build and GitHub Actions. Add this script as a build step in your cloudbuild.yaml so that every pull request runs the evaluation before merging. Store the score history in Cloud Firestore indexed by commit SHA so you can trace exactly which change caused a regression.
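A sketch of the corresponding Cloud Build step is below. The script name, Python image version, and step ordering are assumptions for illustration; adapt them to your repository layout.

```yaml
steps:
  # Run the evaluation pipeline first; a non-zero exit code fails the build,
  # so later deployment steps never run after a regression.
  - name: 'python:3.12'
    entrypoint: 'python'
    args: ['evaluate_agent.py']
  # ... deployment steps follow here and run only if evaluation passed ...
```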
For ShopMax India, maintain separate evaluation datasets for each agent type - inventory, order support, pricing, and customer service. Aim for at least 50 test cases per agent covering edge cases such as out-of-stock products, invalid SKUs, and multi-product queries. A score of 0.85 or above on all datasets is a reasonable gate for production deployment of any ShopMax India ADK agent.
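The per-agent gate described above can be sketched as a small check over the latest overall score for each dataset. The score values and agent names here are hypothetical placeholders; in practice they would be loaded from your score history store.

```python
PRODUCTION_GATE = 0.85

# Hypothetical latest overall scores per agent dataset, e.g. loaded from
# the score history keyed by commit SHA.
agent_scores = {
    "inventory": 0.91,
    "order_support": 0.88,
    "pricing": 0.86,
    "customer_service": 0.87,
}

def gate_deployment(scores: dict, gate: float = PRODUCTION_GATE) -> bool:
    """Allow deployment only if every agent dataset meets the gate."""
    failing = {name: s for name, s in scores.items() if s < gate}
    for name, s in sorted(failing.items()):
        print(f"[BLOCKED] {name}: {s:.2f} < {gate}")
    return not failing

print("Deploy" if gate_deployment(agent_scores) else "Hold deployment")
```

Gating on every dataset, rather than an average across agents, prevents a strong inventory agent from masking a regression in, say, the pricing agent.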