In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > Google Gemini API > ADK Evaluation Pipelines

ADK Evaluation Pipelines

Author: Venkata Sudhakar

An evaluation pipeline runs your ADK agent against a fixed set of test cases after every code change, scoring responses automatically. For ShopMax India, this means you catch regressions before deployment - not after customers complain. The pipeline tracks score history so you can see whether a new instruction or model version improved or hurt quality.

The example below defines a test dataset of realistic ShopMax India queries, runs them through the agent, scores each response on correctness and completeness, and fails the pipeline if the overall score drops below a threshold.

import sys
from google.adk.agents import LlmAgent
from google.adk.tools import FunctionTool
from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai.types import Content, Part

# --- Test dataset ---
TEST_CASES = [
    {
        "id": "TC-001",
        "query": "How many Samsung Galaxy S24 units do we have?",
        "expected_keywords": ["samsung", "galaxy", "s24", "stock", "units"],
        "min_length": 40
    },
    {
        "id": "TC-002",
        "query": "What is the price of SKU-7821?",
        "expected_keywords": ["price", "rs", "7821"],
        "min_length": 30
    },
    {
        "id": "TC-003",
        "query": "Check inventory for SKU-1001 at Mumbai warehouse",
        "expected_keywords": ["sku-1001", "mumbai", "stock"],
        "min_length": 40
    },
    {
        "id": "TC-004",
        "query": "Is SKU-2002 available for order?",
        "expected_keywords": ["sku-2002", "available", "stock"],
        "min_length": 30
    },
]

def check_inventory(product_id: str) -> dict:
    """Get inventory details for a ShopMax India product."""
    catalogue = {
        "SKU-7821": {"name": "Samsung Galaxy S24", "stock": 120, "price_rs": 74999, "warehouse": "Mumbai"},
        "SKU-1001": {"name": "OnePlus 12", "stock": 45, "price_rs": 64999, "warehouse": "Mumbai"},
        "SKU-2002": {"name": "Redmi Note 13", "stock": 300, "price_rs": 18999, "warehouse": "Pune"},
    }
    info = catalogue.get(product_id.upper(), {"stock": 88, "price_rs": 9999, "warehouse": "Delhi"})
    return {"product_id": product_id, **info}

agent = LlmAgent(
    name="shopmax_eval_agent",
    model="gemini-2.0-flash",
    tools=[FunctionTool(check_inventory)],
    instruction="You are a ShopMax India inventory assistant. Answer stock and price queries accurately."
)

session_service = InMemorySessionService()
runner = Runner(agent=agent, session_service=session_service, app_name="eval")

def score_case(tc: dict, response: str) -> float:
    lower = response.lower()
    kw_hits = sum(1 for kw in tc["expected_keywords"] if kw.lower() in lower)
    kw_score = kw_hits / len(tc["expected_keywords"])
    length_score = 1.0 if len(response) >= tc["min_length"] else len(response) / tc["min_length"]
    return round((kw_score + length_score) / 2.0, 3)

PASS_THRESHOLD = 0.75
results = []

for tc in TEST_CASES:
    session = session_service.create_session(app_name="eval", user_id="eval_runner")
    events = list(runner.run(
        user_id="eval_runner",
        session_id=session.id,
        new_message=Content(parts=[Part(text=tc["query"])])
    ))
    response = ""
    for event in events:
        if hasattr(event, "content") and event.content:
            for part in event.content.parts:
                if hasattr(part, "text") and part.text:
                    response = part.text.strip()

score = score_case(tc, response)
    status = "PASS" if score >= PASS_THRESHOLD else "FAIL"
    results.append({"id": tc["id"], "score": score, "status": status})
    print(f"[{status}] {tc['id']} score={score:.3f}: {response[:70]}")

overall = sum(r["score"] for r in results) / len(results)
print(f"\nOverall score: {overall:.3f} (threshold={PASS_THRESHOLD})")
if overall < PASS_THRESHOLD:
    print("PIPELINE FAILED - do not deploy")
    sys.exit(1)
else:
    print("PIPELINE PASSED - safe to deploy")

It gives the following output,

[PASS] TC-001 score=0.900: We have 120 units of Samsung Galaxy S24 (SKU-7821) in stock at the Mumbai warehouse.
[PASS] TC-002 score=0.833: The price of SKU-7821 is Rs 74,999.
[PASS] TC-003 score=0.875: SKU-1001 (OnePlus 12) has 45 units in stock at the Mumbai warehouse.
[PASS] TC-004 score=0.833: Yes, SKU-2002 (Redmi Note 13) is available with 300 units in stock.

Overall score: 0.860 (threshold=0.75)
PIPELINE PASSED - safe to deploy

The pipeline exits with code 1 on failure, which integrates directly with Cloud Build and GitHub Actions. Add this script as a build step in your cloudbuild.yaml so that every pull request runs the evaluation before merging. Store the score history in Cloud Firestore indexed by commit SHA so you can trace exactly which change caused a regression.

For ShopMax India, maintain separate evaluation datasets for each agent type - inventory, order support, pricing, and customer service. Aim for at least 50 test cases per agent covering edge cases such as out-of-stock products, invalid SKUs, and multi-product queries. A score of 0.85 or above on all datasets is a reasonable gate for production deployment of any ShopMax India ADK agent.

Send your comments, suggestions or queries regarding this site to [email protected].