Agent Engine Monitoring and Logging
Author: Venkata Sudhakar
When your ADK agent is deployed to Vertex AI Agent Engine, every request is automatically logged to Google Cloud Logging and emits metrics to Cloud Monitoring. You get full observability out of the box, with no instrumentation code: every user message, tool call, model response, and error is captured as a structured log entry. Cloud Monitoring dashboards show request rates, latency percentiles, error rates, and token usage over time - the same enterprise observability stack that powers all Google Cloud services.

Agent Engine logs appear under the resource type aiplatform.googleapis.com/ReasoningEngine in Cloud Logging. Each log entry contains the session_id, user_id, event type, and full content. Query logs using the gcloud CLI, the Logs Explorer UI, or the Cloud Logging Python client. Create Cloud Monitoring alert policies on error_rate to get paged when failures spike. Export logs to BigQuery for long-term analytics on usage patterns, popular queries, and tool call frequencies across your entire user base.

The examples below show the key monitoring workflows: querying logs via gcloud and Python, computing daily metrics, and creating an automated daily health report for the operations team.
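As a starting point, the log filter itself can be assembled programmatically. A minimal sketch - the severity level and time window here are illustrative placeholders, and the resulting filter string works equally with gcloud logging read, the Logs Explorer UI, or the Python client:

```python
from datetime import datetime, timedelta, timezone

def reasoning_engine_filter(min_severity="DEFAULT", hours=24):
    """Build a Cloud Logging filter for Agent Engine (ReasoningEngine) entries.

    The resource type is fixed by Agent Engine; severity and the time
    window are example parameters you can tune.
    """
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    return " AND ".join([
        'resource.type="aiplatform.googleapis.com/ReasoningEngine"',
        f"severity>={min_severity}",
        f'timestamp>="{since.isoformat(timespec="seconds")}"',
    ])

# Recent errors only; pass the same string to: gcloud logging read "<filter>"
print(reasoning_engine_filter("ERROR", hours=1))
```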
The Python Cloud Logging client enables programmatic metrics extraction.
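A sketch of that extraction, assuming the log entries have already been fetched into a list of dicts (for example via list_entries(filter_=...) on a google.cloud.logging client). The field names session_id, event_type, tool_name, and severity mirror the structured fields described above, but check them against your actual log payloads:

```python
from collections import Counter

def daily_health_report(entries):
    """Summarize one day of Agent Engine log entries into health metrics."""
    sessions = {e["session_id"] for e in entries if e.get("session_id")}
    tools = Counter(e["tool_name"] for e in entries
                    if e.get("event_type") == "tool_call")
    errors = sum(1 for e in entries if e.get("severity") == "ERROR")
    error_rate = errors / len(entries) * 100 if entries else 0.0
    return {
        "total_events": len(entries),
        "unique_sessions": len(sessions),
        "tool_calls": sum(tools.values()),
        "errors": errors,
        "error_rate_pct": round(error_rate, 2),
        "tool_usage": dict(tools.most_common()),
        "status": "HEALTHY" if error_rate < 1.0 else "DEGRADED",
    }

# Tiny fabricated sample standing in for entries fetched from Cloud Logging.
sample = [
    {"session_id": "s1", "event_type": "tool_call", "tool_name": "get_order_status"},
    {"session_id": "s1", "event_type": "model_response"},
    {"session_id": "s2", "event_type": "tool_call", "tool_name": "check_availability"},
    {"session_id": "s2", "event_type": "error", "severity": "ERROR"},
]
report = daily_health_report(sample)
print("=== AGENT ENGINE DAILY HEALTH REPORT ===")
for key, value in report.items():
    print(f"{key}: {value}")
```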
It produces output like the following daily health report:
=== AGENT ENGINE DAILY HEALTH REPORT ===
Total log events: 1842
Unique sessions: 324
Tool calls: 891
Errors: 7
Error rate: 0.38%
Tool usage breakdown:
get_order_status: 412
check_availability: 287
estimate_delivery: 192
Overall status: HEALTHY
# Error rate below 1% - green
# get_order_status most used tool - prioritise its reliability
# 324 sessions today - compare to yesterday for trend
Alerts notify the ops team automatically when the error rate spikes above a threshold. Creating the notification channel and alert policy produces output like:
Created notification channel [projects/your-gcp-project/notificationChannels/123456]
Created alert policy [projects/your-gcp-project/alertPolicies/789012]
# When error rate exceeds 5% for 5 consecutive minutes:
# - Email sent to [email protected]
# - Alert appears in Cloud Monitoring dashboard
# - Auto-closes after 1 hour if error rate drops
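For reference, the policy behind that output can be expressed as an AlertPolicy resource body (Cloud Monitoring API v3). A sketch that builds the JSON - the metric filter, channel ID, and project name are placeholders to adapt to your setup:

```python
import json

# AlertPolicy resource matching the behaviour above: >5% error rate for
# 5 consecutive minutes, notify the ops channel, auto-close after 1 hour.
policy = {
    "displayName": "Agent Engine error rate > 5%",
    "combiner": "OR",
    "conditions": [{
        "displayName": "error_rate above 5% for 5 minutes",
        "conditionThreshold": {
            # Placeholder filter; point it at your error-rate metric.
            "filter": 'resource.type="aiplatform.googleapis.com/ReasoningEngine"',
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0.05,
            "duration": "300s",
        },
    }],
    "notificationChannels": [
        "projects/your-gcp-project/notificationChannels/123456",
    ],
    "alertStrategy": {"autoClose": "3600s"},
}
print(json.dumps(policy, indent=2))
# Apply with e.g.: gcloud alpha monitoring policies create --policy-from-file=policy.json
```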
# Key metrics to monitor in production:
# - request_count by response_code (track 200 vs errors)
# - request_latencies p50/p95/p99 (detect slowdowns)
# - active_sessions count (scale planning)
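To make the latency targets concrete, here is a small sketch of computing p50/p95/p99 (nearest-rank method) over a batch of request latencies - for example, values pulled from the request_latencies metric; the sample numbers are fabricated:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(0, rank - 1)]

# Fabricated request latencies in milliseconds.
latencies_ms = [120, 180, 200, 250, 300, 450, 800, 1200, 4000, 9000]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

Note how a handful of slow requests dominates the tail: p50 looks fine while p95/p99 reveal the slowdown, which is why alerting on averages alone hides problems.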
Monitoring best practices for Agent Engine: set alerts for error_rate above 1% (warning) and 5% (critical). Track p95 latency - if it exceeds 5 seconds, users perceive the agent as slow. Use Cloud Logging log-based metrics to count specific event types such as tool call failures or escalations - these business metrics are more meaningful than generic error counts. Export logs to BigQuery weekly for trend analysis: which questions are most common, which tools are called most often, and which sessions take the most turns to resolve. Use these insights to improve your agent's instructions and add missing tools. Retention: Agent Engine keeps logs in Cloud Logging for 30 days by default; for compliance requirements, create a log sink to Cloud Storage or BigQuery for longer retention.
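Once a sink is exporting logs to BigQuery, the trend questions above become simple SQL. A sketch of the weekly tool-frequency query - the dataset/table name and jsonPayload field paths are assumptions, since exported log tables are named after the log and nest the payload, so check your sink's actual schema:

```python
# Weekly tool-usage query against logs exported to BigQuery via a log sink.
# Table name and jsonPayload paths below are illustrative placeholders.
TOOL_USAGE_SQL = """
SELECT
  jsonPayload.tool_name AS tool_name,
  COUNT(*) AS calls
FROM `your-gcp-project.agent_logs.reasoning_engine_*`
WHERE _TABLE_SUFFIX BETWEEN
      FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
  AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
  AND jsonPayload.event_type = 'tool_call'
GROUP BY tool_name
ORDER BY calls DESC
"""
print(TOOL_USAGE_SQL)
```

The same pattern (swap the SELECT and GROUP BY) answers the other trend questions, such as turns per session or most frequent user queries.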