Agent Engine Monitoring and Logging
Author: Venkata Sudhakar
When your ADK agent is deployed to Vertex AI Agent Engine, every request is automatically logged to Google Cloud Logging and emits metrics to Cloud Monitoring. You get full observability out of the box, with no instrumentation code: every user message, tool call, model response, and error is captured as a structured log entry. Cloud Monitoring dashboards show request rates, latency percentiles, error rates, and token usage over time - the same enterprise observability stack that powers all Google Cloud services.

Agent Engine logs appear under the resource type aiplatform.googleapis.com/ReasoningEngine in Cloud Logging. Each log entry contains the session_id, user_id, event type, and full content. Query logs using the gcloud CLI, the Logs Explorer UI, or the Cloud Logging Python client. Create Cloud Monitoring alert policies on error_rate to get paged when failures spike. Export logs to BigQuery for long-term analytics on usage patterns, popular queries, and tool call frequencies across your entire user base.

The examples below show the key monitoring workflows: querying logs via gcloud and Python, computing daily metrics, and creating an automated daily health report for the operations team.
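As a starting point, the log filter itself can be assembled programmatically. A minimal sketch - the severity level and time window here are illustrative placeholders, and the resulting filter string works equally with gcloud logging read, the Logs Explorer UI, or the Python client:

```python
from datetime import datetime, timedelta, timezone

def reasoning_engine_filter(min_severity="DEFAULT", hours=24):
    """Build a Cloud Logging filter for Agent Engine (ReasoningEngine) entries.

    The resource type is fixed by Agent Engine; severity and the time
    window are example parameters you can tune.
    """
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    return " AND ".join([
        'resource.type="aiplatform.googleapis.com/ReasoningEngine"',
        f"severity>={min_severity}",
        f'timestamp>="{since.isoformat(timespec="seconds")}"',
    ])

# Recent errors only; pass the same string to: gcloud logging read "<filter>"
print(reasoning_engine_filter("ERROR", hours=1))
```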
The Python Cloud Logging client enables programmatic metrics extraction.
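A sketch of that extraction, assuming the log entries have already been fetched into a list of dicts (for example via list_entries(filter_=...) on a google.cloud.logging client). The field names session_id, event_type, tool_name, and severity mirror the structured fields described above, but check them against your actual log payloads:

```python
from collections import Counter

def daily_health_report(entries):
    """Summarize one day of Agent Engine log entries into health metrics."""
    sessions = {e["session_id"] for e in entries if e.get("session_id")}
    tools = Counter(e["tool_name"] for e in entries
                    if e.get("event_type") == "tool_call")
    errors = sum(1 for e in entries if e.get("severity") == "ERROR")
    error_rate = errors / len(entries) * 100 if entries else 0.0
    return {
        "total_events": len(entries),
        "unique_sessions": len(sessions),
        "tool_calls": sum(tools.values()),
        "errors": errors,
        "error_rate_pct": round(error_rate, 2),
        "tool_usage": dict(tools.most_common()),
        "status": "HEALTHY" if error_rate < 1.0 else "DEGRADED",
    }

# Tiny fabricated sample standing in for entries fetched from Cloud Logging.
sample = [
    {"session_id": "s1", "event_type": "tool_call", "tool_name": "get_order_status"},
    {"session_id": "s1", "event_type": "model_response"},
    {"session_id": "s2", "event_type": "tool_call", "tool_name": "check_availability"},
    {"session_id": "s2", "event_type": "error", "severity": "ERROR"},
]
report = daily_health_report(sample)
print("=== AGENT ENGINE DAILY HEALTH REPORT ===")
for key, value in report.items():
    print(f"{key}: {value}")
```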
It produces output like the following daily health report:
=== AGENT ENGINE DAILY HEALTH REPORT ===
Total log events: 1842
Unique sessions: 324
Tool calls: 891
Errors: 7
Error rate: 0.38%
Tool usage breakdown:
get_order_status: 412
check_availability: 287
estimate_delivery: 192
Overall status: HEALTHY
# Error rate below 1% - green
# get_order_status most used tool - prioritise its reliability
# 324 sessions today - compare to yesterday for trend
Alerts notify the ops team automatically when the error rate spikes above a threshold. Creating the notification channel and alert policy produces output like:
Created notification channel [projects/your-gcp-project/notificationChannels/123456]
Created alert policy [projects/your-gcp-project/alertPolicies/789012]
# When error rate exceeds 5% for 5 consecutive minutes:
# - Email sent to [email protected]
# - Alert appears in Cloud Monitoring dashboard
# - Auto-closes after 1 hour if error rate drops
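For reference, the policy behind that output can be expressed as an AlertPolicy resource body (Cloud Monitoring API v3). A sketch that builds the JSON - the metric filter, channel ID, and project name are placeholders to adapt to your setup:

```python
import json

# AlertPolicy resource matching the behaviour above: >5% error rate for
# 5 consecutive minutes, notify the ops channel, auto-close after 1 hour.
policy = {
    "displayName": "Agent Engine error rate > 5%",
    "combiner": "OR",
    "conditions": [{
        "displayName": "error_rate above 5% for 5 minutes",
        "conditionThreshold": {
            # Placeholder filter; point it at your error-rate metric.
            "filter": 'resource.type="aiplatform.googleapis.com/ReasoningEngine"',
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0.05,
            "duration": "300s",
        },
    }],
    "notificationChannels": [
        "projects/your-gcp-project/notificationChannels/123456",
    ],
    "alertStrategy": {"autoClose": "3600s"},
}
print(json.dumps(policy, indent=2))
# Apply with e.g.: gcloud alpha monitoring policies create --policy-from-file=policy.json
```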
# Key metrics to monitor in production:
# - request_count by response_code (track 200 vs errors)
# - request_latencies p50/p95/p99 (detect slowdowns)
# - active_sessions count (scale planning)
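To make the latency targets concrete, here is a small sketch of computing p50/p95/p99 (nearest-rank method) over a batch of request latencies - for example, values pulled from the request_latencies metric; the sample numbers are fabricated:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(0, rank - 1)]

# Fabricated request latencies in milliseconds.
latencies_ms = [120, 180, 200, 250, 300, 450, 800, 1200, 4000, 9000]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

Note how a handful of slow requests dominates the tail: p50 looks fine while p95/p99 reveal the slowdown, which is why alerting on averages alone hides problems.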
Monitoring best practices for Agent Engine: set alerts for error_rate above 1% (warning) and 5% (critical). Track p95 latency - if it exceeds 5 seconds, users perceive the agent as slow. Use Cloud Logging log-based metrics to count specific event types such as tool call failures or escalations - these business metrics are more meaningful than generic error counts. Export logs to BigQuery weekly for trend analysis: which questions are most common, which tools are called most often, and which sessions take the most turns to resolve. Use these insights to improve your agent's instructions and add missing tools. Retention: Agent Engine keeps logs in Cloud Logging for 30 days by default; for compliance requirements, create a log sink to Cloud Storage or BigQuery for longer retention.
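Once a sink is exporting logs to BigQuery, the trend questions above become simple SQL. A sketch of the weekly tool-frequency query - the dataset/table name and jsonPayload field paths are assumptions, since exported log tables are named after the log and nest the payload, so check your sink's actual schema:

```python
# Weekly tool-usage query against logs exported to BigQuery via a log sink.
# Table name and jsonPayload paths below are illustrative placeholders.
TOOL_USAGE_SQL = """
SELECT
  jsonPayload.tool_name AS tool_name,
  COUNT(*) AS calls
FROM `your-gcp-project.agent_logs.reasoning_engine_*`
WHERE _TABLE_SUFFIX BETWEEN
      FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
  AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
  AND jsonPayload.event_type = 'tool_call'
GROUP BY tool_name
ORDER BY calls DESC
"""
print(TOOL_USAGE_SQL)
```

The same pattern (swap the SELECT and GROUP BY) answers the other trend questions, such as turns per session or most frequent user queries.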