
MCP Server Batch Processing with Dataflow

Author: Venkata Sudhakar

Google Cloud Dataflow is a fully managed service for running Apache Beam pipelines that handle both batch and streaming data processing. When AI agents need to trigger large-scale data transformations - such as aggregating millions of transactions, running ETL jobs, or processing uploaded datasets - Dataflow provides the scalable compute backbone.

ShopMax India uses Dataflow to process nightly sales aggregations, generate regional performance summaries, and run product return analytics across all cities. The MCP server below lets ADK agents launch a Dataflow batch job and check its execution status without needing to understand the underlying pipeline infrastructure.

The example below shows an MCP server that launches Dataflow batch jobs from a pre-built Flex Template and polls job status so agents can monitor progress.
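The full server code is not reproduced here, but the core of its two tools can be sketched with plain helper functions. This is a minimal illustration only: the project ID, region, template path, and bucket names are hypothetical placeholders, and a real server would register `launch_batch_job` and `get_job_status` through the MCP SDK and attach OAuth credentials. The URLs and body fields follow the public Dataflow v1b3 REST API (`flexTemplates:launch` and `jobs.get`).

```python
import json
import urllib.request

# Hypothetical project settings - replace with your own values.
PROJECT = "shopmax-analytics"
REGION = "asia-south1"
TEMPLATE_GCS_PATH = "gs://shopmax-templates/sales-agg/v1.json"
DATAFLOW_API = "https://dataflow.googleapis.com/v1b3"


def build_launch_request(job_name: str, date: str, city: str):
    """Build the URL and JSON body for a flexTemplates:launch call."""
    url = (f"{DATAFLOW_API}/projects/{PROJECT}"
           f"/locations/{REGION}/flexTemplates:launch")
    body = {
        "launchParameter": {
            "jobName": job_name,
            "containerSpecGcsPath": TEMPLATE_GCS_PATH,
            # Pipeline options consumed by the Flex Template.
            "parameters": {"date": date, "city": city},
        }
    }
    return url, body


def build_status_request(job_id: str) -> str:
    """Build the URL for a jobs.get call that polls job state."""
    return f"{DATAFLOW_API}/projects/{PROJECT}/locations/{REGION}/jobs/{job_id}"


def call_api(url: str, token: str, body=None) -> dict:
    """POST the body (or GET when body is None) with a bearer token."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The `launch_batch_job` tool would call `call_api` with the launch URL and body, then return the `job.id` from the response; `get_job_status` would GET the status URL and format the job name, `currentState`, and start time for the agent.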


A sample agent session gives the following output:

# Agent query: "Run sales aggregation for April 10 for all cities"
Tool: launch_batch_job({job_name: "sales-agg-20260410", date: "2026-04-10", city: "ALL"})

Job launched. ID: 2026-04-10_22_15_44-3847291029512345678

# Agent query: "Check status of that job"
Tool: get_job_status({job_id: "2026-04-10_22_15_44-3847291029512345678"})

Job: sales-agg-20260410 | State: JOB_STATE_RUNNING | Started: 2026-04-10T22:15:50Z

# After completion:
Job: sales-agg-20260410 | State: JOB_STATE_DONE | Started: 2026-04-10T22:15:50Z

Store Dataflow Flex Templates in GCS and version them so agents always launch known-good pipeline versions. Set worker machine type and max workers in the job environment to control costs on large processing runs. For recurring jobs, combine this MCP server with the Cloud Scheduler MCP to schedule nightly pipeline launches automatically. Always write Dataflow output to BigQuery or GCS - not directly to operational databases - to isolate batch workloads from transactional traffic.
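As one illustration of those cost controls, the launch request's `environment` block can pin the worker machine type and cap autoscaling. The field names below (`machineType`, `maxWorkers`, `tempLocation`) come from the Dataflow Flex Template runtime environment schema; the specific values and bucket path are placeholders, not part of the original server.

```python
def with_cost_controls(launch_body: dict,
                       machine_type: str = "n1-standard-2",
                       max_workers: int = 10) -> dict:
    """Attach a runtime environment that caps worker size and count.

    launch_body is the JSON body of a flexTemplates:launch request,
    i.e. a dict with a top-level "launchParameter" key.
    """
    launch_body["launchParameter"]["environment"] = {
        "machineType": machine_type,   # worker VM size
        "maxWorkers": max_workers,     # upper bound on autoscaling
        # Staging bucket for temporary files (assumed name).
        "tempLocation": "gs://shopmax-dataflow-tmp/tmp",
    }
    return launch_body
```

Exposing `machine_type` and `max_workers` as optional tool parameters lets an agent request a larger fleet for month-end runs while the defaults keep routine nightly jobs cheap.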
