MCP Server Rate Limiting and Throttling
Author: Venkata Sudhakar
When multiple ADK agents share an MCP server, a single agent can exhaust downstream API quotas or degrade performance for others. Rate limiting at the MCP server layer ensures fair usage across callers, protects external APIs from overload, and makes tool call behaviour predictable under high concurrency.

In this tutorial, you will add a token bucket rate limiter to an MCP server. Each caller is tracked by a client ID passed as a tool argument. If the bucket is empty, the server returns an error message instead of calling the downstream API. This approach requires no external dependencies beyond the Python standard library.

The implementation below uses a simple in-memory token bucket: each client starts with a full bucket of tokens, one token is consumed per tool call, and tokens refill at a fixed rate based on the elapsed time since the last call.
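A minimal sketch of that token bucket, using only the standard library. The class and parameter names (TokenBucket, RateLimiter, capacity, refill_rate) are illustrative choices, not part of any MCP SDK; in the actual server, the MCP tool handler would call RateLimiter.allow with the client ID argument before invoking the downstream API.

```python
import threading
import time


class TokenBucket:
    """One client's bucket: up to `capacity` tokens, refilled at `refill_rate` per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # every client starts with a full bucket
        self.last_refill = time.monotonic()

    def try_consume(self) -> bool:
        """Consume one token if available; refill first based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """Tracks one bucket per client ID; the lock keeps concurrent tool calls safe."""

    def __init__(self, capacity: float = 5, refill_rate: float = 1.0):
        self._buckets: dict[str, TokenBucket] = {}
        self._lock = threading.Lock()
        self._capacity = capacity
        self._refill_rate = refill_rate

    def allow(self, client_id: str) -> bool:
        with self._lock:
            bucket = self._buckets.setdefault(
                client_id, TokenBucket(self._capacity, self._refill_rate)
            )
            return bucket.try_consume()
```

Inside a tool handler, a denied call would return an error string such as "rate limit exceeded, retry later" to the agent instead of hitting the downstream API.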
The test below simulates burst calls from two clients. Calls within the token budget succeed immediately. Once the bucket empties, the server returns a rate limit error that the agent can surface to the user or handle with a retry strategy.
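The burst behaviour can be exercised with a short self-contained simulation. The call_tool function, client names, and the three-token budget below are illustrative assumptions standing in for the real MCP tool handler:

```python
import time

# Hypothetical per-client bucket state, mirroring the in-memory design above:
# client_id -> [remaining_tokens, last_refill_timestamp]
CAPACITY, REFILL_RATE = 3, 1.0          # 3 tokens per client, 1 token/second refill
buckets: dict[str, list[float]] = {}


def call_tool(client_id: str) -> str:
    """Stand-in for an MCP tool handler: check the bucket before doing real work."""
    now = time.monotonic()
    tokens, last = buckets.setdefault(client_id, [CAPACITY, now])
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens >= 1:
        buckets[client_id] = [tokens - 1, now]
        return "ok"
    buckets[client_id] = [tokens, now]
    return "error: rate limit exceeded, retry later"


# Burst of 5 rapid calls from each of two clients: the first 3 succeed,
# then the empty bucket produces rate limit errors the agent can retry.
for client in ("agent-a", "agent-b"):
    results = [call_tool(client) for _ in range(5)]
    print(client, results)
```

Because refill depends only on elapsed time, a client that pauses for a few seconds regains tokens and its calls succeed again without any reset logic.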
This pattern works well when a shared MCP server is accessed by many agents concurrently. For production use, replace the in-memory dictionary with Redis or Memorystore to share rate limit state across multiple server instances. You can also apply different rate limits per tool or per client tier.
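Per-tool and per-tier limits reduce to a lookup that selects bucket parameters before the check runs. The tier names, tool names, and numbers below are hypothetical; in a Redis-backed deployment only the bucket state moves out of process, while a static table like this can stay in the server:

```python
# Hypothetical (tier, tool) -> bucket parameters table. The bucket state
# itself would live in Redis or Memorystore in production, not here.
LIMITS = {
    ("free", "web_search"):    {"capacity": 5,  "refill_rate": 0.5},
    ("free", "summarize"):     {"capacity": 10, "refill_rate": 1.0},
    ("premium", "web_search"): {"capacity": 50, "refill_rate": 5.0},
}
DEFAULT_LIMIT = {"capacity": 5, "refill_rate": 0.5}


def limit_for(tier: str, tool: str) -> dict:
    """Pick bucket parameters for this caller and tool, falling back to a default."""
    return LIMITS.get((tier, tool), DEFAULT_LIMIT)
```

The tool handler would call limit_for first and construct (or fetch) the matching bucket, so premium callers burst higher without any change to the limiting logic itself.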