LLM Cost Optimisation with Model Routing
Author: Venkata Sudhakar
Not every customer question needs your most powerful, most expensive LLM. A customer asking "What are your store hours?" needs a simple factual answer - routing that to GPT-4o is like sending a courier by private jet. A customer asking for a personalised financial plan comparing three loan products genuinely needs deeper reasoning.

Model routing analyses each incoming query and sends it to the cheapest model that can handle it well. For a business chatbot handling ten thousand queries per day, smart routing can reduce your LLM bill by 60-80% with no noticeable drop in quality.

The routing logic itself is a fast, cheap classification call: ask a small model to score the complexity of each query. Simple queries (greetings, FAQs, yes/no checks) go to a mini model. Medium queries (product comparisons, eligibility checks, short summaries) also go to a cost-effective model. Only genuinely complex queries (multi-step financial analysis, detailed complaint resolution, long document drafting) go to the premium model. You pay premium prices only for the small fraction of queries that truly needs it - typically 5-10% of total volume.

The example below shows a retail bank routing customer queries across model tiers, measuring the token cost of each query, and totalling the saving over a batch.
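A minimal sketch of the tiering logic in Python. In production the classifier would itself be a cheap small-model call; here a keyword heuristic stands in for it, and the model names and per-token prices are illustrative assumptions, not real pricing.

```python
# Tier-based model router - a sketch, not a production implementation.
# Per-token rates below are ASSUMED blended prices for illustration only.
PRICE_PER_TOKEN = {
    "mini":   0.0000003,   # cheap tier (assumed rate)
    "gpt-4o": 0.0000050,   # premium tier (assumed rate)
}

# Keyword signals stand in for a small-model complexity classifier.
COMPLEX_SIGNALS = ("invest", "portfolio", "financial plan", "complaint", "draft")
MEDIUM_SIGNALS = ("compare", "eligib", "joint account", "summar")

def classify(query: str) -> str:
    """Score query complexity: SIMPLE, MEDIUM, or COMPLEX."""
    q = query.lower()
    if any(s in q for s in COMPLEX_SIGNALS):
        return "COMPLEX"
    if any(s in q for s in MEDIUM_SIGNALS):
        return "MEDIUM"
    return "SIMPLE"

def route(query: str) -> str:
    """Only COMPLEX queries earn the premium model."""
    return "gpt-4o" if classify(query) == "COMPLEX" else "mini"

def query_cost(query: str, tokens: int) -> float:
    """Cost of serving this query at the routed model's assumed rate."""
    return tokens * PRICE_PER_TOKEN[route(query)]
```

Usage: `route("What are your store hours?")` returns `"mini"`, while an investment-planning question containing a complex signal is routed to `"gpt-4o"`. The key design point is that the classifier runs on every query, so it must be far cheaper than the cost difference it saves.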
Running six representative customer queries and comparing costs gives the following output:
Query                                                 Tier     Model    Tokens  Cost USD
----------------------------------------------------------------------------------------
What are PrimBank branch timings on Sundays?          SIMPLE   mini         95  $0.000029
How do I reset my net banking password?               SIMPLE   mini         88  $0.000026
What is the minimum balance for a savings account?    SIMPLE   mini        102  $0.000031
Can I open a joint account with my spouse online?     MEDIUM   mini        145  $0.000044
Compare your home loan and personal loan interest...  MEDIUM   mini        210  $0.000063
I have Rs 50 lakhs to invest - should I split...      COMPLEX  gpt-4o      320  $0.001600
----------------------------------------------------------------------------------------
Total with routing:    $0.00179
Total without routing: $0.00405
Cost saving:           56%
# 5 of 6 queries routed to the cheap model
# Only the complex investment question went to gpt-4o
# At 10,000 queries per day these per-query savings compound into a significant monthly reduction, and real-world responses with longer outputs widen the gap further
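The batch arithmetic behind a table like this is simple to reproduce. The sketch below uses assumed per-token rates (the same hypothetical mini and premium prices as before), so the exact percentage differs slightly from the figure above, but the shape of the calculation is the same: price each query at its routed model, then compare against pricing every query at the premium rate.

```python
# Back-of-envelope batch saving. Rates are ASSUMED for illustration:
# mini ~ $0.0000003/token, premium ~ $0.000005/token.
MINI, PREMIUM = 3e-7, 5e-6

# (token_count, routed_rate) for a hypothetical six-query batch -
# five routed cheap, one complex query routed premium.
batch = [(95, MINI), (88, MINI), (102, MINI),
         (145, MINI), (210, MINI), (320, PREMIUM)]

with_routing = sum(tokens * rate for tokens, rate in batch)
without_routing = sum(tokens for tokens, _ in batch) * PREMIUM
saving = 1 - with_routing / without_routing
print(f"saving: {saving:.0%}")
```

With these assumed rates the routed batch comes out roughly 60% cheaper, in the same range as the example output.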
Start by logging your actual query distribution for one week before building a routing system. You will typically find that 60-70% of queries are simple FAQs, 20-25% are medium complexity, and only 5-10% genuinely need the premium model. Build your routing classifier on that real data. Then combine routing with response caching: if the same FAQ is asked twenty times an hour, cache the first answer and serve the rest for free. Together, routing and caching reduce LLM costs by 75-85% for most high-volume business applications.
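The caching half can be sketched in a few lines. This is a minimal in-memory cache with a time-to-live, assuming a `fake_llm` stand-in for the real model call; a production system would add better query normalisation (or embedding-based matching) and a shared store such as Redis.

```python
# Minimal TTL response cache - a sketch; fake_llm stands in for an LLM call.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self.store = {}  # normalised query -> (answer, timestamp)

    def get_or_compute(self, query, answer_fn):
        """Return (answer, cache_hit). Hits within the TTL cost nothing."""
        key = " ".join(query.lower().split())  # crude normalisation
        hit = self.store.get(key)
        if hit is not None and time.time() - hit[1] < self.ttl:
            return hit[0], True
        answer = answer_fn(query)              # the only paid call
        self.store[key] = (answer, time.time())
        return answer, False

cache = TTLCache()
llm_calls = 0

def fake_llm(q):
    global llm_calls
    llm_calls += 1                             # count paid model calls
    return f"answer to: {q}"

# Twenty identical FAQ requests: only the first pays for a model call.
for _ in range(20):
    answer, was_cached = cache.get_or_compute("What are your branch timings?", fake_llm)
```

After the loop, `llm_calls` is 1 and the other nineteen requests were free cache hits, which is exactly the FAQ scenario described above.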