
LLM Context Window and Token Limits

Author: Venkata Sudhakar

The context window is the maximum amount of text an LLM can process in a single call - everything it can "see" at once, including your system prompt, conversation history, and the current user message. It is measured in tokens, not characters. A token is roughly 3-4 characters for English text, so a 128,000-token context window holds approximately 96,000 words or about 380 pages of text. Everything inside the context window is equally visible to the model. Everything outside it is completely invisible - the model has no access to it at all.

Token limits have two parts: the context window (total tokens the model can process) and max_tokens (the maximum length of the response you are requesting). The cost of an API call is proportional to total tokens used: input tokens (your prompt) plus output tokens (the response). Counting tokens before sending a request lets you avoid hitting limits mid-request and helps you optimise costs. OpenAI provides the tiktoken library for counting tokens locally without making an API call. A common mistake is confusing context window with max_tokens - you set max_tokens to limit the response length, not the total context.
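To make the distinction concrete, here is a minimal sketch of a request body. The parameter names follow the OpenAI chat completions API; the model name and message contents are only examples.

```python
# Sketch: max_tokens caps only the RESPONSE length; the context window
# caps input + response together. Request shape follows the OpenAI
# chat completions API; the model name here is an example.
request = {
    "model": "gpt-4o",  # 128,000-token context window
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise the attached report."},
    ],
    "max_tokens": 500,  # limit on the response, not the total context
}

# The constraint the model enforces is, roughly:
#   input tokens (messages) + max_tokens <= context window
print(request["max_tokens"])  # 500
```

Setting max_tokens to 500 does not shrink the context window; it only reserves up to 500 tokens of it for the reply.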

The example below shows how to count tokens, check whether a document fits in the context window, and chunk a large document that exceeds the limit.
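A minimal sketch of the counting and fit-check step, assuming the tiktoken package is installed; if it is not, the sketch falls back to a rough ~4-characters-per-token estimate, so the exact counts below are illustrative rather than fixed.

```python
# Count tokens locally and check whether a document fits the context
# window. Uses tiktoken when available; otherwise a rough heuristic.
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # rough heuristic for English text: ~4 characters per token
        return max(1, len(text) // 4)

def fits_in_context(document: str, system_prompt: str,
                    context_limit: int = 128_000,
                    reserve_for_reply: int = 1_000) -> bool:
    """True if input tokens plus the reserved reply fit the window."""
    total_input = count_tokens(document) + count_tokens(system_prompt)
    return total_input + reserve_for_reply <= context_limit

doc = "word " * 100            # stand-in for a real document
system = "Summarise the document."
print("Document tokens:     ", count_tokens(doc))
print("System prompt tokens:", count_tokens(system))
print("Fits in context:     ", fits_in_context(doc, system))
```

Counting before sending is cheap (no API call), so it is worth doing on every request that includes user-supplied documents.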


It gives the following output,

Document tokens:      2,500
System prompt tokens: 12
Total input tokens:   2,512
Context limit:        128,000
Fits in context:      True

# For a 600-page PDF (~150,000 tokens), it would NOT fit:
# Total input tokens:   150,012
# Fits in context:      False  <- must chunk the document
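A minimal sketch of the chunking and map-reduce step. The summarise and decode callables are placeholders: summarise stands in for a real LLM call (e.g. a chat completion request) and decode for your tokenizer's decoder (e.g. tiktoken's enc.decode); neither name is part of any library API.

```python
# Map-reduce summarisation sketch: split token ids into chunks,
# summarise each chunk (map), then summarise the concatenated
# summaries (reduce). `summarise` and `decode` are placeholders
# for a real LLM call and your tokenizer's decoder.

def chunk_tokens(token_ids, chunk_size=4000):
    """Split a token-id list into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

def map_reduce_summary(token_ids, decode, summarise, chunk_size=4000):
    chunks = chunk_tokens(token_ids, chunk_size)
    print(f"Split into {len(chunks)} chunks")
    partials = [summarise(decode(c)) for c in chunks]  # map: one summary per chunk
    return summarise("\n".join(partials))              # reduce: summary of summaries
```

With tiktoken, token_ids would be enc.encode(document) and decode would be enc.decode; an 18,500-token document splits into four 4,000-token chunks plus one of 2,500.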

Chunking and summarising the large document gives the following output,

Split into 5 chunks
  Chunk 1 summarised: 4000 tokens -> 87 tokens
  Chunk 2 summarised: 4000 tokens -> 91 tokens
  Chunk 3 summarised: 4000 tokens -> 83 tokens
  Chunk 4 summarised: 4000 tokens -> 79 tokens
  Chunk 5 summarised: 2500 tokens -> 76 tokens

Final summary: The document outlines a phased data migration approach...

Context window strategies for large content: for documents that almost fit, use a model with a larger context window (for example, gpt-4o at 128k). For documents that exceed any model's context, use the map-reduce pattern above - summarise the chunks separately, then summarise the summaries. For interactive use cases such as chatbots with long histories, use a sliding window that keeps only the last N messages, or summarise older messages into a running summary and keep only that summary plus the recent turns.
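The sliding-window-plus-running-summary strategy can be sketched as follows. The summarise callable is again a placeholder for a real LLM call, and the class and parameter names are illustrative, not from any library.

```python
# Sliding-window chat history sketch: keep the last N messages
# verbatim and fold older messages into a running summary.
# `summarise` is a placeholder for a real LLM call.
from collections import deque

class SlidingHistory:
    def __init__(self, summarise, keep_last=6):
        self.summarise = summarise
        self.keep_last = keep_last
        self.summary = ""        # running summary of dropped turns
        self.recent = deque()    # most recent messages, kept verbatim

    def add(self, role, content):
        self.recent.append({"role": role, "content": content})
        while len(self.recent) > self.keep_last:
            dropped = self.recent.popleft()
            # fold the dropped message into the running summary
            self.summary = self.summarise(
                f"{self.summary}\n{dropped['role']}: {dropped['content']}")

    def messages(self, system_prompt):
        """Build the message list actually sent with the next request."""
        msgs = [{"role": "system", "content": system_prompt}]
        if self.summary:
            msgs.append({"role": "system",
                         "content": "Summary of earlier conversation: "
                                    + self.summary})
        return msgs + list(self.recent)
```

This keeps the prompt size roughly constant no matter how long the conversation runs: old turns cost a few summary tokens instead of their full length.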
