
LLM Context Window and Token Limits

Author: Venkata Sudhakar

The context window is the maximum amount of text an LLM can process in a single call - everything it can "see" at once, including your system prompt, conversation history, and the current user message. It is measured in tokens, not characters. A token is roughly 3-4 characters for English text, so a 128,000-token context window holds approximately 96,000 words or about 380 pages of text. Everything inside the context window is equally visible to the model. Everything outside it is completely invisible - the model has no access to it at all.

Token limits have two parts: the context window (total tokens the model can process) and max_tokens (the maximum length of the response you are requesting). The cost of an API call is proportional to total tokens used: input tokens (your prompt) plus output tokens (the response). Counting tokens before sending a request lets you avoid hitting limits mid-request and helps you optimise costs. OpenAI provides the tiktoken library for counting tokens locally without making an API call. A common mistake is confusing context window with max_tokens - you set max_tokens to limit the response length, not the total context.
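To make the distinction concrete, here is a minimal sketch of a request body. The parameter names follow the OpenAI chat completions API; the model name and message contents are only examples.

```python
# Sketch: max_tokens caps only the RESPONSE length; the context window
# caps input + response together. Request shape follows the OpenAI
# chat completions API; the model name here is an example.
request = {
    "model": "gpt-4o",  # 128,000-token context window
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise the attached report."},
    ],
    "max_tokens": 500,  # limit on the response, not the total context
}

# The constraint the model enforces is, roughly:
#   input tokens (messages) + max_tokens <= context window
print(request["max_tokens"])  # 500
```

Setting max_tokens to 500 does not shrink the context window; it only reserves up to 500 tokens of it for the reply.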

The example below shows how to count tokens, check whether a document fits in the context window, and chunk a large document that exceeds the limit.
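A minimal sketch of the counting and fit-check step, assuming the tiktoken package is installed; if it is not, the sketch falls back to a rough ~4-characters-per-token estimate, so the exact counts below are illustrative rather than fixed.

```python
# Count tokens locally and check whether a document fits the context
# window. Uses tiktoken when available; otherwise a rough heuristic.
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # rough heuristic for English text: ~4 characters per token
        return max(1, len(text) // 4)

def fits_in_context(document: str, system_prompt: str,
                    context_limit: int = 128_000,
                    reserve_for_reply: int = 1_000) -> bool:
    """True if input tokens plus the reserved reply fit the window."""
    total_input = count_tokens(document) + count_tokens(system_prompt)
    return total_input + reserve_for_reply <= context_limit

doc = "word " * 100            # stand-in for a real document
system = "Summarise the document."
print("Document tokens:     ", count_tokens(doc))
print("System prompt tokens:", count_tokens(system))
print("Fits in context:     ", fits_in_context(doc, system))
```

Counting before sending is cheap (no API call), so it is worth doing on every request that includes user-supplied documents.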


It gives the following output,

Document tokens:      2,500
System prompt tokens: 12
Total input tokens:   2,512
Context limit:        128,000
Fits in context:      True

# For a 600-page PDF (~150,000 tokens), it would NOT fit:
# Total input tokens:   150,012
# Fits in context:      False  <- must chunk the document
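A minimal sketch of the chunking and map-reduce step. The summarise and decode callables are placeholders: summarise stands in for a real LLM call (e.g. a chat completion request) and decode for your tokenizer's decoder (e.g. tiktoken's enc.decode); neither name is part of any library API.

```python
# Map-reduce summarisation sketch: split token ids into chunks,
# summarise each chunk (map), then summarise the concatenated
# summaries (reduce). `summarise` and `decode` are placeholders
# for a real LLM call and your tokenizer's decoder.

def chunk_tokens(token_ids, chunk_size=4000):
    """Split a token-id list into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

def map_reduce_summary(token_ids, decode, summarise, chunk_size=4000):
    chunks = chunk_tokens(token_ids, chunk_size)
    print(f"Split into {len(chunks)} chunks")
    partials = [summarise(decode(c)) for c in chunks]  # map: one summary per chunk
    return summarise("\n".join(partials))              # reduce: summary of summaries
```

With tiktoken, token_ids would be enc.encode(document) and decode would be enc.decode; an 18,500-token document splits into four 4,000-token chunks plus one of 2,500.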

Chunking and summarising the large document gives the following output,

Split into 5 chunks
  Chunk 1 summarised: 4000 tokens -> 87 tokens
  Chunk 2 summarised: 4000 tokens -> 91 tokens
  Chunk 3 summarised: 4000 tokens -> 83 tokens
  Chunk 4 summarised: 4000 tokens -> 79 tokens
  Chunk 5 summarised: 2500 tokens -> 76 tokens

Final summary: The document outlines a phased data migration approach...

Context window strategies for large content: for documents that almost fit, use a model with a larger context window (for example, gpt-4o at 128k). For documents that exceed any model's context, use the map-reduce pattern above - summarise the chunks separately, then summarise the summaries. For interactive use cases such as chatbots with long histories, use a sliding window that keeps only the last N messages, or summarise older messages into a running summary and keep only that summary plus the recent turns.
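The sliding-window-plus-running-summary strategy can be sketched as follows. The summarise callable is again a placeholder for a real LLM call, and the class and parameter names are illustrative, not from any library.

```python
# Sliding-window chat history sketch: keep the last N messages
# verbatim and fold older messages into a running summary.
# `summarise` is a placeholder for a real LLM call.
from collections import deque

class SlidingHistory:
    def __init__(self, summarise, keep_last=6):
        self.summarise = summarise
        self.keep_last = keep_last
        self.summary = ""        # running summary of dropped turns
        self.recent = deque()    # most recent messages, kept verbatim

    def add(self, role, content):
        self.recent.append({"role": role, "content": content})
        while len(self.recent) > self.keep_last:
            dropped = self.recent.popleft()
            # fold the dropped message into the running summary
            self.summary = self.summarise(
                f"{self.summary}\n{dropped['role']}: {dropped['content']}")

    def messages(self, system_prompt):
        """Build the message list actually sent with the next request."""
        msgs = [{"role": "system", "content": system_prompt}]
        if self.summary:
            msgs.append({"role": "system",
                         "content": "Summary of earlier conversation: "
                                    + self.summary})
        return msgs + list(self.recent)
```

This keeps the prompt size roughly constant no matter how long the conversation runs: old turns cost a few summary tokens instead of their full length.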
