
LLM Evaluation Metrics - BLEU, ROUGE and Perplexity

Author: Venkata Sudhakar

Evaluating LLM output quality requires quantitative metrics that go beyond human judgment. BLEU, ROUGE, and Perplexity are the three most commonly used metrics in LLM benchmarking and fine-tuning pipelines. Understanding them helps you measure whether your model or prompt changes are actually improving output quality.

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference text - widely used for translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) focuses on recall of reference content - used for summarisation. Perplexity measures how confidently a model predicts the next token - lower perplexity means the model finds the text more predictable.
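BLEU's "n-gram overlap" can be made concrete with a short stdlib sketch of clipped n-gram precision, the quantity at BLEU's core (simplified here to a single reference with no brevity penalty; the example sentences are invented):

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision, the core of BLEU (simplified sketch:
    single reference, no brevity penalty, whitespace tokenisation)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a matching word cannot inflate the score.
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

# Invented example sentences:
print(ngram_precision("the cat sat on the mat", "the cat is on the mat", 1))  # 5/6
print(ngram_precision("the cat sat on the mat", "the cat is on the mat", 2))  # 3/5
```

Real BLEU combines these precisions for n = 1 to 4 (geometric mean) and applies a brevity penalty so that very short candidates are not rewarded.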

The example below computes all three metrics for ShopMax India product description generation outputs, using the sacrebleu and rouge-score libraries.


It gives the following output:

BLEU score: 42.18
rouge1 - P: 0.812, R: 0.867, F1: 0.839
rouge2 - P: 0.636, R: 0.688, F1: 0.661
rougeL - P: 0.750, R: 0.800, F1: 0.774
Perplexity: 1.48

As rough rules of thumb, a BLEU score above 40 indicates high-quality translation or generation, ROUGE F1 above 0.7 is considered good for summarisation tasks, and perplexity below 10 on domain-specific text suggests the model covers that vocabulary well. In production, combine all three metrics with human evaluation for a complete picture of LLM quality before deploying changes to your ShopMax AI assistant.
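These rules of thumb can be sketched as a simple pre-deployment check. The function name and exact thresholds here are illustrative only, mirroring the guidance above rather than any real ShopMax codebase:

```python
# Illustrative quality gate; thresholds mirror the rules of thumb above
# and the function name is hypothetical, not a real API.
def passes_quality_gate(bleu: float, rouge_f1: float, perplexity: float) -> bool:
    return bleu > 40 and rouge_f1 > 0.7 and perplexity < 10

# The sample run above (BLEU 42.18, ROUGE-1 F1 0.839, perplexity 1.48) passes:
print(passes_quality_gate(42.18, 0.839, 1.48))  # True
```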

