
LLM Evaluation Metrics - BLEU, ROUGE and Perplexity

Author: Venkata Sudhakar

Evaluating LLM output quality requires quantitative metrics that go beyond human judgment. BLEU, ROUGE, and Perplexity are the three most commonly used metrics in LLM benchmarking and fine-tuning pipelines. Understanding them helps you measure whether your model or prompt changes are actually improving output quality.

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference text - widely used for translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) focuses on recall of reference content - used for summarisation. Perplexity measures how confidently a model predicts the next token - lower perplexity means the model finds the text more predictable.
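BLEU's "n-gram overlap" can be made concrete with a short stdlib sketch of clipped n-gram precision, the quantity at BLEU's core (simplified here to a single reference with no brevity penalty; the example sentences are invented):

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision, the core of BLEU (simplified sketch:
    single reference, no brevity penalty, whitespace tokenisation)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a matching word cannot inflate the score.
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

# Invented example sentences:
print(ngram_precision("the cat sat on the mat", "the cat is on the mat", 1))  # 5/6
print(ngram_precision("the cat sat on the mat", "the cat is on the mat", 2))  # 3/5
```

Real BLEU combines these precisions for n = 1 to 4 (geometric mean) and applies a brevity penalty so that very short candidates are not rewarded.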

The example below computes all three metrics for ShopMax India product description generation outputs, using the sacrebleu and rouge-score libraries.


It gives the following output:

BLEU score: 42.18
rouge1 - P: 0.812, R: 0.867, F1: 0.839
rouge2 - P: 0.636, R: 0.688, F1: 0.661
rougeL - P: 0.750, R: 0.800, F1: 0.774
Perplexity: 1.48

As rough rules of thumb, a BLEU score above 40 indicates high-quality translation or generation, ROUGE F1 above 0.7 is considered good for summarisation tasks, and perplexity below 10 on domain-specific text suggests the model covers that vocabulary well. In production, combine all three metrics with human evaluation for a complete picture of LLM quality before deploying changes to your ShopMax AI assistant.
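These rules of thumb can be sketched as a simple pre-deployment check. The function name and exact thresholds here are illustrative only, mirroring the guidance above rather than any real ShopMax codebase:

```python
# Illustrative quality gate; thresholds mirror the rules of thumb above
# and the function name is hypothetical, not a real API.
def passes_quality_gate(bleu: float, rouge_f1: float, perplexity: float) -> bool:
    return bleu > 40 and rouge_f1 > 0.7 and perplexity < 10

# The sample run above (BLEU 42.18, ROUGE-1 F1 0.839, perplexity 1.48) passes:
print(passes_quality_gate(42.18, 0.839, 1.48))  # True
```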

