
Evaluating LLM Output Quality with RAGAS

Author: Venkata Sudhakar

ShopMax India's RAG-based product assistant retrieves documents and generates answers, but how do you know if the answers are actually correct? RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automatically evaluates RAG pipelines on four key metrics: faithfulness (does the answer stick to the retrieved context?), answer relevancy (does the answer address the question?), context precision (are retrieved chunks actually relevant?), and context recall (were all relevant chunks retrieved?). ShopMax India uses RAGAS to catch regressions when prompts or retrieval logic changes.

RAGAS takes three inputs for each test question: the question, the generated answer, and the retrieved contexts. It then runs LLM-based evaluations to score each metric from 0 to 1. Faithfulness checks whether every claim in the answer can be traced back to the retrieved context. Answer relevancy embeds the question and answer and measures their semantic similarity. Context precision checks whether relevant chunks are ranked above irrelevant ones in the retrieval results. Context recall compares the retrieved chunks against a ground truth answer to verify nothing relevant was missed. The evaluate() function returns a result with scores for every metric across all test questions.
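As a toy illustration of the similarity step behind answer relevancy, the sketch below computes cosine similarity between two small vectors. The vectors are made up for illustration; RAGAS itself uses a real embedding model, and its full metric also generates candidate questions from the answer before comparing.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" of a question and an answer; close vectors
# mean the answer is semantically on-topic for the question.
q_vec = [0.2, 0.8, 0.1]
a_vec = [0.25, 0.75, 0.05]
print(round(cosine_similarity(q_vec, a_vec), 3))
```

A score near 1.0 indicates the answer addresses the question; scores near 0 flag off-topic answers.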

The example below runs RAGAS evaluation on ShopMax India product Q&A. It defines three test questions with ground truth answers and retrieved context, then evaluates the RAG pipeline and prints per-question metric scores.
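A minimal sketch of such an evaluation script, assuming the classic RAGAS Python API (evaluate plus the four metric objects) and hypothetical ShopMax India product data for the answers, contexts, and ground truths. The scoring step calls an LLM judge, so it is guarded behind a RUN_RAGAS_EVAL environment flag (an assumption of this sketch, not a RAGAS feature):

```python
import os

# Hypothetical ShopMax India test set: question, generated answer,
# retrieved contexts, and a ground-truth reference answer per row.
eval_data = {
    "question": [
        "What is the price of Samsung 65 QLED TV?",
        "Does OnePlus 11 support fast charging?",
        "Which city is the Sony Bravia 55 available in?",
    ],
    "answer": [
        "The Samsung 65 QLED TV is priced at Rs. 1,29,999.",
        "Yes, the OnePlus 11 supports 100W fast charging.",
        "The Sony Bravia 55 is available in Mumbai and ships nationwide.",
    ],
    "contexts": [
        ["Samsung 65 QLED TV - price: Rs. 1,29,999, 4K, 120Hz panel."],
        ["OnePlus 11 - 100W SUPERVOOC fast charging, 5000mAh battery."],
        ["Sony Bravia 55 - in stock at the Mumbai warehouse."],
    ],
    "ground_truth": [
        "Rs. 1,29,999.",
        "Yes, it supports 100W fast charging.",
        "It is available in Mumbai.",
    ],
}

# Scoring calls an LLM judge and needs API credentials, so only run
# the evaluation when explicitly enabled.
if os.environ.get("RUN_RAGAS_EVAL"):
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    result = evaluate(
        Dataset.from_dict(eval_data),
        metrics=[faithfulness, answer_relevancy,
                 context_precision, context_recall],
    )
    df = result.to_pandas()  # one row per question, one column per metric
    for _, row in df.iterrows():
        print(f"Q: {row['question']}")
        print(
            f"  Faithfulness: {row['faithfulness']:.2f} | "
            f"Relevancy: {row['answer_relevancy']:.2f} | "
            f"Precision: {row['context_precision']:.2f} | "
            f"Recall: {row['context_recall']:.2f}"
        )
```

Note the third answer adds a claim ("ships nationwide") that is not in its retrieved context, which is the kind of statement faithfulness penalizes.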


Running the evaluation produces output similar to the following:

RAGAS Evaluation Results for ShopMax India RAG Pipeline:
------------------------------------------------------------
Q: What is the price of Samsung 65 QLED TV?
  Faithfulness: 1.0 | Relevancy: 0.96 | Precision: 1.0 | Recall: 1.0
Q: Does OnePlus 11 support fast charging?
  Faithfulness: 1.0 | Relevancy: 0.94 | Precision: 1.0 | Recall: 1.0
Q: Which city is the Sony Bravia 55 available in?
  Faithfulness: 0.5 | Relevancy: 0.89 | Precision: 1.0 | Recall: 1.0

Overall averages:
  Faithfulness:     0.833
  Answer Relevancy: 0.93
  Context Precision: 1.0
  Context Recall:   1.0

In production, build a RAGAS test suite of 100+ ShopMax India golden questions covering every product category and run it automatically in CI on every RAG pipeline change. A faithfulness score below 0.8 signals hallucination: the LLM is adding facts not present in the retrieved context. A context precision score below 0.7 signals the retriever is returning irrelevant chunks; tune the chunk size or embedding model. Use ragas.testset.generate to automatically generate test questions from your document corpus, so the test suite grows with your product catalog without manual labeling.
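A simple CI gate along these lines could look as follows; check_regression and the threshold values are illustrative, not part of RAGAS:

```python
# Illustrative CI gate: fail the build when averaged RAGAS scores
# fall below the regression thresholds discussed above.
THRESHOLDS = {
    "faithfulness": 0.8,       # below this: likely hallucination
    "context_precision": 0.7,  # below this: retriever returning noise
}

def check_regression(avg_scores):
    """Return the names of metrics that fell below their threshold."""
    return [m for m, t in THRESHOLDS.items() if avg_scores.get(m, 0.0) < t]

# Averages from the run above: 0.833 faithfulness, 1.0 precision -> passes.
failures = check_regression({"faithfulness": 0.833, "context_precision": 1.0})
if failures:
    raise SystemExit(f"RAGAS regression in: {failures}")
```

Wiring this after the evaluation step makes every prompt or retriever change answerable to the same golden-question baseline.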
