RAG Evaluation with RAGAS
Author: Venkata Sudhakar
Building a RAG pipeline is only half the work; measuring whether it actually works correctly is equally important. RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that evaluates RAG pipelines across four key metrics: faithfulness, answer relevancy, context precision, and context recall. Without evaluation, you cannot confidently improve or deploy a RAG system.

Faithfulness checks whether the generated answer is factually consistent with the retrieved context. Answer relevancy measures how well the answer addresses the question. Context precision checks whether the retrieved chunks are actually useful for answering it. Context recall measures whether all the relevant information was retrieved. Together, these four metrics give a complete picture of RAG quality.

The example below evaluates a ShopMax India product FAQ RAG pipeline with RAGAS, measuring all four metrics on a sample question-answer dataset.
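A minimal sketch of such an evaluation is shown here, assuming the ragas 0.1.x API (`evaluate()` plus importable metric objects) together with the Hugging Face `datasets` library. The ShopMax questions, answers, and contexts are illustrative placeholders, not real data.

```python
# Sketch of a RAGAS evaluation over a small ShopMax FAQ sample.
# The sample rows below are illustrative only; a real test set
# should contain 50-100 labelled questions.
eval_samples = {
    "question": [
        "What is the return window for electronics on ShopMax?",
        "Does ShopMax offer cash on delivery?",
    ],
    "answer": [
        "Electronics can be returned within 10 days of delivery.",
        "Yes, cash on delivery is available for orders under Rs. 50,000.",
    ],
    "contexts": [
        ["ShopMax allows returns of electronics within 10 days of delivery."],
        ["Cash on delivery is supported for orders below Rs. 50,000."],
    ],
    "ground_truth": [
        "Electronics have a 10-day return window.",
        "Cash on delivery is available for orders under Rs. 50,000.",
    ],
}


def run_ragas_eval(samples: dict) -> dict:
    """Score the samples on all four RAGAS metrics.

    Requires the ragas and datasets packages plus an LLM API key
    (RAGAS uses an LLM as judge), so imports are kept local here.
    """
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )

    dataset = Dataset.from_dict(samples)
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy,
                 context_precision, context_recall],
    )
    return result  # dict-like mapping of metric name -> score


# To run the evaluation (needs network access and an API key):
#   scores = run_ragas_eval(eval_samples)
```

Each row must supply the question, the pipeline's generated answer, the retrieved context chunks, and a human-written ground-truth answer; RAGAS derives all four metrics from these fields.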
Running the evaluation produces output similar to the following:
RAGAS Evaluation Results:
faithfulness: 0.917
answer_relevancy: 0.894
context_precision: 0.875
context_recall: 0.833
Scores above 0.8 across all metrics indicate a well-functioning RAG pipeline. A low faithfulness score (below 0.7) indicates hallucination - the model is adding facts not present in the context. A low context recall score indicates chunking or retrieval problems. Run RAGAS evaluations on a labelled test set of 50 to 100 questions before deploying any ShopMax RAG system to production, and re-evaluate after every significant change to your chunking strategy, embedding model, or prompt.
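The interpretation rules above can be turned into a simple pre-deployment gate. The sketch below encodes the thresholds described in this section (0.8 for most metrics, 0.7 as the hallucination floor for faithfulness); the threshold values and the example scores are illustrative, not part of the RAGAS library.

```python
# Pre-deployment quality gate on RAGAS scores, using the thresholds
# discussed above. Thresholds are illustrative defaults, not RAGAS built-ins.
THRESHOLDS = {
    "faithfulness": 0.7,       # below this: the model is hallucinating
    "answer_relevancy": 0.8,
    "context_precision": 0.8,
    "context_recall": 0.8,     # below this: chunking/retrieval problems
}


def passes_gate(scores: dict) -> tuple[bool, list]:
    """Return (ok, failures), where failures lists metrics under threshold."""
    failures = [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t]
    return not failures, failures


# Example using the scores from the evaluation output above.
scores = {
    "faithfulness": 0.917,
    "answer_relevancy": 0.894,
    "context_precision": 0.875,
    "context_recall": 0.833,
}
ok, failures = passes_gate(scores)
print("deploy" if ok else f"blocked by: {failures}")  # -> deploy
```

Re-running this gate after every chunking, embedding, or prompt change makes regressions visible before they reach production.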