
Evaluating LLM Output Quality with RAGAS

Author: Venkata Sudhakar

ShopMax India's RAG-based product assistant retrieves documents and generates answers, but how do you know if the answers are actually correct? RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automatically evaluates RAG pipelines on four key metrics: faithfulness (does the answer stick to the retrieved context?), answer relevancy (does the answer address the question?), context precision (are retrieved chunks actually relevant?), and context recall (were all relevant chunks retrieved?). ShopMax India uses RAGAS to catch regressions when prompts or retrieval logic changes.

RAGAS takes three inputs for each test question: the question, the generated answer, and the retrieved contexts. It then runs LLM-based evaluations to score each metric from 0 to 1. Faithfulness checks whether every claim in the answer can be traced back to the retrieved context. Answer relevancy embeds the question and answer and measures their semantic similarity. Context precision checks whether relevant chunks are ranked above irrelevant ones in the retrieval results. Context recall compares the retrieved chunks against a ground truth answer to verify nothing relevant was missed. The evaluate() function returns a result with scores for every metric across all test questions.
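As a toy illustration of the similarity step behind answer relevancy, the sketch below computes cosine similarity between two small vectors. The vectors are made up for illustration; RAGAS itself uses a real embedding model, and its full metric also generates candidate questions from the answer before comparing.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" of a question and an answer; close vectors
# mean the answer is semantically on-topic for the question.
q_vec = [0.2, 0.8, 0.1]
a_vec = [0.25, 0.75, 0.05]
print(round(cosine_similarity(q_vec, a_vec), 3))
```

A score near 1.0 indicates the answer addresses the question; scores near 0 flag off-topic answers.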

The example below runs RAGAS evaluation on ShopMax India product Q&A. It defines three test questions with ground truth answers and retrieved context, then evaluates the RAG pipeline and prints per-question metric scores.
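A minimal sketch of such an evaluation script, assuming the classic RAGAS Python API (evaluate plus the four metric objects) and hypothetical ShopMax India product data for the answers, contexts, and ground truths. The scoring step calls an LLM judge, so it is guarded behind a RUN_RAGAS_EVAL environment flag (an assumption of this sketch, not a RAGAS feature):

```python
import os

# Hypothetical ShopMax India test set: question, generated answer,
# retrieved contexts, and a ground-truth reference answer per row.
eval_data = {
    "question": [
        "What is the price of Samsung 65 QLED TV?",
        "Does OnePlus 11 support fast charging?",
        "Which city is the Sony Bravia 55 available in?",
    ],
    "answer": [
        "The Samsung 65 QLED TV is priced at Rs. 1,29,999.",
        "Yes, the OnePlus 11 supports 100W fast charging.",
        "The Sony Bravia 55 is available in Mumbai and ships nationwide.",
    ],
    "contexts": [
        ["Samsung 65 QLED TV - price: Rs. 1,29,999, 4K, 120Hz panel."],
        ["OnePlus 11 - 100W SUPERVOOC fast charging, 5000mAh battery."],
        ["Sony Bravia 55 - in stock at the Mumbai warehouse."],
    ],
    "ground_truth": [
        "Rs. 1,29,999.",
        "Yes, it supports 100W fast charging.",
        "It is available in Mumbai.",
    ],
}

# Scoring calls an LLM judge and needs API credentials, so only run
# the evaluation when explicitly enabled.
if os.environ.get("RUN_RAGAS_EVAL"):
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    result = evaluate(
        Dataset.from_dict(eval_data),
        metrics=[faithfulness, answer_relevancy,
                 context_precision, context_recall],
    )
    df = result.to_pandas()  # one row per question, one column per metric
    for _, row in df.iterrows():
        print(f"Q: {row['question']}")
        print(
            f"  Faithfulness: {row['faithfulness']:.2f} | "
            f"Relevancy: {row['answer_relevancy']:.2f} | "
            f"Precision: {row['context_precision']:.2f} | "
            f"Recall: {row['context_recall']:.2f}"
        )
```

Note the third answer adds a claim ("ships nationwide") that is not in its retrieved context, which is the kind of statement faithfulness penalizes.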


Running the evaluation produces output similar to the following:

RAGAS Evaluation Results for ShopMax India RAG Pipeline:
------------------------------------------------------------
Q: What is the price of Samsung 65 QLED TV?
  Faithfulness: 1.0 | Relevancy: 0.96 | Precision: 1.0 | Recall: 1.0
Q: Does OnePlus 11 support fast charging?
  Faithfulness: 1.0 | Relevancy: 0.94 | Precision: 1.0 | Recall: 1.0
Q: Which city is the Sony Bravia 55 available in?
  Faithfulness: 0.5 | Relevancy: 0.89 | Precision: 1.0 | Recall: 1.0

Overall averages:
  Faithfulness:     0.833
  Answer Relevancy: 0.93
  Context Precision: 1.0
  Context Recall:   1.0

In production, build a RAGAS test suite of 100+ ShopMax India golden questions covering every product category and run it automatically in CI on every RAG pipeline change. A faithfulness score below 0.8 signals hallucination: the LLM is adding facts not present in the retrieved context. A context precision score below 0.7 signals the retriever is returning irrelevant chunks; tune the chunk size or embedding model. Use ragas.testset.generate to automatically generate test questions from your document corpus, so the test suite grows with your product catalog without manual labeling.
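A simple CI gate along these lines could look as follows; check_regression and the threshold values are illustrative, not part of RAGAS:

```python
# Illustrative CI gate: fail the build when averaged RAGAS scores
# fall below the regression thresholds discussed above.
THRESHOLDS = {
    "faithfulness": 0.8,       # below this: likely hallucination
    "context_precision": 0.7,  # below this: retriever returning noise
}

def check_regression(avg_scores):
    """Return the names of metrics that fell below their threshold."""
    return [m for m, t in THRESHOLDS.items() if avg_scores.get(m, 0.0) < t]

# Averages from the run above: 0.833 faithfulness, 1.0 precision -> passes.
failures = check_regression({"faithfulness": 0.833, "context_precision": 1.0})
if failures:
    raise SystemExit(f"RAGAS regression in: {failures}")
```

Wiring this after the evaluation step makes every prompt or retriever change answerable to the same golden-question baseline.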
