Graph RAG Evaluation - Measuring Multi-hop Retrieval Quality
Author: Venkata Sudhakar
ShopMax India's AI team needs to measure whether Graph RAG actually improves answer quality over standard vector RAG for their use cases. The RAGAS framework provides metrics for faithfulness, answer relevancy, and context recall that work for both vector and graph RAG pipelines, letting teams compare approaches objectively before committing to a more complex graph infrastructure.
Graph RAG evaluation needs metrics beyond standard RAG: entity recall (did the graph retrieve the right entities?), relationship coverage (were relevant edges traversed?), and multi-hop accuracy (can the pipeline answer questions requiring 2 or more hops?). RAGAS provides faithfulness and context recall out of the box. You add entity-level metrics by comparing extracted entities in the response against a ground-truth entity set.
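The entity-level comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of RAGAS: it assumes you have already extracted entity names from the response (via NER or a graph lookup) and maintain a ground-truth entity set per question.

```python
def entity_recall(response_entities: set[str], ground_truth_entities: set[str]) -> float:
    """Fraction of ground-truth entities that appear in the response.

    Entity names are compared case-insensitively; a more robust version
    would also normalize aliases and abbreviations.
    """
    if not ground_truth_entities:
        return 1.0  # nothing required, trivially satisfied
    found = {e.lower() for e in response_entities}
    required = {e.lower() for e in ground_truth_entities}
    return len(required & found) / len(required)

# Illustrative: the response mentioned 2 of the 3 ground-truth entities.
score = entity_recall(
    {"PowerCell Industries", "VoltMax X2"},
    {"PowerCell Industries", "VoltMax X2", "Pune"},
)
print(round(score, 2))  # 0.67
```

The same shape works for relationship coverage: replace entity names with (head, relation, tail) triples and compare the traversed edges against a ground-truth edge set.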
The example below shows how ShopMax India evaluates its Graph RAG pipeline using RAGAS metrics on supplier and warranty query test cases.
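A hedged sketch of such an evaluation follows. The questions, answers, and graph contexts are illustrative placeholders standing in for ShopMax India's real eval set, and the column names follow the classic RAGAS dataset schema (question/answer/contexts/ground_truth); newer RAGAS releases rename these to user_input/response/retrieved_contexts/reference.

```python
# Illustrative eval rows: each pairs a question with the pipeline's answer,
# the graph context it retrieved, and a human-written reference answer.
# All product and supplier names here are invented examples.
eval_rows = {
    "question": [
        "Which supplier provides the battery for the VoltMax X2 phone?",
        "How long is the warranty on the AquaPure countertop filter?",
    ],
    "answer": [
        "The VoltMax X2 battery is supplied by PowerCell Industries, Pune.",
        "The AquaPure countertop filter carries a 2-year warranty.",
    ],
    "contexts": [
        ["(VoltMax X2)-[:HAS_COMPONENT]->(Battery)<-[:SUPPLIES]-(PowerCell Industries, Pune)"],
        ["(AquaPure countertop filter)-[:HAS_WARRANTY]->(2-year warranty)"],
    ],
    "ground_truth": [
        "PowerCell Industries in Pune supplies the VoltMax X2 battery.",
        "The AquaPure countertop filter has a 2-year warranty.",
    ],
}


def score_with_ragas(rows: dict) -> dict:
    """Score the eval rows with RAGAS.

    Requires the ragas and datasets packages plus a configured LLM
    backend (e.g. OPENAI_API_KEY), so imports are kept local to the call.
    """
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_recall, faithfulness

    return evaluate(
        Dataset.from_dict(eval_rows if rows is None else rows),
        metrics=[faithfulness, answer_relevancy, context_recall],
    )
```

Calling `score_with_ragas(eval_rows)` with credentials in place returns a per-metric score dict.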
It produces output like the following:
{'faithfulness': 0.94, 'answer_relevancy': 0.91, 'context_recall': 0.88}
Faithfulness 0.94: Answers are well-grounded in retrieved graph context
Answer Relevancy 0.91: Responses directly address the questions asked
Context Recall 0.88: Most ground-truth facts are present in the retrieved graph context
Run evaluation on 20 to 50 representative questions covering single-hop (one relationship) and multi-hop (two or more relationships) queries. For ShopMax India, build separate eval sets for supplier queries, warranty queries, and product comparison queries. A faithfulness score below 0.80 indicates hallucination - tighten your system prompt. Context recall below 0.75 means the graph traversal is missing relevant nodes - review your Cypher or GQL query generation prompts.
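The thresholds above can be enforced as a simple gate over the RAGAS result dict, for example in a CI job that blocks pipeline changes when quality regresses. A minimal sketch, using the 0.80 and 0.75 floors from the text:

```python
# Alert floors taken from the guidance above; answer_relevancy is left
# ungated here, but adding it is a one-line change.
THRESHOLDS = {"faithfulness": 0.80, "context_recall": 0.75}


def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Return the gated metrics whose score falls below its floor.

    A missing metric is treated as 0.0, i.e. as a failure.
    """
    return [m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor]


scores = {"faithfulness": 0.94, "answer_relevancy": 0.91, "context_recall": 0.88}
print(failing_metrics(scores))  # [] -- both gated metrics pass
```

An empty list means the run clears both floors; a non-empty list names the metrics to investigate (system prompt for faithfulness, traversal-query prompts for context recall).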