Graph RAG Evaluation - Measuring Multi-hop Retrieval Quality
Author: Venkata Sudhakar
ShopMax India's AI team needs to measure whether Graph RAG actually improves answer quality over standard vector RAG for their use cases. The RAGAS framework provides metrics for faithfulness, answer relevancy, and context recall that work for both vector and graph RAG pipelines, letting teams compare approaches objectively before committing to a more complex graph infrastructure.
Graph RAG evaluation needs metrics beyond standard RAG: entity recall (did the graph retrieve the right entities?), relationship coverage (were relevant edges traversed?), and multi-hop accuracy (can the pipeline answer questions requiring 2 or more hops?). RAGAS provides faithfulness and context recall out of the box. You add entity-level metrics by comparing extracted entities in the response against a ground-truth entity set.
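The entity-level comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of RAGAS: it assumes you have already extracted entity names from the response (via NER or a graph lookup) and maintain a ground-truth entity set per question.

```python
def entity_recall(response_entities: set[str], ground_truth_entities: set[str]) -> float:
    """Fraction of ground-truth entities that appear in the response.

    Entity names are compared case-insensitively; a more robust version
    would also normalize aliases and abbreviations.
    """
    if not ground_truth_entities:
        return 1.0  # nothing required, trivially satisfied
    found = {e.lower() for e in response_entities}
    required = {e.lower() for e in ground_truth_entities}
    return len(required & found) / len(required)

# Illustrative: the response mentioned 2 of the 3 ground-truth entities.
score = entity_recall(
    {"PowerCell Industries", "VoltMax X2"},
    {"PowerCell Industries", "VoltMax X2", "Pune"},
)
print(round(score, 2))  # 0.67
```

The same shape works for relationship coverage: replace entity names with (head, relation, tail) triples and compare the traversed edges against a ground-truth edge set.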
The example below shows how ShopMax India evaluates its Graph RAG pipeline using RAGAS metrics on supplier and warranty query test cases.
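A hedged sketch of such an evaluation follows. The questions, answers, and graph contexts are illustrative placeholders standing in for ShopMax India's real eval set, and the column names follow the classic RAGAS dataset schema (question/answer/contexts/ground_truth); newer RAGAS releases rename these to user_input/response/retrieved_contexts/reference.

```python
# Illustrative eval rows: each pairs a question with the pipeline's answer,
# the graph context it retrieved, and a human-written reference answer.
# All product and supplier names here are invented examples.
eval_rows = {
    "question": [
        "Which supplier provides the battery for the VoltMax X2 phone?",
        "How long is the warranty on the AquaPure countertop filter?",
    ],
    "answer": [
        "The VoltMax X2 battery is supplied by PowerCell Industries, Pune.",
        "The AquaPure countertop filter carries a 2-year warranty.",
    ],
    "contexts": [
        ["(VoltMax X2)-[:HAS_COMPONENT]->(Battery)<-[:SUPPLIES]-(PowerCell Industries, Pune)"],
        ["(AquaPure countertop filter)-[:HAS_WARRANTY]->(2-year warranty)"],
    ],
    "ground_truth": [
        "PowerCell Industries in Pune supplies the VoltMax X2 battery.",
        "The AquaPure countertop filter has a 2-year warranty.",
    ],
}


def score_with_ragas(rows: dict) -> dict:
    """Score the eval rows with RAGAS.

    Requires the ragas and datasets packages plus a configured LLM
    backend (e.g. OPENAI_API_KEY), so imports are kept local to the call.
    """
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_recall, faithfulness

    return evaluate(
        Dataset.from_dict(eval_rows if rows is None else rows),
        metrics=[faithfulness, answer_relevancy, context_recall],
    )
```

Calling `score_with_ragas(eval_rows)` with credentials in place returns a per-metric score dict.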
It produces output like the following:
{'faithfulness': 0.94, 'answer_relevancy': 0.91, 'context_recall': 0.88}
Faithfulness 0.94: Answers are well-grounded in retrieved graph context
Answer Relevancy 0.91: Responses directly address the questions asked
Context Recall 0.88: Most ground-truth facts are present in the retrieved graph context
Run evaluation on 20 to 50 representative questions covering single-hop (one relationship) and multi-hop (two or more relationships) queries. For ShopMax India, build separate eval sets for supplier queries, warranty queries, and product comparison queries. A faithfulness score below 0.80 indicates hallucination - tighten your system prompt. Context recall below 0.75 means the graph traversal is missing relevant nodes - review your Cypher or GQL query generation prompts.
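The thresholds above can be enforced as a simple gate over the RAGAS result dict, for example in a CI job that blocks pipeline changes when quality regresses. A minimal sketch, using the 0.80 and 0.75 floors from the text:

```python
# Alert floors taken from the guidance above; answer_relevancy is left
# ungated here, but adding it is a one-line change.
THRESHOLDS = {"faithfulness": 0.80, "context_recall": 0.75}


def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Return the gated metrics whose score falls below its floor.

    A missing metric is treated as 0.0, i.e. as a failure.
    """
    return [m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor]


scores = {"faithfulness": 0.94, "answer_relevancy": 0.91, "context_recall": 0.88}
print(failing_metrics(scores))  # [] -- both gated metrics pass
```

An empty list means the run clears both floors; a non-empty list names the metrics to investigate (system prompt for faithfulness, traversal-query prompts for context recall).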