Detecting LLM Hallucinations in Production with DeepEval
Author: Venkata Sudhakar
ShopMax India's AI chatbot handles sensitive queries: return requests, payment issues, and personal order details. If the LLM hallucinates an incorrect return policy or generates a toxic response to a frustrated customer, it damages trust and can create legal liability. DeepEval is an open-source LLM testing framework that runs automated checks on every LLM response for hallucination, toxicity, bias, and answer correctness. ShopMax India uses DeepEval in CI to gate every deployment: if any metric breaches its threshold, the pipeline fails and the team is alerted before the bad model reaches production customers.
DeepEval works by defining test cases with an input, an actual_output, and optionally an expected_output and retrieval_context. Each test case is scored by one or more metrics: HallucinationMetric (checks whether the output contradicts the supplied context), ToxicityMetric (checks for harmful language), AnswerRelevancyMetric (checks topical relevance), and GEval (a customizable LLM-as-judge metric). The assert_test() function runs all metrics and raises an assertion error if any metric fails its threshold - a minimum score for metrics like relevancy, a maximum for hallucination and toxicity - making it CI-compatible.
The example below runs DeepEval checks on three ShopMax India chatbot responses covering a product query, a return policy question, and a stress-test case where the LLM might hallucinate. It prints pass/fail status and scores for each metric.
It produces output like the following:
DeepEval Results for ShopMax India Chatbot:
-------------------------------------------------------
Test 1: What is the warranty period for Samsung TVs at...
Hallucination: PASS (0.0)
Relevancy: PASS (0.96)
Toxicity: PASS (0.0)
Test 2: Can I return a product bought in Mumbai at a De...
Hallucination: FAIL (0.8)
Relevancy: PASS (0.88)
Toxicity: PASS (0.0)
Test 3: I am angry my order is late, this service is te...
Hallucination: PASS (0.1)
Relevancy: PASS (0.91)
Toxicity: PASS (0.02)
Test 2 fails hallucination because the model claimed cross-city returns are allowed when the context explicitly states otherwise - this is exactly the regression DeepEval is designed to catch. In production, integrate DeepEval with pytest: use the @pytest.mark.parametrize decorator to run the full test suite and call assert_test(test_case, metrics) in the test body so the CI pipeline fails on regressions. For ShopMax India, maintain a golden dataset of 200+ test cases covering return policies, product specs, and edge cases like frustrated customer inputs, and run the suite on every model or prompt change before releasing to production.