RAGChecker: A Fine-grained Framework for Diagnosing RAG Systems

Introduction

RAGChecker is an automatic evaluation framework from Amazon Science for assessing and diagnosing Retrieval-Augmented Generation (RAG) systems. It provides a suite of metrics and tools for analyzing RAG performance and for locating issues in both the retrieval and generation components. Evaluation is fine-grained: the framework applies entailment checks at the level of individual claims, so the results indicate where targeted improvements are needed.

Installation

Install RAGChecker with pip, then download the spaCy model it requires:

pip install ragchecker
python -m spacy download en_core_web_sm

Examples

RAGChecker supports both a command-line interface (CLI) and a Python API for evaluating your RAG systems.

CLI Example

First, prepare your data as JSON in the format shown below. The only annotation required for each query is gt_answer:

{
  "results": [
    {
      "query_id": "<query id>",
      "query": "<input query>",
      "gt_answer": "<ground truth answer>",
      "response": "<response generated by the RAG generator>",
      "retrieved_context": [
        {
          "doc_id": "<doc id>",
          "text": "<content of the chunk>"
        }
      ]
    }
  ]
}
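
The input layout above can be assembled and sanity-checked before running the pipeline. The following is a minimal sketch using only the Python standard library; the helper names make_result and find_problems are hypothetical and are not part of RAGChecker, and the check simply assumes that every field shown in the example should be present.

```python
import json

# Keys taken from the example layout above. Treating all of them as
# mandatory is an assumption of this sketch, not a rule stated by RAGChecker.
EXPECTED_KEYS = ("query_id", "query", "gt_answer", "response", "retrieved_context")


def make_result(query_id, query, gt_answer, response, chunks):
    """Build one entry of the "results" list.

    `chunks` is a list of (doc_id, text) pairs for the retrieved context.
    """
    return {
        "query_id": query_id,
        "query": query,
        "gt_answer": gt_answer,
        "response": response,
        "retrieved_context": [{"doc_id": d, "text": t} for d, t in chunks],
    }


def find_problems(payload):
    """Return a list of human-readable problems; empty means the layout matches."""
    results = payload.get("results")
    if not isinstance(results, list):
        return ['top-level "results" must be a list']
    problems = []
    for i, item in enumerate(results):
        for key in EXPECTED_KEYS:
            if key not in item:
                problems.append(f"results[{i}] is missing {key!r}")
        for j, chunk in enumerate(item.get("retrieved_context", [])):
            for key in ("doc_id", "text"):
                if key not in chunk:
                    problems.append(
                        f"results[{i}].retrieved_context[{j}] is missing {key!r}"
                    )
    return problems


payload = {
    "results": [
        make_result(
            "0",
            "What is the capital of France?",
            "Paris is the capital of France.",
            "The capital of France is Paris.",
            [("doc-1", "Paris is the capital and largest city of France.")],
        )
    ]
}
problems = find_problems(payload)
serialized = json.dumps(payload, indent=2)
print(problems or "layout looks consistent")
```

The serialized string can then be written to the file passed as the CLI input path.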

Then, run the checking pipeline using the ragchecker-cli command, specifying your input and output paths, and the models for the extractor and checker:

ragchecker-cli \
    --input_path=examples/checking_inputs.json \
    --output_path=examples/checking_outputs.json \
    --extractor_name=bedrock/meta.llama3-1-70b-instruct-v1:0 \
    --checker_name=bedrock/meta.llama3-1-70b-instruct-v1:0 \
    --batch_size_extractor=64 \
    --batch_size_checker=64 \
    --metrics all_metrics

The output reports metrics in three groups (overall, retriever, and generator):

{
  "overall_metrics": {
    "precision": 73.3,
    "recall": 62.5,
    "f1": 67.3
  },
  "retriever_metrics": {
    "claim_recall": 61.4,
    "context_precision": 87.5
  },
  "generator_metrics": {
    "context_utilization": 87.5,
    "noise_sensitivity_in_relevant": 22.5,
    "noise_sensitivity_in_irrelevant": 0.0,
    "hallucination": 4.2,
    "self_knowledge": 25.0,
    "faithfulness": 70.8
  }
}
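
For logging or for comparing runs, the nested metrics can be flattened into a single table. This is a small sketch that assumes the metrics are available as a dictionary with exactly the three groups shown above; flatten_metrics is a hypothetical helper, not a RAGChecker function, and the sample values are copied from the example output.

```python
import json


def flatten_metrics(metrics):
    """Turn {"group": {"name": value}} into {"group.name": value}."""
    return {
        f"{group}.{name}": value
        for group, values in metrics.items()
        for name, value in values.items()
    }


# Sample values copied from the example output above.
sample = json.loads("""
{
  "overall_metrics": {"precision": 73.3, "recall": 62.5, "f1": 67.3},
  "retriever_metrics": {"claim_recall": 61.4, "context_precision": 87.5},
  "generator_metrics": {
    "context_utilization": 87.5,
    "noise_sensitivity_in_relevant": 22.5,
    "noise_sensitivity_in_irrelevant": 0.0,
    "hallucination": 4.2,
    "self_knowledge": 25.0,
    "faithfulness": 70.8
  }
}
""")

flat = flatten_metrics(sample)
for key, value in sorted(flat.items()):
    print(f"{key:<50} {value:5.1f}")
```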

Python API Example

You can also integrate RAGChecker directly into your Python code:

from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# initialize ragresults from json/dict
with open("examples/checking_inputs.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# set up the evaluator
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32
)

# evaluate results with selected metrics or certain groups, e.g., retriever_metrics, generator_metrics, all_metrics
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

Why Use RAGChecker

RAGChecker helps developers and researchers evaluate, diagnose, and improve their RAG systems. Its key features are:

Holistic Evaluation: Offers Overall Metrics for a comprehensive assessment of the entire RAG pipeline.
Diagnostic Metrics: Provides Diagnostic Retriever Metrics and Diagnostic Generator Metrics that analyze each component separately, showing where improvements should be targeted.
Fine-grained Evaluation: Uses claim-level entailment operations, so that text is evaluated claim by claim rather than as a whole.
Benchmark Dataset: Includes a comprehensive RAG benchmark dataset for robust testing.
Meta-Evaluation: Features a human-annotated preference dataset to correlate RAGChecker's results with human judgments.
LlamaIndex Integration: Integrates with LlamaIndex, so it can be used to evaluate RAG applications built with LlamaIndex.
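
The claim-level idea behind the overall metrics can be illustrated with a toy calculation. This sketch assumes that claims have already been extracted and that each one carries a boolean entailment label; in RAGChecker those steps are handled by the extractor and checker models. It also assumes one plausible reading of the overall metrics: precision as the share of response claims entailed by the ground-truth answer, and recall as the share of ground-truth claims entailed by the response. The function names are hypothetical and this is not RAGChecker's implementation.

```python
def percent_true(labels):
    """Percentage of True values in a list of booleans (0.0 for an empty list)."""
    return 100.0 * sum(labels) / len(labels) if labels else 0.0


def claim_level_scores(response_claims_entailed, gt_claims_entailed):
    """Compute toy claim-level precision, recall, and F1 (as percentages).

    response_claims_entailed: one bool per response claim, True if the
        ground-truth answer entails that claim.
    gt_claims_entailed: one bool per ground-truth claim, True if the
        response entails that claim.
    """
    precision = percent_true(response_claims_entailed)
    recall = percent_true(gt_claims_entailed)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 of 4 response claims are supported by the ground
# truth, and the response covers 1 of 2 ground-truth claims.
scores = claim_level_scores([True, True, True, False], [True, False])
print(scores)
```

Because the score is built from individual claims, a low value can be traced back to the specific claims that failed the entailment check.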