RAGChecker: A Fine-grained Framework for Diagnosing RAG Systems
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
RAGChecker is an automatic evaluation framework developed by Amazon Science for assessing and diagnosing Retrieval-Augmented Generation (RAG) systems. It provides overall, retriever, and generator metrics so that developers and researchers can see where a RAG pipeline falls short and improve it.
Introduction
RAGChecker evaluates both the retrieval and the generation components of a RAG system, which helps locate whether a problem comes from the retriever or the generator. It relies on claim-level entailment operations rather than whole-response comparison, so each metric is computed over individual claims and points to a specific kind of failure.
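To make the idea of claim-level evaluation more concrete, here is a minimal sketch. It is not RAGChecker's implementation: it assumes claims have already been extracted as short strings, and it replaces the model-based entailment check with a hypothetical `is_entailed` stub that only does a normalized exact match. The function name `claim_precision_recall` is also an illustration, not part of the library.

```python
from typing import Callable, Iterable, Sequence


def is_entailed(claim: str, reference_claims: Iterable[str]) -> bool:
    """Hypothetical stand-in for an entailment checker.

    A real checker would ask a model whether the reference text entails
    the claim; this stub only compares normalized strings.
    """
    normalized = {ref.strip().lower() for ref in reference_claims}
    return claim.strip().lower() in normalized


def claim_precision_recall(
    response_claims: Sequence[str],
    gt_claims: Sequence[str],
    entails: Callable[[str, Iterable[str]], bool] = is_entailed,
) -> tuple[float, float, float]:
    """Return (precision, recall, f1) as fractions in [0, 1].

    precision: share of response claims supported by the ground truth.
    recall:    share of ground-truth claims covered by the response.
    """
    correct = sum(entails(claim, gt_claims) for claim in response_claims)
    covered = sum(entails(claim, response_claims) for claim in gt_claims)
    precision = correct / len(response_claims) if response_claims else 0.0
    recall = covered / len(gt_claims) if gt_claims else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Passing a different `entails` callable is the intended extension point of this sketch: the scoring logic stays the same while the entailment judgment is swapped out.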
Installation
To get started with RAGChecker, you can install it via pip and download the necessary spaCy model:
pip install ragchecker
python -m spacy download en_core_web_sm
Examples
RAGChecker supports both a command-line interface (CLI) and a Python API for evaluating your RAG systems.
CLI Example
First, prepare your data in a JSON format similar to the example below, where gt_answer is the only required annotation for each query:
{
  "results": [
    {
      "query_id": "<query id>",
      "query": "<input query>",
      "gt_answer": "<ground truth answer>",
      "response": "<response generated by the RAG generator>",
      "retrieved_context": [
        {
          "doc_id": "<doc id>",
          "text": "<content of the chunk>"
        }
      ]
    }
  ]
}
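If your RAG outputs live in Python objects rather than in a file, a small helper can assemble them into the layout shown above. The following is a sketch using only the standard library. The helper name `build_checking_input`, the decision to reject records with missing keys, and the conversion of ids to strings are assumptions made for illustration, not behavior defined by RAGChecker.

```python
import json

# Keys that appear in each entry of the example input above.
EXPECTED_KEYS = ("query_id", "query", "gt_answer", "response", "retrieved_context")


def build_checking_input(records: list[dict]) -> dict:
    """Assemble per-query records into the {"results": [...]} layout.

    Hypothetical helper: raises ValueError if a record lacks one of the
    keys shown in the example, and stores ids as strings (an assumption).
    """
    results = []
    for index, record in enumerate(records):
        missing = [key for key in EXPECTED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
        chunks = [
            {"doc_id": str(chunk["doc_id"]), "text": chunk["text"]}
            for chunk in record["retrieved_context"]
        ]
        results.append(
            {
                "query_id": str(record["query_id"]),
                "query": record["query"],
                "gt_answer": record["gt_answer"],
                "response": record["response"],
                "retrieved_context": chunks,
            }
        )
    return {"results": results}
```

The returned dictionary can be serialized with `json.dumps(payload, indent=2)` and written to the path you later pass as the input file.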
Then, run the checking pipeline using the ragchecker-cli command, specifying your input and output paths, and the models for the extractor and checker:
ragchecker-cli \
    --input_path=examples/checking_inputs.json \
    --output_path=examples/checking_outputs.json \
    --extractor_name=bedrock/meta.llama3-1-70b-instruct-v1:0 \
    --checker_name=bedrock/meta.llama3-1-70b-instruct-v1:0 \
    --batch_size_extractor=64 \
    --batch_size_checker=64 \
    --metrics all_metrics
The output will provide detailed metrics:
{
  "overall_metrics": {
    "precision": 73.3,
    "recall": 62.5,
    "f1": 67.3
  },
  "retriever_metrics": {
    "claim_recall": 61.4,
    "context_precision": 87.5
  },
  "generator_metrics": {
    "context_utilization": 87.5,
    "noise_sensitivity_in_relevant": 22.5,
    "noise_sensitivity_in_irrelevant": 0.0,
    "hallucination": 4.2,
    "self_knowledge": 25.0,
    "faithfulness": 70.8
  }
}
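When comparing two runs, for example before and after changing the retriever, it can help to flatten the nested metrics into a single mapping and look at the differences. The sketch below assumes each report is a dictionary with the grouped layout shown above. The helper names `flatten_metrics` and `metric_deltas` are hypothetical, and any top-level entry that is not a dictionary of numbers is skipped, since the exact contents of a real output file may differ from this excerpt.

```python
def flatten_metrics(report: dict) -> dict[str, float]:
    """Turn {"group": {"metric": value}} into {"group.metric": value}.

    Entries that are not dictionaries, and values that are not numbers,
    are ignored (an assumption about what a real report may contain).
    """
    flat = {}
    for group, metrics in report.items():
        if not isinstance(metrics, dict):
            continue
        for name, value in metrics.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number:
                flat[f"{group}.{name}"] = float(value)
    return flat


def metric_deltas(baseline: dict, candidate: dict) -> dict[str, float]:
    """Return candidate minus baseline for every metric present in both."""
    base = flatten_metrics(baseline)
    cand = flatten_metrics(candidate)
    shared = sorted(base.keys() & cand.keys())
    # Round to hide floating-point noise in the subtraction.
    return {key: round(cand[key] - base[key], 6) for key in shared}
```

A positive delta is not always an improvement: for metrics such as `hallucination` or the noise sensitivity scores, lower values are the desirable direction.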
Python API Example
You can also integrate RAGChecker directly into your Python code:
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# initialize ragresults from json/dict
with open("examples/checking_inputs.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# set-up the evaluator
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-1-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32
)

# evaluate results with selected metrics or certain groups, e.g., retriever_metrics, generator_metrics, all_metrics
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)
Why Use RAGChecker
RAGChecker helps developers and researchers evaluate, diagnose, and improve their RAG systems. Its key benefits include:
- Holistic Evaluation: Offers Overall Metrics for a comprehensive assessment of the entire RAG pipeline.
- Diagnostic Metrics: Provides Diagnostic Retriever Metrics and Diagnostic Generator Metrics to analyze specific components, offering valuable insights for targeted improvements.
- Fine-grained Evaluation: Utilizes claim-level entailment operations for highly detailed evaluation.
- Benchmark Dataset: Includes a comprehensive RAG benchmark dataset for robust testing.
- Meta-Evaluation: Features a human-annotated preference dataset to correlate RAGChecker's results with human judgments.
- LlamaIndex Integration: Seamlessly integrates with LlamaIndex, making it a powerful evaluation tool for RAG applications built with LlamaIndex.
Links
- GitHub Repository: https://github.com/amazon-science/RAGChecker
- RAGChecker Paper (arXiv): https://arxiv.org/pdf/2408.08067
- Tutorial (English): https://github.com/amazon-science/RAGChecker/blob/main/tutorial/ragchecker_tutorial_en.md
- LlamaIndex Integration Documentation: https://docs.llamaindex.ai/en/latest/examples/evaluation/RAGChecker/