Lighteval: Your All-in-One Toolkit for LLM Evaluation
This repository profile is provided by osrepos.com, an open source repository discovery platform.
Summary
Lighteval is a comprehensive toolkit from Hugging Face for evaluating Large Language Models (LLMs) across various backends. It enables users to dive deep into model performance by saving detailed, sample-by-sample results and supports over 1000 evaluation tasks. The framework offers extensive customization options, allowing users to create custom tasks and metrics tailored to their specific needs.
Introduction
Lighteval is your all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends, whether your model is being served remotely or is already loaded in memory. Developed by Hugging Face's Leaderboard and Evals Team, Lighteval helps you dive deep into your model's performance by saving and exploring detailed, sample-by-sample results, enabling effective debugging and comparison.
With Lighteval, customization is at your fingertips. You can browse over 1000 existing tasks and metrics or effortlessly create your own custom tasks and metrics. The framework supports a wide array of evaluation tasks across domains like Knowledge, Math and Code, Chat Model Evaluation, Multilingual Evaluation, and Core Language Understanding.
Installation
Lighteval is currently untested on Windows, but it should be fully functional on Mac and Linux systems.
To install Lighteval, use pip:
pip install lighteval
Lighteval supports many optional extras at install time; they are listed in the official documentation. If you plan to push results to the Hugging Face Hub, log in with your access token first:
hf auth login
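For non-interactive environments such as CI jobs, the token can also be supplied through the environment rather than an interactive login. The sketch below assumes the `HF_TOKEN` variable, which the `huggingface_hub` library reads; the helper name `hub_token_available` is hypothetical, and it only checks that a non-empty value is set, not that the token is valid.

```python
import os


def hub_token_available(env=None):
    """Return True if a non-empty HF_TOKEN is present in the given mapping.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real credentials.
    """
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


# Report whether the current process could authenticate via the environment
if hub_token_available():
    print("HF_TOKEN is set.")
else:
    print("HF_TOKEN is not set; run `hf auth login` or export the variable.")
```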
Examples
Lighteval offers flexible entry points for model evaluation, including command-line interface (CLI) and Python API options.
CLI Example
Here's a quick command to evaluate a model using a remote inference service:
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
This command evaluates the specified model on the gpqa:diamond benchmark using Hugging Face's inference providers.
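The same command can be launched from a Python script, for example to loop over several models. This is a minimal sketch using only the standard library: the helper names `build_eval_command` and `run_eval` are hypothetical, and the only CLI form assumed is the `lighteval eval <model> <task>` invocation shown above.

```python
import shlex
import shutil
import subprocess


def build_eval_command(model, tasks):
    """Build the argument list for `lighteval eval <model> <tasks>`."""
    return ["lighteval", "eval", model, tasks]


def run_eval(model, tasks):
    """Run the evaluation if `lighteval` is on PATH.

    Returns the process exit code, or None when the executable is missing.
    """
    if shutil.which("lighteval") is None:
        return None  # Lighteval is not installed in this environment
    completed = subprocess.run(build_eval_command(model, tasks), check=False)
    return completed.returncode


# Show the command that would be executed, without running it
command = build_eval_command(
    "hf-inference-providers/openai/gpt-oss-20b", "gpqa:diamond"
)
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting issues with model names that contain slashes or colons.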
Python API Example
For models already loaded in memory, you can use the Python API:
from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"

# Save evaluation output under ./results
evaluation_tracker = EvaluationTracker(output_dir="./results")

# No parallelism launcher; max_samples=2 keeps the run small for a quick test
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2,
)

# Load the model with transformers, then wrap it for Lighteval
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

# Run the evaluation, print a summary, and fetch the results object
results = pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()
This Python script demonstrates how to load a model, configure evaluation parameters, and run benchmarks using Lighteval's pipeline.
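Once the pipeline has run, the saved output can be inspected from Python. The following is a minimal sketch that assumes the tracker writes JSON files somewhere under the output directory (`./results` in the script above); the exact directory layout and the keys inside each file are not described here, so the code only discovers and parses files. The helper name `load_json_results` is hypothetical.

```python
import json
from pathlib import Path


def load_json_results(output_dir):
    """Parse every *.json file under output_dir.

    Returns a dict keyed by each file's path relative to output_dir.
    A missing directory yields an empty dict.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return {}
    loaded = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.relative_to(root).as_posix()] = json.load(handle)
    return loaded


for name, payload in load_json_results("./results").items():
    # Print only top-level keys, since the file schema is an assumption here
    keys = sorted(payload) if isinstance(payload, dict) else type(payload).__name__
    print(name, keys)
```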
Why Use Lighteval
Lighteval stands out as a powerful tool for LLM evaluation due to several key advantages:
- Flexibility: Evaluate models served remotely or already in memory, supporting multiple backends like inspect-ai, Accelerate, Nanotron, vLLM, SGLang, and various inference endpoints.
- Comprehensive Evaluation: Access to over 1000 evaluation tasks and popular benchmarks across diverse domains and languages, ensuring thorough model assessment.
- Detailed Insights: Save and explore sample-by-sample results, providing granular data for in-depth debugging and performance comparison.
- Customization: Easily create custom tasks and metrics to fit unique evaluation requirements, adapting the framework to specific research or application needs.
- Community-driven: Inspired by leading evaluation frameworks like EleutherAI's LM Evaluation Harness and Stanford's HELM, Lighteval welcomes community contributions.
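To illustrate why sample-level records matter for comparison, the sketch below diffs per-sample scores from two runs. The input shape (a mapping from sample ID to a numeric score), the example data, and the helper name `diff_sample_scores` are assumptions for illustration, not Lighteval's storage format.

```python
def diff_sample_scores(run_a, run_b):
    """Return {sample_id: (score_a, score_b)} for samples the two runs score differently.

    Only sample IDs present in both runs are compared.
    """
    shared = run_a.keys() & run_b.keys()
    return {
        sample_id: (run_a[sample_id], run_b[sample_id])
        for sample_id in sorted(shared)
        if run_a[sample_id] != run_b[sample_id]
    }


# Hypothetical per-sample scores (1.0 = correct, 0.0 = incorrect)
baseline = {"q1": 1.0, "q2": 0.0, "q3": 1.0}
candidate = {"q1": 1.0, "q2": 1.0, "q3": 0.0}

print(diff_sample_scores(baseline, candidate))
# {'q2': (0.0, 1.0), 'q3': (1.0, 0.0)}
```

Both hypothetical runs answer two of three samples correctly, so an aggregate score alone would hide the fact that they fail on different questions.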
Links
- GitHub Repository: huggingface/lighteval
- Documentation: Lighteval Documentation
- Open Benchmark Index: Open Benchmark Index