# Lighteval: Your All-in-One Toolkit for LLM Evaluation

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Repository profile: https://osrepos.com/repo/huggingface-lighteval

Lighteval is a comprehensive toolkit from Hugging Face for evaluating Large Language Models (LLMs) across various backends. It enables users to dive deep into model performance by saving detailed, sample-by-sample results and supports over 1000 evaluation tasks. The framework offers extensive customization options, allowing users to create custom tasks and metrics tailored to their specific needs.

GitHub: https://github.com/huggingface/lighteval

## Topics

- evaluation
- evaluation-framework
- evaluation-metrics
- huggingface
- Python
- LLM
- AI
- machine-learning

## Repository Information

Last analyzed by OSRepos: Wed Jul 01 2026 08:35:47 GMT+0100 (Western European Summer Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Content

## Introduction

Lighteval is your all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends, whether your model is being served remotely or is already loaded in memory. Developed by Hugging Face's Leaderboard and Evals Team, Lighteval helps you dive deep into your model's performance by saving and exploring detailed, sample-by-sample results, enabling effective debugging and comparison.

With Lighteval, customization is at your fingertips. You can browse over 1000 existing tasks and metrics or effortlessly create your own custom tasks and metrics. The framework supports a wide array of evaluation tasks across domains like Knowledge, Math and Code, Chat Model Evaluation, Multilingual Evaluation, and Core Language Understanding.

## Installation

Lighteval is currently untested on Windows, but it should be fully functional on macOS and Linux.

To install Lighteval, use pip:

```bash
pip install lighteval
```

Lighteval provides many optional extras at install time; they are listed in the official documentation. If you plan to push results to the Hugging Face Hub, log in with your access token first:

```shell
hf auth login
```

## Examples

Lighteval offers flexible entry points for model evaluation, including command-line interface (CLI) and Python API options.

### CLI Example

Here's a quick command to evaluate a model using a remote inference service:

```shell
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
```

This command evaluates the specified model on the `gpqa:diamond` benchmark using Hugging Face's inference providers.
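
To launch the same evaluation from a script, the command can be assembled and run with Python's `subprocess` module. The following is a minimal sketch: the helper names `build_eval_command` and `run_eval` are hypothetical, it assumes the `lighteval` executable is on your `PATH`, and it uses only the `eval` subcommand and the two positional arguments shown above.

```python
import shutil
import subprocess


def build_eval_command(model: str, task: str) -> list[str]:
    """Assemble the argument list for the `lighteval eval` call shown above."""
    return ["lighteval", "eval", model, task]


def run_eval(model: str, task: str) -> int:
    """Run the evaluation and return the process exit code."""
    # Assumption: lighteval was installed with pip and is reachable on PATH.
    if shutil.which("lighteval") is None:
        raise RuntimeError("lighteval is not installed or not on PATH")
    # Passing a list (without shell=True) avoids shell quoting problems.
    completed = subprocess.run(build_eval_command(model, task), check=False)
    return completed.returncode
```

Calling `run_eval("hf-inference-providers/openai/gpt-oss-20b", "gpqa:diamond")` would then start the same run as the shell command.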

### Python API Example

For models already loaded in memory, you can use the Python API:

```python
from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"

# Save evaluation output under ./results
evaluation_tracker = EvaluationTracker(output_dir="./results")

# No parallel launcher; max_samples=2 limits the run to two samples
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2,
)

# Load the model with transformers, then wrap it for Lighteval
hf_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(hf_model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

# Run the evaluation, print a summary, then fetch the results
pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()
```

This Python script demonstrates how to load a model, configure evaluation parameters, and run benchmarks using Lighteval's pipeline.
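
Once a run has finished, the saved output under `output_dir` can be inspected from Python. The sketch below is an illustration only: `find_latest_results` is a hypothetical helper, not part of Lighteval, and it assumes that results are written as JSON files matching `results_*.json` somewhere below the output directory. The exact layout and file names may differ between Lighteval versions, so check your own `./results` folder first.

```python
import json
from pathlib import Path
from typing import Any, Optional


def find_latest_results(output_dir: str) -> Optional[dict[str, Any]]:
    """Return the parsed contents of the newest results_*.json, or None."""
    # Assumption: result files are JSON and match the pattern results_*.json.
    candidates = sorted(
        Path(output_dir).rglob("results_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        return None
    # The last candidate has the most recent modification time.
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `find_latest_results("./results")` returns `None` when no matching file exists, which makes it easy to detect a run that did not write any output.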

## Why Use Lighteval

Lighteval stands out as a powerful tool for LLM evaluation due to several key advantages:

*   **Flexibility**: Evaluate models served remotely or already in memory, supporting multiple backends like inspect-ai, Accelerate, Nanotron, vLLM, SGLang, and various inference endpoints.
*   **Comprehensive Evaluation**: Access to over 1000 evaluation tasks and popular benchmarks across diverse domains and languages, ensuring thorough model assessment.
*   **Detailed Insights**: Save and explore sample-by-sample results, providing granular data for in-depth debugging and performance comparison.
*   **Customization**: Easily create custom tasks and metrics to fit unique evaluation requirements, adapting the framework to specific research or application needs.
*   **Community-driven**: Inspired by leading evaluation frameworks like EleutherAI's LM Evaluation Harness and Stanford's HELM, Lighteval benefits from a vibrant community and welcomes contributions.
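
To make sample-by-sample results and custom metrics more concrete, here is a library-independent sketch. It does not use Lighteval's task or metric classes: `exact_match` and `score_samples` are hypothetical names, and the normalization (trimming whitespace and lowercasing) is an assumption chosen for illustration. The point is that per-sample scores are kept next to the aggregate, so individual failures can be inspected later.

```python
from statistics import mean


def exact_match(prediction: str, reference: str) -> float:
    """Score one sample: 1.0 if the normalized strings are equal, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def score_samples(predictions: list[str], references: list[str]) -> dict:
    """Keep per-sample scores alongside the aggregate accuracy."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    per_sample = [exact_match(p, r) for p, r in zip(predictions, references)]
    return {
        "per_sample": per_sample,
        "accuracy": mean(per_sample) if per_sample else 0.0,
    }


report = score_samples(["42", " Paris ", "blue"], ["42", "paris", "red"])
print(report["per_sample"])  # [1.0, 1.0, 0.0]
```

Here the third sample is the only miss, which the per-sample list shows directly while the aggregate alone would not.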

## Links

*   **GitHub Repository**: [huggingface/lighteval](https://github.com/huggingface/lighteval)
*   **Documentation**: [Lighteval Documentation](https://huggingface.co/docs/lighteval/main/en/index)
*   **Open Benchmark Index**: [Open Benchmark Index](https://huggingface.co/spaces/OpenEvals/open_benchmark_index)