Lighteval: Your All-in-One Toolkit for LLM Evaluation

Introduction

Lighteval is an all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends, whether your model is served remotely or already loaded in memory. Developed by Hugging Face's Leaderboard and Evals Team, Lighteval saves detailed, sample-by-sample results that you can explore to debug a model and compare it against others.

Lighteval is also customizable. You can browse over 1000 existing tasks and metrics or create your own. The framework supports evaluation tasks across domains such as Knowledge, Math and Code, Chat Model Evaluation, Multilingual Evaluation, and Core Language Understanding.

Installation

Lighteval is currently untested on Windows, but it should be fully functional on Mac and Linux systems.

To install Lighteval, use pip:

pip install lighteval

Lighteval supports a number of optional extras at installation time; these are listed in the official documentation. If you plan to push results to the Hugging Face Hub, log in with your access token first:

hf auth login
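
In non-interactive environments such as CI jobs, the token is often supplied through an environment variable rather than an interactive login. The following is a minimal sketch, assuming the token is exported as HF_TOKEN (the variable read by huggingface_hub); the helper name hub_token_available is hypothetical and not part of Lighteval.

```python
import os


def hub_token_available(env=None):
    """Return True if a non-empty HF_TOKEN is present in the environment.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if __name__ == "__main__":
    if hub_token_available():
        print("HF_TOKEN is set.")
    else:
        print("HF_TOKEN is not set; run `hf auth login` or export the token first.")
```

Running this check before a long evaluation avoids finding out about a missing token only when the results are about to be pushed.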

Examples

Lighteval offers two entry points for model evaluation: a command-line interface (CLI) and a Python API.

CLI Example

Here's a quick command to evaluate a model using a remote inference service:

lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond

This command evaluates the specified model on the gpqa:diamond benchmark using Hugging Face's inference providers.
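
If the same command needs to be launched from a script, it can be assembled as an argument list. This sketch only reproduces the command shown above; the helper name build_eval_command is hypothetical, and no flags beyond the model and task arguments from the text are assumed.

```python
import shlex


def build_eval_command(model, tasks):
    """Build the argument list for `lighteval eval <model> <tasks>`."""
    return ["lighteval", "eval", model, tasks]


cmd = build_eval_command("hf-inference-providers/openai/gpt-oss-20b", "gpqa:diamond")

# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(cmd))

# To actually run it (requires lighteval to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when model or task names contain special characters.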

Python API Example

For models already loaded in memory, you can use the Python API:

from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"

evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2
)

# Load the model with transformers, then wrap it so Lighteval can drive it
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

# Run the benchmarks, print a summary, then fetch the results for further use
pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()

This Python script demonstrates how to load a model, configure evaluation parameters, and run benchmarks using Lighteval's pipeline.
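
Once the script has run, the object returned by pipeline.get_results() can be inspected programmatically. The sketch below assumes it is a nested dictionary; the exact keys are not documented here, so the helper (a hypothetical name, flatten_metrics) works on any nested dict, and the example structure and values are placeholders rather than real output.

```python
def flatten_metrics(nested, prefix=""):
    """Flatten a nested dict into {"a/b/c": value} pairs for easy printing."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the path.
            flat.update(flatten_metrics(value, path))
        else:
            flat[path] = value
    return flat


# Placeholder structure and values for illustration only; NOT real Lighteval output.
example = {"results": {"task_a": {"metric_x": 0.0}}, "config": {"max_samples": 2}}

for path, value in flatten_metrics(example).items():
    print(path, value)
```

In the script above, the call would be flatten_metrics(results), provided results is in fact a dictionary.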

Why Use Lighteval

Lighteval stands out as a powerful tool for LLM evaluation due to several key advantages:

Flexibility: Evaluate models served remotely or already in memory, supporting multiple backends like inspect-ai, Accelerate, Nanotron, vLLM, SGLang, and various inference endpoints.
Comprehensive Evaluation: Access to over 1000 evaluation tasks and popular benchmarks across diverse domains and languages, ensuring thorough model assessment.
Detailed Insights: Save and explore sample-by-sample results, providing granular data for in-depth debugging and performance comparison.
Customization: Easily create custom tasks and metrics to fit unique evaluation requirements, adapting the framework to specific research or application needs.
Community-driven: Inspired by established evaluation frameworks such as EleutherAI's LM Evaluation Harness and Stanford's HELM, Lighteval is developed in the open and welcomes contributions.
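
To make the "Detailed Insights" point concrete, saved results can be located on disk after a run. The following sketch assumes the output directory passed to EvaluationTracker (./results in the example above) contains JSON and/or Parquet files; the file formats and the helper name list_output_files are assumptions, not a description of Lighteval's documented layout.

```python
from pathlib import Path


def list_output_files(output_dir, suffixes=(".json", ".parquet")):
    """Return all files under output_dir with one of the given suffixes, sorted."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return []
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes)


if __name__ == "__main__":
    for path in list_output_files("./results"):
        print(path)
```

The listed files can then be opened with the usual tools for their format to compare individual samples across runs.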

