{"name":"Lighteval: Your All-in-One Toolkit for LLM Evaluation","description":"Lighteval is a comprehensive toolkit from Hugging Face for evaluating Large Language Models (LLMs) across various backends. It enables users to dive deep into model performance by saving detailed, sample-by-sample results and supports over 1000 evaluation tasks. The framework offers extensive customization options, allowing users to create custom tasks and metrics tailored to their specific needs.","github":"https://github.com/huggingface/lighteval","url":"https://osrepos.com/repo/huggingface-lighteval","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/huggingface-lighteval","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/huggingface-lighteval.md","json":"https://osrepos.com/repo/huggingface-lighteval.json","topics":["evaluation","evaluation-framework","evaluation-metrics","huggingface","Python","LLM","AI","machine-learning"],"keywords":["evaluation","evaluation-framework","evaluation-metrics","huggingface","Python","LLM","AI","machine-learning"],"stars":null,"summary":"Lighteval is a comprehensive toolkit from Hugging Face for evaluating Large Language Models (LLMs) across various backends. It enables users to dive deep into model performance by saving detailed, sample-by-sample results and supports over 1000 evaluation tasks. The framework offers extensive customization options, allowing users to create custom tasks and metrics tailored to their specific needs.","content":"## Introduction\n\nLighteval is your all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends, whether your model is being served remotely or is already loaded in memory. 
Developed by Hugging Face's Leaderboard and Evals Team, Lighteval helps you dive deep into your model's performance by saving and exploring detailed, sample-by-sample results, enabling effective debugging and comparison.\n\nWith Lighteval, customization is at your fingertips. You can browse over 1000 existing tasks and metrics or create your own custom tasks and metrics. The framework supports a wide array of evaluation tasks across domains like Knowledge, Math and Code, Chat Model Evaluation, Multilingual Evaluation, and Core Language Understanding.\n\n## Installation\n\nLighteval is currently untested on Windows, but it should be fully functional on Mac and Linux systems.\n\nTo install Lighteval, use pip:\n\n```bash\npip install lighteval\n```\n\nLighteval supports many optional extras at install time, which are listed in the official documentation. If you plan to push results to the Hugging Face Hub, remember to log in with your access token first:\n\n```shell\nhf auth login\n```\n\n## Examples\n\nLighteval offers flexible entry points for model evaluation, including command-line interface (CLI) and Python API options.\n\n### CLI Example\n\nHere's a quick command to evaluate a model using a remote inference service:\n\n```shell\nlighteval eval \"hf-inference-providers/openai/gpt-oss-20b\" gpqa:diamond\n```\n\nThis command evaluates the specified model on the `gpqa:diamond` benchmark using Hugging Face's inference providers.\n\n### Python API Example\n\nFor models already loaded in memory, you can use the Python API:\n\n```python\nfrom transformers import AutoModelForCausalLM\n\nfrom lighteval.logging.evaluation_tracker import EvaluationTracker\nfrom lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig\nfrom lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters\n\n\nMODEL_NAME = \"meta-llama/Meta-Llama-3-8B-Instruct\"\nBENCHMARKS = \"gsm8k\"\n\nevaluation_tracker = 
EvaluationTracker(output_dir=\"./results\")\npipeline_params = PipelineParameters(\n    launcher_type=ParallelismManager.NONE,\n    max_samples=2\n)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    MODEL_NAME, device_map=\"auto\"\n)\nconfig = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)\nmodel = TransformersModel.from_model(model, config)\n\npipeline = Pipeline(\n    model=model,\n    pipeline_parameters=pipeline_params,\n    evaluation_tracker=evaluation_tracker,\n    tasks=BENCHMARKS,\n)\n\nresults = pipeline.evaluate()\npipeline.show_results()\nresults = pipeline.get_results()\n```\n\nThis Python script demonstrates how to load a model, configure evaluation parameters, and run benchmarks using Lighteval's pipeline.\n\n## Why Use Lighteval\n\nLighteval stands out as a powerful tool for LLM evaluation due to several key advantages:\n\n*   **Flexibility**: Evaluate models served remotely or already in memory, supporting multiple backends like inspect-ai, Accelerate, Nanotron, vLLM, SGLang, and various inference endpoints.\n*   **Comprehensive Evaluation**: Access to over 1000 evaluation tasks and popular benchmarks across diverse domains and languages, ensuring thorough model assessment.\n*   **Detailed Insights**: Save and explore sample-by-sample results, providing granular data for in-depth debugging and performance comparison.\n*   **Customization**: Easily create custom tasks and metrics to fit unique evaluation requirements, adapting the framework to specific research or application needs.\n*   **Community-driven**: Inspired by leading evaluation frameworks like EleutherAI's LM Evaluation Harness and Stanford's HELM, Lighteval benefits from a vibrant community and welcomes contributions.\n\n## Links\n\n*   **GitHub Repository**: [huggingface/lighteval](https://github.com/huggingface/lighteval)\n*   **Documentation**: [Lighteval Documentation](https://huggingface.co/docs/lighteval/main/en/index)\n*   **Open Benchmark Index**: [Open Benchmark 
Index](https://huggingface.co/spaces/OpenEvals/open_benchmark_index)","metrics":{"detailViews":2,"githubClicks":1},"dates":{"published":null,"modified":"2026-07-01T07:35:47.000Z"}}