# EvalPlus: Rigorous Evaluation for LLM-Synthesized Code

This repository profile is provided by osrepos.com, an open source repository discovery platform.

EvalPlus is a robust framework designed for the rigorous evaluation of code generated by Large Language Models (LLMs). It extends standard benchmarks like HumanEval and MBPP with significantly more tests, offering precise assessment of code correctness and efficiency. This tool is crucial for developers and researchers aiming to thoroughly validate LLM-synthesized code.

GitHub: https://github.com/evalplus/evalplus
OSRepos URL: https://osrepos.com/repo/evalplus-evalplus

## Topics

- benchmark
- large-language-models
- program-synthesis
- code evaluation
- testing
- Python
- AI
- efficiency

## Repository Information

Last analyzed by OSRepos: Tue Jun 30 2026 17:19:36 GMT+0100 (Western European Summer Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Content

## Introduction

EvalPlus is a rigorous framework for evaluating code generated by Large Language Models (LLMs). Its papers appeared at NeurIPS 2023 (EvalPlus) and COLM 2024 (EvalPerf), and it addresses the need to assess both the correctness and the efficiency of AI-synthesized code. The project extends the HumanEval and MBPP benchmarks into HumanEval+ (80x more tests) and MBPP+ (35x more tests), and adds EvalPerf for evaluating code efficiency.
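
To illustrate why a larger test suite matters, here is a minimal sketch in plain Python. It does not use the EvalPlus API; the task, the function `buggy_max`, and both test lists are hypothetical and exist only for this example. It shows a flawed solution that passes a small base suite but fails once extra inputs are added.

```python
# Hypothetical task: return the largest element of a non-empty list.
def buggy_max(numbers):
    # Bug: starting from 0 silently breaks on all-negative inputs.
    best = 0
    for n in numbers:
        if n > best:
            best = n
    return best


# (input, expected output) pairs; both suites are made up for this sketch.
BASE_TESTS = [([1, 2, 3], 3), ([5, 4], 5)]
EXTRA_TESTS = [([-3, -1, -2], -1), ([-7], -7), ([0, -1], 0)]


def passes(solution, tests):
    """Return True only if the solution is correct on every test."""
    return all(solution(list(args)) == expected for args, expected in tests)


print(passes(buggy_max, BASE_TESTS))                # True
print(passes(buggy_max, BASE_TESTS + EXTRA_TESTS))  # False
```

The base suite never exercises negative inputs, so the bug goes unnoticed until the extended suite covers them.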

## Installation

To get started with EvalPlus for code correctness evaluation, you can install it via pip:

```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release
```


To include code efficiency evaluation with EvalPerf, use:

```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
```


## Examples

### Code Correctness Evaluation (HumanEval(+) or MBPP(+))

Evaluate a model on the HumanEval(+) or MBPP(+) dataset. Replace `[humaneval|mbpp]` with one of the two dataset names:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy
```
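
If you want to script the command for both datasets, a small Python wrapper may help. The sketch below only assembles the argument list from the flags shown above (`--model`, `--dataset`, `--backend`, `--greedy`); the helper name `build_evaluate_command` is hypothetical, and nothing is executed, since a real run needs the model and a suitable GPU setup.

```python
import shlex

# Dataset names taken from the command above.
DATASETS = ("humaneval", "mbpp")


def build_evaluate_command(model, dataset, backend="vllm"):
    """Build the evalplus.evaluate argument list for one dataset."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    return [
        "evalplus.evaluate",
        "--model", model,
        "--dataset", dataset,
        "--backend", backend,
        "--greedy",
    ]


for name in DATASETS:
    cmd = build_evaluate_command("ise-uiuc/Magicoder-S-DS-6.7B", name)
    # Printed rather than executed; use subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```

Validating the dataset name up front avoids starting a long-running job with a typo in the argument.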


### Code Efficiency Evaluation (EvalPerf)

To evaluate the efficiency of generated code, first set the kernel's `perf_event_paranoid` value to `0` so that performance events can be measured, then run the evaluation command:

```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```
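
Before launching EvalPerf, it can be useful to confirm that the setting took effect. The following is a minimal sketch, assuming a Linux host. The helper names are hypothetical, and treating any value at or below `0` as sufficient is an assumption based on the command above, not a documented EvalPlus requirement.

```python
from pathlib import Path

# Same file the sudo command above writes to.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def perf_is_enabled(level, required=0):
    """Assumption: a value at or below the one written above (0) is permissive enough."""
    return level is not None and level <= required


level = read_paranoid_level()
if perf_is_enabled(level):
    print(f"perf_event_paranoid={level}: ready for evalplus.evalperf")
else:
    print(f"perf_event_paranoid={level}: run the sudo command above first")
```

On systems without the file (for example non-Linux hosts), the level is reported as `None` and the check fails safely.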


## Why Use EvalPlus?

EvalPlus stands out in three respects:

*   **Precise Evaluation**: The expanded test suites (HumanEval+ and MBPP+) give a more detailed and accurate assessment of code correctness than the original benchmarks, revealing the true capabilities of LLMs. The latest LLM rankings are on the [official leaderboard](https://evalplus.github.io/leaderboard.html).
*   **Coding Rigorousness**: Comparing a model's score on the original tests with its score on the EvalPlus tests shows how robust its generated code is. A smaller drop indicates more rigorous code; a larger drop indicates more fragile code.
*   **Code Efficiency**: Beyond correctness, EvalPerf measures the efficiency of LLM-generated code using performance-exercising coding tasks and test inputs, showing whether correct code is also efficient.
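
The score comparison described under Coding Rigorousness can be sketched in a few lines of Python. The helper `pass_rate_drop` and the scores below are made up for illustration and are not EvalPlus output. The sketch assumes scores are percentages and that the score on the extended tests is never higher than the score on the original tests.

```python
def pass_rate_drop(base_score, plus_score):
    """Return (absolute drop in points, relative drop as a fraction of base)."""
    if not 0 <= plus_score <= base_score <= 100:
        raise ValueError("expected 0 <= plus_score <= base_score <= 100")
    absolute = base_score - plus_score
    relative = absolute / base_score if base_score else 0.0
    return absolute, relative


# Made-up (original tests, extended tests) pass rates for two hypothetical models.
scores = {"model_a": (80.0, 76.0), "model_b": (80.0, 60.0)}

for name, (base, plus) in scores.items():
    absolute, relative = pass_rate_drop(base, plus)
    print(f"{name}: -{absolute:.1f} points ({relative:.1%} of base)")
```

Both hypothetical models look identical on the original tests, yet `model_b` loses a quarter of its score on the extended tests, which is the kind of fragility the comparison is meant to expose.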

## Links

*   **GitHub Repository**: [https://github.com/evalplus/evalplus](https://github.com/evalplus/evalplus)
*   **Official Leaderboard**: [https://evalplus.github.io/leaderboard.html](https://evalplus.github.io/leaderboard.html)
*   **EvalPlus Paper (NeurIPS'23)**: [https://openreview.net/forum?id=1qvx610Cu7](https://openreview.net/forum?id=1qvx610Cu7)
*   **EvalPerf Paper (COLM'24)**: [https://openreview.net/forum?id=IBCBMeAhmC](https://openreview.net/forum?id=IBCBMeAhmC)
*   **Hugging Face**: [https://huggingface.co/evalplus/](https://huggingface.co/evalplus/)
*   **PyPI Package**: [https://pypi.org/project/evalplus/](https://pypi.org/project/evalplus/)
*   **Docker Image**: [https://hub.docker.com/r/ganler/evalplus](https://hub.docker.com/r/ganler/evalplus)