EvalPlus: Rigorous Evaluation for LLM-Synthesized Code
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
EvalPlus is a framework for rigorously evaluating code generated by Large Language Models (LLMs). It extends the standard HumanEval and MBPP benchmarks with many more tests per task, giving a stricter assessment of code correctness, and adds a separate benchmark for code efficiency. It is aimed at developers and researchers who need to validate LLM-synthesized code thoroughly.
Introduction
EvalPlus is a rigorous framework for evaluating code generated by Large Language Models (LLMs). Its papers appeared at NeurIPS 2023 (EvalPlus) and COLM 2024 (EvalPerf). The project addresses the need to assess both the correctness and the efficiency of LLM-synthesized code: it expands the existing HumanEval and MBPP benchmarks into HumanEval+ (80x more tests) and MBPP+ (35x more tests), and provides EvalPerf for evaluating code efficiency.
Installation
To get started with EvalPlus for code correctness evaluation, you can install it via pip:
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release
To include code efficiency evaluation with EvalPerf, use:
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
Examples
Code Correctness Evaluation (HumanEval(+) or MBPP(+))
Evaluate a model on HumanEval(+) or MBPP(+). In the command below, [humaneval|mbpp] is a placeholder: pass either humaneval or mbpp to --dataset.
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
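As a minimal sketch of running the same evaluation on both datasets, a small Python helper can assemble the command line once per dataset name. The helper name `build_evaluate_command` is hypothetical and not part of EvalPlus; it uses only the flags shown in the command above and prints the commands rather than executing them.

```python
import shlex

DATASETS = ("humaneval", "mbpp")  # the two dataset names shown in the command above


def build_evaluate_command(model, dataset, backend="vllm", greedy=True):
    """Return the argv list for one evalplus.evaluate run (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    argv = [
        "evalplus.evaluate",
        "--model", model,
        "--dataset", dataset,
        "--backend", backend,
    ]
    if greedy:
        argv.append("--greedy")
    return argv


if __name__ == "__main__":
    for name in DATASETS:
        # Print each command; pass the list to subprocess.run to execute it instead.
        print(shlex.join(build_evaluate_command("ise-uiuc/Magicoder-S-DS-6.7B", name)))
```

Building the command as a list avoids shell quoting problems if a model name contains unusual characters.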
Code Efficiency Evaluation (EvalPerf)
To evaluate the efficiency of generated code, first set the kernel parameter perf_event_paranoid to 0, which enables access to Linux perf events, and then run the evaluation:
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
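Before launching EvalPerf it can be useful to confirm that the kernel setting actually took effect. The following is a minimal sketch, assuming a Linux host. The function names are hypothetical, and the threshold of 0 is taken from the `echo 0` command above, not from a documented EvalPerf requirement.

```python
from pathlib import Path

PARANOID_PATH = "/proc/sys/kernel/perf_event_paranoid"  # Linux only


def read_perf_paranoid(path=PARANOID_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def perf_is_enabled(value, required=0):
    """True if the setting is at or below the level written by the command above."""
    return value is not None and value <= required


if __name__ == "__main__":
    level = read_perf_paranoid()
    if perf_is_enabled(level):
        print(f"perf_event_paranoid={level}: ready for EvalPerf")
    else:
        print(f"perf_event_paranoid={level}: run the sudo command above first")
```

The value written with `echo` does not survive a reboot, so a check like this is worth repeating at the start of each session.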
Why Use EvalPlus?
EvalPlus stands out in three aspects of its evaluation of LLM-generated code:
- Precise Evaluation: With significantly expanded test suites (HumanEval+, MBPP+), EvalPlus catches incorrect solutions that the original tests miss, giving a more accurate picture of what LLMs can do. The latest LLM rankings are on its official leaderboard.
- Coding Rigorousness: By observing score differences before and after using EvalPlus's rigorous tests, it's possible to determine how robust the generated code is. A smaller drop indicates greater rigorousness, while a larger drop suggests more fragile code.
- Code Efficiency: Beyond correctness, EvalPerf evaluates the efficiency of LLM-generated code using performance-exercising coding tasks and test inputs, showing whether the code is not only correct but also efficient.
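The score drop described under Coding Rigorousness is simple arithmetic on two pass rates: one measured with the original tests and one with the extended tests. The sketch below is illustrative only. The function name `score_drop` is hypothetical, and the scores are made-up numbers, not leaderboard results.

```python
def score_drop(base_pass_at_1, plus_pass_at_1):
    """Return (absolute drop in points, relative drop as a fraction of the base score)."""
    if not 0 < base_pass_at_1 <= 100:
        raise ValueError("base score must be in (0, 100]")
    absolute = base_pass_at_1 - plus_pass_at_1
    return absolute, absolute / base_pass_at_1


if __name__ == "__main__":
    # Made-up (base, plus) scores for two hypothetical models, NOT real results.
    scores = {"model-a": (80.0, 76.0), "model-b": (80.0, 64.0)}
    for name, (base, plus) in scores.items():
        absolute, relative = score_drop(base, plus)
        print(f"{name}: -{absolute:.1f} points ({relative:.0%} of base)")
```

In this example both models score the same on the original tests, but model-b loses four times as many points on the extended tests, which would suggest its solutions are more fragile.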
Links
- GitHub Repository: https://github.com/evalplus/evalplus
- Official Leaderboard: https://evalplus.github.io/leaderboard.html
- EvalPlus Paper (NeurIPS'23): https://openreview.net/forum?id=1qvx610Cu7
- EvalPerf Paper (COLM'24): https://openreview.net/forum?id=IBCBMeAhmC
- Hugging Face: https://huggingface.co/evalplus/
- PyPI Package: https://pypi.org/project/evalplus/
- Docker Image: https://hub.docker.com/r/ganler/evalplus