EvalPlus: Rigorous Evaluation for LLM-Synthesized Code

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

EvalPlus is a framework for rigorously evaluating code generated by Large Language Models (LLMs). It extends the HumanEval and MBPP benchmarks with many more tests per task (HumanEval+ and MBPP+) to assess correctness more precisely, and it adds EvalPerf to assess efficiency. It is aimed at developers and researchers who need to validate LLM-synthesized code thoroughly.

Repository Information

Analyzed by OSRepos on June 30, 2026

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

EvalPlus is a framework for rigorously evaluating code generated by Large Language Models (LLMs). With work presented at NeurIPS 2023 and COLM 2024, it addresses the need to assess both the correctness and the efficiency of LLM-synthesized code. The project expands the existing HumanEval and MBPP benchmarks into HumanEval+ (80x more tests) and MBPP+ (35x more tests), and it adds EvalPerf for evaluating code efficiency.
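
To see why a larger test suite matters, consider the toy sketch below. It is not taken from HumanEval+ or MBPP+: the task, the `buggy_median` function, and both test lists are hypothetical, chosen only to show how a solution can pass a few base tests yet fail once more inputs are added.

```python
def buggy_median(values):
    """Hypothetical model output: forgets to handle even-length inputs."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def passes(solution, tests):
    """Return True if the solution matches the expected output on every test."""
    return all(solution(list(args)) == expected for args, expected in tests)


# Hypothetical "base" suite: only odd-length inputs, so the bug stays hidden.
base_tests = [([3, 1, 2], 2), ([5], 5)]

# Hypothetical "plus" suite: the base tests plus extra inputs with even lengths.
plus_tests = base_tests + [([1, 2, 3, 4], 2.5), ([10, 20], 15.0)]

print("passes base tests:", passes(buggy_median, base_tests))
print("passes plus tests:", passes(buggy_median, plus_tests))
```

The same solution is judged correct by the small suite and incorrect by the extended one, which is the kind of gap the expanded benchmarks are meant to expose.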

Installation

To get started with EvalPlus for code correctness evaluation, you can install it via pip:

pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

To include code efficiency evaluation with EvalPerf, use:

pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
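
After installing, one way to confirm that the package is visible to the current Python environment is to query its installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `evalplus` is taken from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="evalplus"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print("evalplus version:", found if found else "not installed")
```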

Examples

Code Correctness Evaluation (HumanEval(+) or MBPP(+))

Generate solutions with a model and evaluate them on HumanEval(+) or MBPP(+). Pass exactly one of the two dataset names to --dataset:

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy
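
To run the same evaluation on both datasets, the command can be assembled in a short script. The sketch below uses only the flags shown above; the helper names `build_command` and `run_all` are hypothetical, and restricting `dataset` to the two names in the example is an assumption of this sketch, not a statement about everything the CLI accepts. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

MODEL = "ise-uiuc/Magicoder-S-DS-6.7B"  # model ID from the example above


def build_command(dataset, model=MODEL):
    """Build the evalplus.evaluate argument list for one dataset."""
    if dataset not in ("humaneval", "mbpp"):
        raise ValueError(f"unsupported dataset: {dataset!r}")
    return [
        "evalplus.evaluate",
        "--model", model,
        "--dataset", dataset,
        "--backend", "vllm",
        "--greedy",
    ]


def run_all(dry_run=True):
    """Print (and optionally execute) the command for each dataset."""
    for dataset in ("humaneval", "mbpp"):
        command = build_command(dataset)
        print(shlex.join(command))
        if not dry_run:
            # Requires evalplus to be installed and on PATH.
            subprocess.run(command, check=True)


run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` would execute each command in turn and stop at the first failure, since `check=True` raises on a non-zero exit code.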

Code Efficiency Evaluation (EvalPerf)

To evaluate the efficiency of generated code, first set the Linux kernel parameter perf_event_paranoid to 0 so that perf events are accessible, then run the evaluation:

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
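
Before launching EvalPerf, it can help to check the current kernel setting. The sketch below reads the same file that the command above writes to. The helper names are hypothetical, and treating any level at or below 0 as sufficient is an assumption based on the value used above.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def parse_paranoid(text):
    """Parse the integer level stored in the perf_event_paranoid file."""
    return int(text.strip())


def perf_is_enabled(level, required=0):
    """True if the level is at or below the value set in the command above."""
    return level <= required


try:
    level = parse_paranoid(PARANOID_PATH.read_text())
    print(f"perf_event_paranoid = {level}; ready: {perf_is_enabled(level)}")
except OSError:
    # The file only exists on Linux; elsewhere we just report that.
    print("perf_event_paranoid is not readable on this system")
```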

Why Use EvalPlus?

EvalPlus stands out for three aspects of evaluating LLM-generated code:

  • Precise Evaluation: With greatly expanded test suites (HumanEval+, MBPP+), EvalPlus gives a more detailed and accurate assessment of code correctness than the original benchmarks, and so a more realistic picture of what an LLM can do. You can view the latest LLM rankings on its official leaderboard.
  • Coding Rigor: Comparing a model's score on the original tests with its score on EvalPlus's extended tests shows how robust its generated code is. A smaller drop indicates more rigorous code, while a larger drop suggests more fragile code.
  • Code Efficiency: Beyond correctness, EvalPerf evaluates the efficiency of LLM-generated code using performance-exercising coding tasks and test inputs, so it measures how well optimized the code is as well as whether it is correct.
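
The score comparison described in the list above comes down to simple arithmetic. The following sketch computes the absolute and relative drop between a score on the original tests and a score on the extended tests. The function name, the model names, and all numbers are hypothetical; it also assumes the extended score never exceeds the base score, since the extended suites add tests on top of the originals.

```python
def score_drop(base_pass_rate, plus_pass_rate):
    """Return (absolute drop in points, relative drop as a fraction of base)."""
    if not 0 <= plus_pass_rate <= base_pass_rate <= 100:
        raise ValueError("expected 0 <= plus <= base <= 100")
    absolute = base_pass_rate - plus_pass_rate
    relative = absolute / base_pass_rate if base_pass_rate else 0.0
    return absolute, relative


# Hypothetical pass rates in percent, for illustration only.
models = {"model_a": (80.0, 76.0), "model_b": (80.0, 60.0)}

for name, (base, plus) in models.items():
    absolute, relative = score_drop(base, plus)
    print(f"{name}: drop {absolute:.1f} points ({relative:.1%} of base score)")
```

In this made-up comparison both models score the same on the original tests, but `model_b` loses five times as many points on the extended tests, which would suggest its solutions are more fragile.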

Links

Source repository: https://github.com/evalplus/evalplus