EvalPlus: Rigorous Evaluation for LLM-Synthesized Code

Introduction

EvalPlus is a rigorous framework for evaluating code generated by Large Language Models (LLMs). With work presented at NeurIPS 2023 and COLM 2024, it assesses both the correctness and the efficiency of AI-synthesized code. The project extends the existing HumanEval and MBPP benchmarks into HumanEval+ (80x more tests) and MBPP+ (35x more tests), and adds EvalPerf for evaluating code efficiency.

Installation

To get started with EvalPlus for code correctness evaluation, you can install it via pip:

pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

To include code efficiency evaluation with EvalPerf, use:

pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

Examples

Code Correctness Evaluation (HumanEval(+) or MBPP(+))

Evaluate a model on HumanEval(+) or MBPP(+). Replace the bracketed placeholder with either humaneval or mbpp:

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy
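
To run both datasets in sequence, the command above can be assembled programmatically. The following is a minimal sketch, not part of EvalPlus: build_evaluate_command is a hypothetical helper, and it uses only the flags shown above (--model, --dataset, --backend, --greedy).

```python
import shlex

# The two dataset names accepted in the command shown above.
DATASETS = ("humaneval", "mbpp")


def build_evaluate_command(model, dataset, backend="vllm", greedy=True):
    """Return the argv list for one evalplus.evaluate run (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    argv = [
        "evalplus.evaluate",
        "--model", model,
        "--dataset", dataset,
        "--backend", backend,
    ]
    if greedy:
        argv.append("--greedy")
    return argv


MODEL = "ise-uiuc/Magicoder-S-DS-6.7B"
commands = [build_evaluate_command(MODEL, name) for name in DATASETS]

# Print shell-quoted command lines; pass each argv to subprocess.run to execute.
for argv in commands:
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv lists can be handed to subprocess.run(argv, check=True) once EvalPlus is installed.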

Code Efficiency Evaluation (EvalPerf)

To evaluate the efficiency of generated code, first set the kernel's perf_event_paranoid level to 0 so that performance counters are accessible, then run the evaluation:

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
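
Before starting a long run, it can be worth confirming that the kernel setting took effect. The sketch below is not part of EvalPlus: perf_is_enabled is a hypothetical helper, and treating any level of 0 or lower as sufficient is an assumption based on the command above, which sets the level to 0.

```python
from pathlib import Path

# Linux exposes the current paranoid level at this path.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def perf_is_enabled(path=PARANOID_PATH, max_level=0):
    """Return True if the paranoid level stored at `path` is <= `max_level`.

    Assumption: a level of 0 or lower is enough for EvalPerf, matching the
    `echo 0` command above. Returns False if the file is missing or unreadable.
    """
    try:
        level = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return False
    return level <= max_level


if not perf_is_enabled():
    print("perf_event_paranoid is above 0 or unreadable; see the command above.")
```

The path parameter exists so that the check can be exercised against a temporary file on systems without /proc.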

Why Use EvalPlus?

EvalPlus strengthens the evaluation of LLM-generated code in three ways:

Precise Evaluation: With significantly expanded test suites (HumanEval+, MBPP+), EvalPlus catches incorrect solutions that the original tests miss, giving a more accurate picture of what an LLM can do. You can view the latest LLM rankings on its official leaderboard.
Coding Rigorousness: Comparing a model's score before and after applying EvalPlus's additional tests shows how robust its generated code is. A smaller drop indicates greater rigorousness, while a larger drop suggests more fragile code.
Code Efficiency: Beyond correctness, EvalPerf evaluates the efficiency of LLM-generated code using performance-exercising coding tasks and test inputs, so that code is judged on how fast it runs as well as whether it is correct.
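
The score drop described under Coding Rigorousness can be computed directly. This is a minimal sketch: score_drop is a hypothetical helper, and the scores below are made-up placeholders, not results for any real model.

```python
def score_drop(base_score, plus_score):
    """Return (absolute_drop, relative_drop) between two scores.

    `base_score` is the score on the original tests (e.g. HumanEval) and
    `plus_score` is the score on the expanded tests (e.g. HumanEval+),
    both in the same unit, such as percentage points.
    """
    if base_score <= 0:
        raise ValueError("base_score must be positive to compute a relative drop")
    absolute = base_score - plus_score
    relative = absolute / base_score
    return absolute, relative


# Hypothetical scores for two imaginary models (illustrative only).
models = {
    "model_a": (80.0, 72.0),
    "model_b": (80.0, 60.0),
}

for name, (base, plus) in models.items():
    absolute, relative = score_drop(base, plus)
    print(f"{name}: drop of {absolute:.1f} points ({relative:.1%} of base score)")
```

In this made-up comparison both models share the same base score, but model_a loses less under the expanded tests, which by the criterion above suggests more rigorous code.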
