{"name":"EvalPlus: Rigorous Evaluation for LLM-Synthesized Code","description":"EvalPlus is a robust framework designed for the rigorous evaluation of code generated by Large Language Models (LLMs). It extends standard benchmarks like HumanEval and MBPP with significantly more tests, offering precise assessment of code correctness and efficiency. This tool is crucial for developers and researchers aiming to thoroughly validate LLM-synthesized code.","github":"https://github.com/evalplus/evalplus","url":"https://osrepos.com/repo/evalplus-evalplus","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/evalplus-evalplus","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/evalplus-evalplus.md","json":"https://osrepos.com/repo/evalplus-evalplus.json","topics":["benchmark","large-language-models","program-synthesis","code evaluation","testing","Python","AI","efficiency"],"keywords":["benchmark","large-language-models","program-synthesis","code evaluation","testing","Python","AI","efficiency"],"stars":null,"summary":"EvalPlus is a robust framework designed for the rigorous evaluation of code generated by Large Language Models (LLMs). It extends standard benchmarks like HumanEval and MBPP with significantly more tests, offering precise assessment of code correctness and efficiency. This tool is crucial for developers and researchers aiming to thoroughly validate LLM-synthesized code.","content":"## Introduction\n\nEvalPlus is a comprehensive and rigorous framework for evaluating code generated by Large Language Models (LLMs). Backed by papers published at NeurIPS 2023 (EvalPlus) and COLM 2024 (EvalPerf), it addresses the need to assess both the correctness and the efficiency of AI-synthesized code. 
The project expands existing benchmarks such as HumanEval and MBPP, introducing HumanEval+ (80x more tests) and MBPP+ (35x more tests), and adds EvalPerf for evaluating code efficiency.\n\n## Installation\n\nTo get started with EvalPlus for code correctness evaluation, install it via pip:\n\n```bash\npip install --upgrade \"evalplus[vllm] @ git+https://github.com/evalplus/evalplus\"\n# Or `pip install \"evalplus[vllm]\" --upgrade` for the latest stable release\n```\n\nTo include code efficiency evaluation with EvalPerf, use:\n\n```bash\npip install --upgrade \"evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus\"\n# Or `pip install \"evalplus[perf,vllm]\" --upgrade` for the latest stable release\n```\n\n## Examples\n\n### Code Correctness Evaluation (HumanEval(+) or MBPP(+))\n\nEvaluate a model on the HumanEval(+) or MBPP(+) dataset:\n\n```bash\nevalplus.evaluate --model \"ise-uiuc/Magicoder-S-DS-6.7B\" \\\n                  --dataset [humaneval|mbpp]             \\\n                  --backend vllm                         \\\n                  --greedy\n```\n\n### Code Efficiency Evaluation (EvalPerf)\n\nTo evaluate the efficiency of generated code, first set the kernel's `perf_event_paranoid` value to 0 so that performance counters are accessible, then run the evaluation:\n\n```bash\nsudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf\nevalplus.evalperf --model \"ise-uiuc/Magicoder-S-DS-6.7B\" --backend vllm\n```\n\n## Why Use EvalPlus?\n\nEvalPlus offers rigorous evaluation of LLM-generated code and stands out for:\n\n*   **Precise Evaluation**: With significantly expanded test suites (HumanEval+, MBPP+), EvalPlus provides a more detailed and accurate assessment of code correctness, giving a truer picture of what LLMs can do. 
You can view the latest LLM rankings on its [official leaderboard](https://evalplus.github.io/leaderboard.html).\n*   **Coding Rigorousness**: Comparing a model's score on the original tests with its score on EvalPlus's extended tests shows how robust its generated code is. A smaller drop indicates greater rigorousness, while a larger drop suggests more fragile code.\n*   **Code Efficiency**: Beyond correctness, EvalPerf evaluates the efficiency of LLM-generated code using performance-exercising coding tasks and test inputs, so code is judged on performance as well as correctness.\n\n## Links\n\n*   **GitHub Repository**: [https://github.com/evalplus/evalplus](https://github.com/evalplus/evalplus)\n*   **Official Leaderboard**: [https://evalplus.github.io/leaderboard.html](https://evalplus.github.io/leaderboard.html)\n*   **EvalPlus Paper (NeurIPS'23)**: [https://openreview.net/forum?id=1qvx610Cu7](https://openreview.net/forum?id=1qvx610Cu7)\n*   **EvalPerf Paper (COLM'24)**: [https://openreview.net/forum?id=IBCBMeAhmC](https://openreview.net/forum?id=IBCBMeAhmC)\n*   **Hugging Face**: [https://huggingface.co/evalplus/](https://huggingface.co/evalplus/)\n*   **PyPI Package**: [https://pypi.org/project/evalplus/](https://pypi.org/project/evalplus/)\n*   **Docker Image**: [https://hub.docker.com/r/ganler/evalplus](https://hub.docker.com/r/ganler/evalplus)","metrics":{"detailViews":1,"githubClicks":0},"dates":{"published":null,"modified":"2026-06-30T16:19:36.000Z"}}