{"name":"PromptBench: A Unified Framework for LLM Evaluation and Robustness","description":"PromptBench is a comprehensive Python library designed for the evaluation and understanding of Large Language Models (LLMs). It provides a unified framework for assessing model performance, exploring various prompt engineering techniques, and evaluating robustness against adversarial attacks. This tool empowers researchers to conduct in-depth analyses of LLMs across diverse datasets and models.","github":"https://github.com/microsoft/promptbench","url":"https://osrepos.com/repo/microsoft-promptbench","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/microsoft-promptbench","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/microsoft-promptbench.md","json":"https://osrepos.com/repo/microsoft-promptbench.json","topics":["large-language-models","LLM Evaluation","prompt-engineering","adversarial-attacks","benchmark","Python","robustness","AI"],"keywords":["large-language-models","LLM Evaluation","prompt-engineering","adversarial-attacks","benchmark","Python","robustness","AI"],"stars":null,"summary":"PromptBench is a comprehensive Python library designed for the evaluation and understanding of Large Language Models (LLMs). It provides a unified framework for assessing model performance, exploring various prompt engineering techniques, and evaluating robustness against adversarial attacks. This tool empowers researchers to conduct in-depth analyses of LLMs across diverse datasets and models.","content":"## Introduction\nPromptBench is a powerful, PyTorch-based Python package developed by Microsoft for the comprehensive evaluation and understanding of Large Language Models (LLMs). 
Described as a unified evaluation framework, it provides researchers with user-friendly APIs to conduct in-depth analyses of LLMs. The project aims to offer quick model performance assessment, facilitate prompt engineering research, evaluate robustness against adversarial prompts, and integrate dynamic evaluation methods to mitigate data contamination. It also supports efficient multi-prompt evaluation and a wide array of language and multi-modal datasets and models. For more technical details, refer to its [technical report](https://arxiv.org/abs/2312.07910).\n\n## Installation\nGetting started with PromptBench is straightforward.\n\n### Install via `pip`:\nFor a quick setup, you can install the package directly using pip:\n```sh\npip install promptbench\n```\n\nNote that the pip installation might be slightly behind the latest updates.\n\n### Install via GitHub:\nFor the most recent features or development, clone the repository and install from source:\n```sh\ngit clone https://github.com/microsoftarchive/promptbench.git\ncd promptbench\n```\n\nThen, create a conda environment and install the required packages:\n```sh\nconda create --name promptbench python=3.9\nconda activate promptbench\npip install -r requirements.txt\n```\n\nFor prompt attacks, you will also need to install [TextAttack](https://github.com/QData/TextAttack).\n\n## Examples\nPromptBench is designed to be easy to use and extend. 
After installation, you can import it:\n```python\nimport promptbench as pb\n```\n\nThe repository provides several tutorials to help you get familiar with its functionalities:\n*   **Evaluate models on existing benchmarks:** Refer to `examples/basic.ipynb` for constructing your evaluation pipeline, and `examples/multimodal.ipynb` for multi-modal evaluations.\n*   **Test the effects of different prompting techniques.**\n*   **Examine robustness against prompt attacks:** See `examples/prompt_attack.ipynb` for constructing attacks.\n*   **Use DyVal for evaluation:** Refer to `examples/dyval.ipynb` for constructing DyVal datasets.\n*   **Efficient multi-prompt evaluation using PromptEval:** Check `examples/efficient_multi_prompt_eval.ipynb`.\n\n## Why Use PromptBench\nPromptBench stands out as an essential tool for anyone working with Large Language Models due to several key advantages:\n*   **Unified Framework:** It offers a single, consistent API for various evaluation tasks, simplifying research workflows.\n*   **Comprehensive Evaluation:** Supports standard, dynamic (DyVal), and semantic evaluation protocols, along with benchmark results and visualization analysis.\n*   **Advanced Prompt Engineering:** Integrates popular techniques like Chain-of-Thought, EmotionPrompt, and Expert Prompting, allowing for in-depth analysis of their effects.\n*   **Robustness Assessment:** Provides tools to simulate and evaluate black-box adversarial prompt attacks, crucial for understanding model vulnerabilities.\n*   **Broad Model and Dataset Support:** Compatible with a wide range of open-source and proprietary language and multi-modal models, as well as numerous datasets including GLUE, MMLU, Big-Bench Hard, VQAv2, and MMMU.\n*   **Efficiency:** Includes methods like PromptEval for efficient multi-prompt evaluation, significantly reducing the data required for accurate performance prediction.\n*   **Active Development:** Continuously updated with support for new models (e.g., GPT-4o, Gemini, 
Mistral) and datasets, ensuring it remains at the forefront of LLM evaluation.\n\n## Links\n*   **GitHub Repository:** [https://github.com/microsoftarchive/promptbench](https://github.com/microsoftarchive/promptbench)\n*   **Technical Report (Paper):** [https://arxiv.org/abs/2312.07910](https://arxiv.org/abs/2312.07910)\n*   **Documentation:** [https://promptbench.readthedocs.io/en/latest/](https://promptbench.readthedocs.io/en/latest/)\n*   **Leaderboard:** [https://llm-eval.github.io/pages/leaderboard.html](https://llm-eval.github.io/pages/leaderboard.html)","metrics":{"detailViews":3,"githubClicks":1},"dates":{"published":null,"modified":"2026-06-30T23:43:57.000Z"}}