# PromptBench: A Unified Framework for LLM Evaluation and Robustness

This repository profile is provided by osrepos.com, an open source repository discovery platform.

PromptBench is a comprehensive Python library designed for the evaluation and understanding of Large Language Models (LLMs). It provides a unified framework for assessing model performance, exploring various prompt engineering techniques, and evaluating robustness against adversarial attacks. This tool empowers researchers to conduct in-depth analyses of LLMs across diverse datasets and models.

GitHub: https://github.com/microsoft/promptbench
OSRepos URL: https://osrepos.com/repo/microsoft-promptbench

## Topics

- large-language-models
- LLM Evaluation
- prompt-engineering
- adversarial-attacks
- benchmark
- Python
- robustness
- AI

## Repository Information

Last analyzed by OSRepos: Wed Jul 01 2026 00:43:57 GMT+0100 (Western European Summer Time)

## Introduction
PromptBench is a PyTorch-based Python package developed by Microsoft for evaluating and understanding Large Language Models (LLMs). Described as a unified evaluation framework, it gives researchers user-friendly APIs for four main tasks: quick assessment of model performance, prompt engineering research, robustness evaluation against adversarial prompts, and dynamic evaluation that mitigates data contamination. It also supports efficient multi-prompt evaluation and a range of language and multi-modal datasets and models. For more technical details, refer to its [technical report](https://arxiv.org/abs/2312.07910).

## Installation
PromptBench can be installed from PyPI or from source.

### Install via `pip`:
For a quick setup, you can install the package directly using pip:
```sh
pip install promptbench
```

Note that the PyPI release may lag slightly behind the latest changes in the repository.

### Install via GitHub:
For the most recent features or development, clone the repository and install from source:
```sh
git clone https://github.com/microsoftarchive/promptbench.git
cd promptbench
```

Then, create a conda environment and install the required packages:
```sh
conda create --name promptbench python=3.9
conda activate promptbench
pip install -r requirements.txt
```

For prompt attacks, you will also need to install [TextAttack](https://github.com/QData/TextAttack).

## Examples
PromptBench is designed to be easy to use and extend. After installation, you can import it:
```python
import promptbench as pb
```

The repository provides several tutorials to help you get familiar with its functionalities:
*   **Evaluate models on existing benchmarks:** Refer to `examples/basic.ipynb` for constructing your evaluation pipeline, and `examples/multimodal.ipynb` for multi-modal evaluations.
*   **Test the effects of different prompting techniques.**
*   **Examine robustness for prompt attacks:** See `examples/prompt_attack.ipynb` for constructing attacks.
*   **Use DyVal for evaluation:** Refer to `examples/dyval.ipynb` for constructing DyVal datasets.
*   **Efficient multi-prompt evaluation using PromptEval:** Check `examples/efficient_multi_prompt_eval.ipynb`.
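
As a rough illustration of what an evaluation pipeline of this kind involves, the sketch below fills a prompt template with each sample, queries a model, maps the raw text output to a label, and computes accuracy per prompt. It is plain Python and deliberately does not call the PromptBench API: `stub_model`, `project_label`, `evaluate_prompt`, and the toy samples are all hypothetical stand-ins, so treat it as a picture of the general flow rather than the library's implementation. The notebooks listed above show the real calls.

```python
from typing import Callable, Dict, List

# Toy labelled samples standing in for a sentiment benchmark (hypothetical data).
SAMPLES: List[Dict] = [
    {"content": "a wonderful, moving film", "label": 1},
    {"content": "dull and far too long", "label": 0},
    {"content": "great acting and a great script", "label": 1},
    {"content": "a boring mess", "label": 0},
]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: keyword-based sentiment."""
    text = prompt.lower()
    positive_words = ("wonderful", "great", "moving")
    return "positive" if any(w in text for w in positive_words) else "negative"


def project_label(raw_output: str) -> int:
    """Map raw model text to a class id; -1 marks an unparseable answer."""
    mapping = {"positive": 1, "negative": 0}
    return mapping.get(raw_output.strip().lower(), -1)


def evaluate_prompt(
    template: str, samples: List[Dict], model: Callable[[str], str]
) -> float:
    """Return the accuracy of `model` on `samples` for one prompt template."""
    correct = 0
    for sample in samples:
        input_text = template.format(content=sample["content"])
        pred = project_label(model(input_text))
        correct += int(pred == sample["label"])
    return correct / len(samples)


# Several prompts are scored side by side to compare their effect.
PROMPTS = [
    "Classify the sentence as positive or negative: {content}",
    "Determine the emotion of the following sentence as positive or negative: {content}",
]

scores = {p: evaluate_prompt(p, SAMPLES, stub_model) for p in PROMPTS}
for prompt, score in scores.items():
    print(f"{score:.3f}  {prompt}")
```

Replacing `stub_model` with a call to a real model and `SAMPLES` with a loaded benchmark keeps the loop unchanged, which is the main reason to separate input formatting, inference, and output projection into distinct steps.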

## Why Use PromptBench
PromptBench offers several advantages for work on Large Language Model evaluation:
*   **Unified Framework:** It offers a single, consistent API for various evaluation tasks, simplifying research workflows.
*   **Comprehensive Evaluation:** Supports standard, dynamic (DyVal), and semantic evaluation protocols, along with benchmark results and visualization analysis.
*   **Advanced Prompt Engineering:** Integrates popular techniques like Chain-of-Thought, EmotionPrompt, and Expert Prompting, allowing for in-depth analysis of their effects.
*   **Robustness Assessment:** Provides tools to simulate and evaluate black-box adversarial prompt attacks, crucial for understanding model vulnerabilities.
*   **Broad Model and Dataset Support:** Compatible with a wide range of open-source and proprietary language and multi-modal models, as well as numerous datasets including GLUE, MMLU, Big-Bench Hard, VQAv2, and MMMU.
*   **Efficiency:** Includes methods like PromptEval for efficient multi-prompt evaluation, significantly reducing the data required for accurate performance prediction.
*   **Active Development:** Updated with support for new models (e.g., GPT-4o, Gemini, Mistral) and datasets.
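
To make the idea of a black-box adversarial prompt attack more concrete, here is a toy sketch under stated assumptions: the attacker can only query the model, perturbs the instruction by swapping one pair of adjacent letters, and keeps the variant that lowers accuracy the most. It does not use PromptBench or TextAttack; `stub_model`, `accuracy`, `swap_candidates`, and `attack` are hypothetical names, and the stub is intentionally brittle so that the effect is visible.

```python
from typing import List, Tuple

# Toy (text, label) pairs; hypothetical data for illustration only.
SAMPLES = [
    ("a wonderful film", 1),
    ("a boring mess", 0),
    ("great acting", 1),
    ("dull and slow", 0),
]


def stub_model(prompt: str) -> str:
    """Hypothetical brittle model: it only understands the task if the
    instruction contains the exact word 'sentiment'."""
    instruction, _, content = prompt.partition(":")
    if "sentiment" not in instruction.lower():
        return "unknown"
    positive = any(w in content for w in ("wonderful", "great"))
    return "positive" if positive else "negative"


def accuracy(instruction: str) -> float:
    """Accuracy of the stub model on SAMPLES for a given instruction."""
    hits = 0
    for content, label in SAMPLES:
        raw = stub_model(f"{instruction}: {content}")
        pred = {"positive": 1, "negative": 0}.get(raw, -1)
        hits += int(pred == label)
    return hits / len(SAMPLES)


def swap_candidates(text: str) -> List[str]:
    """All variants of `text` with one pair of adjacent, distinct letters swapped."""
    out = []
    for i in range(len(text) - 1):
        a, b = text[i], text[i + 1]
        if a.isalpha() and b.isalpha() and a != b:
            out.append(text[:i] + b + a + text[i + 2:])
    return out


def attack(instruction: str) -> Tuple[str, float]:
    """Black-box search: score each candidate by querying the model only,
    and keep the first one that achieves the lowest accuracy."""
    best, best_acc = instruction, accuracy(instruction)
    for cand in swap_candidates(instruction):
        acc = accuracy(cand)
        if acc < best_acc:
            best, best_acc = cand, acc
    return best, best_acc


clean = "Classify the sentiment"
clean_acc = accuracy(clean)
adv, adv_acc = attack(clean)
print(f"clean: {clean_acc:.2f} | adversarial: {adv_acc:.2f} | prompt: {adv!r}")
```

Comparing `clean_acc` with `adv_acc` gives a simple robustness measure: the larger the drop, the more sensitive the model is to small changes in prompt wording. Real attacks typically constrain perturbations so that the prompt stays readable to a human, which this sketch does not attempt.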

## Links
*   **GitHub Repository:** [https://github.com/microsoftarchive/promptbench](https://github.com/microsoftarchive/promptbench)
*   **Technical Report (Paper):** [https://arxiv.org/abs/2312.07910](https://arxiv.org/abs/2312.07910)
*   **Documentation:** [https://promptbench.readthedocs.io/en/latest/](https://promptbench.readthedocs.io/en/latest/)
*   **Leaderboard:** [https://llm-eval.github.io/pages/leaderboard.html](https://llm-eval.github.io/pages/leaderboard.html)