PromptBench: A Unified Framework for LLM Evaluation and Robustness
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
PromptBench is a comprehensive Python library designed for the evaluation and understanding of Large Language Models (LLMs). It provides a unified framework for assessing model performance, exploring various prompt engineering techniques, and evaluating robustness against adversarial attacks. This tool empowers researchers to conduct in-depth analyses of LLMs across diverse datasets and models.
Introduction
PromptBench is a powerful, PyTorch-based Python package developed by Microsoft for the comprehensive evaluation and understanding of Large Language Models (LLMs). Described as a unified evaluation framework, it provides researchers with user-friendly APIs to conduct in-depth analyses of LLMs. The project aims to offer quick model performance assessment, facilitate prompt engineering research, evaluate robustness against adversarial prompts, and integrate dynamic evaluation methods to mitigate data contamination. It also supports efficient multi-prompt evaluation and a wide array of language and multi-modal datasets and models. For more technical details, refer to its technical report.
Installation
Getting started with PromptBench is straightforward.
Install via pip:
For a quick setup, you can install the package directly using pip:
pip install promptbench
Note that the release on PyPI may lag slightly behind the latest changes in the repository.
Install via GitHub:
For the most recent features or development, clone the repository and install from source:
git clone https://github.com/microsoftarchive/promptbench.git
cd promptbench
Then, create a conda environment and install the required packages:
conda create --name promptbench python=3.9
conda activate promptbench
pip install -r requirements.txt
For prompt attacks, you will also need to install TextAttack.
Examples
PromptBench is designed to be easy to use and extend. After installation, you can import it:
import promptbench as pb
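Before working through the tutorials, it can help to confirm that the package is importable in the active environment. The following is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is hypothetical and is not part of PromptBench.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found on the import path."""
    # find_spec returns None when no top-level package with this name exists.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("promptbench"):
        print("promptbench is importable")
    else:
        print("promptbench not found; install it with: pip install promptbench")
```

Running this inside the conda environment created above shows whether the install step succeeded without importing the package's heavier dependencies.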
The repository provides several tutorials to help you get familiar with its functionalities:
- Evaluate models on existing benchmarks: Refer to examples/basic.ipynb for constructing your evaluation pipeline, and examples/multimodal.ipynb for multi-modal evaluations.
- Test the effects of different prompting techniques.
- Examine robustness to prompt attacks: See examples/prompt_attack.ipynb for constructing attacks.
- Use DyVal for evaluation: Refer to examples/dyval.ipynb for constructing DyVal datasets.
- Efficient multi-prompt evaluation using PromptEval: Check examples/efficient_multi_prompt_eval.ipynb.
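To make the idea of an evaluation pipeline concrete, here is a small self-contained sketch of the general pattern: format each example into a prompt, call a model, map the raw output to a label, and compute accuracy per prompt. It deliberately does not use PromptBench's API. The dataset, `toy_model`, `project`, and `evaluate` are all hypothetical stand-ins chosen for illustration, and the keyword rule only imitates an LLM call.

```python
from typing import Callable, Dict, List

# Tiny hand-written sentiment dataset (illustrative only, not a real benchmark).
DATASET: List[Dict[str, object]] = [
    {"content": "I loved this film", "label": 1},
    {"content": "A great, moving story", "label": 1},
    {"content": "This was a terrible waste of time", "label": 0},
    {"content": "Boring and awful", "label": 0},
]

# Several prompt templates to compare; {content} is filled in per example.
PROMPTS = [
    "Classify the sentence as positive or negative: {content}",
    "Determine the emotion of the following sentence as positive or negative: {content}",
]


def toy_model(input_text: str) -> str:
    """Stand-in for an LLM call: a keyword rule that returns a label word."""
    negative_words = ("terrible", "awful", "boring", "waste")
    # Look only at the text after the first colon, so the instruction's own
    # wording ("positive or negative") does not influence the answer.
    sentence = input_text.split(":", 1)[-1].lower()
    return "negative" if any(w in sentence for w in negative_words) else "positive"


def project(raw_pred: str) -> int:
    """Map the model's raw text output to a class id; -1 marks an unparseable answer."""
    mapping = {"positive": 1, "negative": 0}
    return mapping.get(raw_pred.strip().lower(), -1)


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that equal the gold label."""
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def evaluate(prompt: str, model: Callable[[str], str], dataset: List[Dict[str, object]]) -> float:
    """Run one prompt template over the whole dataset and return its accuracy."""
    preds = [project(model(prompt.format(content=row["content"]))) for row in dataset]
    labels = [int(row["label"]) for row in dataset]
    return accuracy(preds, labels)


scores = {prompt: evaluate(prompt, toy_model, DATASET) for prompt in PROMPTS}
for prompt, score in scores.items():
    print(f"{score:.3f}  {prompt}")
```

Keeping the output projection separate from the model call means the same loop works when the model returns free-form text that needs normalizing before scoring.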
Why Use PromptBench
PromptBench offers several advantages for anyone evaluating Large Language Models:
- Unified Framework: It offers a single, consistent API for various evaluation tasks, simplifying research workflows.
- Comprehensive Evaluation: Supports standard, dynamic (DyVal), and semantic evaluation protocols, along with benchmark results and visualization analysis.
- Advanced Prompt Engineering: Integrates popular techniques like Chain-of-Thought, EmotionPrompt, and Expert Prompting, allowing for in-depth analysis of their effects.
- Robustness Assessment: Provides tools to simulate and evaluate black-box adversarial prompt attacks, crucial for understanding model vulnerabilities.
- Broad Model and Dataset Support: Compatible with a wide range of open-source and proprietary language and multi-modal models, as well as numerous datasets including GLUE, MMLU, Big-Bench Hard, VQAv2, and MMMU.
- Efficiency: Includes methods like PromptEval for efficient multi-prompt evaluation, significantly reducing the data required for accurate performance prediction.
- Model and Dataset Updates: Support has been added over time for newer models (e.g., GPT-4o, Gemini, Mistral) and datasets.
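The robustness assessment mentioned in the list above can be illustrated with a toy experiment: perturb the instruction at the character level, re-run the evaluation, and compare accuracy before and after. This is a sketch under stated assumptions, not PromptBench's attack implementation. `brittle_model`, `swap_adjacent`, and `relative_drop` are hypothetical names, the model is a keyword rule built to be sensitive to the prompt, and the relative-drop figure is just one simple way to summarize the effect.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (sentence, label), where 1 = positive and 0 = negative

DATA: List[Example] = [
    ("I loved this film", 1),
    ("Boring and awful", 0),
    ("A great, moving story", 1),
    ("This was a terrible waste of time", 0),
]


def swap_adjacent(text: str, index: int) -> str:
    """Character-level perturbation: swap the characters at index and index + 1."""
    if index < 0 or index + 1 >= len(text):
        return text  # nothing to swap at this position
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def brittle_model(prompt: str, sentence: str) -> int:
    """Toy model that only follows the instruction if it contains the word 'classify'."""
    if "classify" not in prompt.lower():
        return -1  # the toy model fails to recognize the task
    negative_words = ("terrible", "awful", "boring", "waste")
    return 0 if any(w in sentence.lower() for w in negative_words) else 1


def accuracy(prompt: str, model: Callable[[str, str], int], data: List[Example]) -> float:
    """Fraction of examples the model labels correctly under the given prompt."""
    return sum(model(prompt, sentence) == label for sentence, label in data) / len(data)


def relative_drop(clean_acc: float, attacked_acc: float) -> float:
    """Share of clean accuracy lost under attack (0.0 means no loss)."""
    if clean_acc == 0:
        return 0.0
    return 1.0 - attacked_acc / clean_acc


clean_prompt = "Classify the sentence as positive or negative:"
attacked_prompt = swap_adjacent(clean_prompt, 2)  # introduces a typo in the first word

clean_acc = accuracy(clean_prompt, brittle_model, DATA)
attacked_acc = accuracy(attacked_prompt, brittle_model, DATA)
print(f"clean={clean_acc:.2f} attacked={attacked_acc:.2f} "
      f"drop={relative_drop(clean_acc, attacked_acc):.2f}")
```

A real attack would search over many candidate perturbations and query an actual model; the fixed single swap here only shows how the before-and-after comparison is set up.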
Links
- GitHub Repository: https://github.com/microsoftarchive/promptbench
- Technical Report (Paper): https://arxiv.org/abs/2312.07910
- Documentation: https://promptbench.readthedocs.io/en/latest/
- Leaderboard: https://llm-eval.github.io/pages/leaderboard.html