PromptBench: A Unified Framework for LLM Evaluation and Robustness

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

PromptBench is a Python library for evaluating and understanding Large Language Models (LLMs). It provides a unified framework for measuring model performance, comparing prompt engineering techniques, and testing robustness against adversarial attacks across a range of datasets and models.

Repository Information

Analyzed by OSRepos on July 1, 2026

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

PromptBench is a PyTorch-based Python package developed by Microsoft for evaluating and understanding Large Language Models (LLMs). Described as a unified evaluation framework, it gives researchers user-friendly APIs for four main tasks: quick model performance assessment, prompt engineering research, robustness evaluation against adversarial prompts, and dynamic evaluation to mitigate data contamination. It also supports efficient multi-prompt evaluation and a wide range of language and multi-modal datasets and models. For more technical detail, refer to the project's technical report.

Installation

Getting started with PromptBench is straightforward.

Install via pip:

For a quick setup, you can install the package directly using pip:

pip install promptbench

Note that the pip release may lag slightly behind the latest changes in the repository.

Install via GitHub:

For the most recent features or development, clone the repository and install from source:

git clone https://github.com/microsoftarchive/promptbench.git
cd promptbench

Then, create a conda environment and install the required packages:

conda create --name promptbench python=3.9
conda activate promptbench
pip install -r requirements.txt

For prompt attacks, you will also need to install TextAttack.
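Before running the attack examples, it can help to confirm that both packages are importable in the active environment. The following is a minimal sketch that uses only the Python standard library. It assumes the import names are promptbench and textattack, and the helper missing_packages is a name made up for this illustration, not part of either package.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the two packages mentioned in this section.
    missing = missing_packages(["promptbench", "textattack"])
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the modules can be found; it does not verify versions or GPU support.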

Examples

PromptBench is designed to be easy to use and extend. After installation, you can import it:

import promptbench as pb
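As a rough picture of what an evaluation pipeline involves, the sketch below walks through the usual steps: fill a prompt template, query a model, map the raw output to a label, and compute accuracy. It deliberately does not use PromptBench's API. Every name in it (toy_model, format_prompt, project_label, evaluate) is a hypothetical stand-in, and the "model" is a keyword rule so that the example runs offline. For the actual API, see examples/basic.ipynb in the repository.

```python
from typing import Callable, Dict, List, Tuple

TEMPLATE = "Classify the sentence as positive or negative: {content}"
LABELS: Dict[str, int] = {"negative": 0, "positive": 1}
DATASET: List[Tuple[str, int]] = [
    ("I love this film", 1),
    ("A great, warm story", 1),
    ("Dull and far too long", 0),
    ("Not my favorite, but fine", 1),
]


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: a simple keyword rule."""
    text = prompt.lower()
    return "positive" if ("great" in text or "love" in text) else "negative"


def format_prompt(template: str, content: str) -> str:
    """Fill the template's {content} slot with the sample text."""
    return template.format(content=content)


def project_label(raw_output: str, label_map: Dict[str, int]) -> int:
    """Map free-form model output to a class id; -1 marks an unparseable answer."""
    return label_map.get(raw_output.strip().lower(), -1)


def evaluate(model: Callable[[str], str], template: str,
             dataset: List[Tuple[str, int]], label_map: Dict[str, int]) -> float:
    """Return the accuracy of `model` on `dataset` under one prompt template."""
    correct = 0
    for content, label in dataset:
        raw = model(format_prompt(template, content))
        correct += int(project_label(raw, label_map) == label)
    return correct / len(dataset)


if __name__ == "__main__":
    print(f"accuracy: {evaluate(toy_model, TEMPLATE, DATASET, LABELS):.2f}")
```

Keeping formatting, inference, and output projection as separate steps makes it easy to swap the prompt or the model without touching the scoring code.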

The repository provides several tutorials to help you get familiar with its functionalities:

  • Evaluate models on existing benchmarks: Refer to examples/basic.ipynb for constructing your evaluation pipeline, and examples/multimodal.ipynb for multi-modal evaluations.
  • Test the effects of different prompting techniques.
  • Examine robustness for prompt attacks: See examples/prompt_attack.ipynb for constructing attacks.
  • Use DyVal for evaluation: Refer to examples/dyval.ipynb for constructing DyVal datasets.
  • Efficient multi-prompt evaluation using PromptEval: Check examples/efficient_multi_prompt_eval.ipynb.
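The idea behind dynamic evaluation can be illustrated with a toy generator: instead of reading questions from a fixed file, samples are produced on the fly with a controllable difficulty, and the ground truth is computed alongside each one. The sketch below is only an illustration of that general idea under simple assumptions (random arithmetic expressions, depth as the difficulty knob). It is not DyVal's algorithm, and make_expression and make_samples are hypothetical names; examples/dyval.ipynb shows the real workflow.

```python
import random
from typing import Dict, List, Tuple, Union

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_expression(rng: random.Random, depth: int) -> Tuple[str, int]:
    """Return (expression text, exact value); larger depth means a harder sample."""
    if depth == 0:
        value = rng.randint(1, 9)
        return str(value), value
    op = rng.choice(sorted(OPS))
    left_text, left_value = make_expression(rng, depth - 1)
    right_text, right_value = make_expression(rng, depth - 1)
    return f"({left_text} {op} {right_text})", OPS[op](left_value, right_value)


def make_samples(seed: int, depth: int, count: int) -> List[Dict[str, Union[str, int]]]:
    """Generate `count` fresh question/answer pairs from a seeded generator."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        text, value = make_expression(rng, depth)
        samples.append({"question": f"What is {text}?", "answer": value})
    return samples


if __name__ == "__main__":
    for sample in make_samples(seed=0, depth=2, count=3):
        print(sample["question"], "->", sample["answer"])
```

Because every run with a new seed yields new questions, a model cannot have memorized the test set, which is the contamination concern that dynamic evaluation targets.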

Why Use PromptBench

PromptBench offers several advantages for anyone working with Large Language Models:

  • Unified Framework: It offers a single, consistent API for various evaluation tasks, simplifying research workflows.
  • Comprehensive Evaluation: Supports standard, dynamic (DyVal), and semantic evaluation protocols, along with benchmark results and visualization analysis.
  • Advanced Prompt Engineering: Integrates popular techniques like Chain-of-Thought, EmotionPrompt, and Expert Prompting, allowing for in-depth analysis of their effects.
  • Robustness Assessment: Provides tools to simulate and evaluate black-box adversarial prompt attacks, crucial for understanding model vulnerabilities.
  • Broad Model and Dataset Support: Compatible with a wide range of open-source and proprietary language and multi-modal models, as well as numerous datasets including GLUE, MMLU, Big-Bench Hard, VQAv2, and MMMU.
  • Efficiency: Includes PromptEval for efficient multi-prompt evaluation, which reduces the amount of data required to predict performance accurately.
  • Active Development: Updated with support for new models (e.g., GPT-4o, Gemini, Mistral) and datasets.
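To make the robustness item above concrete, the sketch below shows one simple form of black-box prompt attack: try every single adjacent-letter swap in the instruction, query the model with each variant, and keep the variant that hurts accuracy most. Only model outputs are used, never gradients or weights. This is a toy under stated assumptions, not one of PromptBench's attack implementations (see examples/prompt_attack.ipynb for those). brittle_model is a hypothetical stand-in that fails when the word "sentiment" is misspelled, and all function names are invented for this example.

```python
from typing import Callable, List, Tuple

TEMPLATE = "Decide the sentiment of this review: {content}"
LABELS = {"negative": 0, "positive": 1}
REVIEWS: List[Tuple[str, int]] = [
    ("The food was good", 1),
    ("Service was slow", 0),
    ("A good value overall", 1),
    ("Cold and bland", 0),
]


def brittle_model(prompt: str) -> str:
    """Hypothetical model: ignores the task unless it sees the exact keyword."""
    text = prompt.lower()
    if "sentiment" not in text:
        return "negative"
    return "positive" if "good" in text else "negative"


def adjacent_swaps(template: str) -> List[str]:
    """All templates with one pair of adjacent interior letters swapped in one word."""
    words = template.split(" ")
    variants = []
    for i, word in enumerate(words):
        if len(word) < 4 or not word.isalpha():
            continue  # skips short words, words with punctuation, and the {content} slot
        for j in range(1, len(word) - 2):
            swapped = word[:j] + word[j + 1] + word[j] + word[j + 2:]
            if swapped != word:
                variants.append(" ".join(words[:i] + [swapped] + words[i + 1:]))
    return variants


def accuracy(model: Callable[[str], str], template: str,
             data: List[Tuple[str, int]]) -> float:
    hits = sum(int(LABELS.get(model(template.format(content=text)), -1) == label)
               for text, label in data)
    return hits / len(data)


def worst_case(model: Callable[[str], str], template: str,
               data: List[Tuple[str, int]]) -> Tuple[str, float]:
    """Return the perturbed template with the lowest accuracy, and that accuracy."""
    scored = [(variant, accuracy(model, variant, data)) for variant in adjacent_swaps(template)]
    return min(scored, key=lambda pair: pair[1])


def relative_drop(clean: float, attacked: float) -> float:
    """Fraction of clean accuracy lost under attack (0.0 if clean accuracy is 0)."""
    return 0.0 if clean == 0 else (clean - attacked) / clean


if __name__ == "__main__":
    clean = accuracy(brittle_model, TEMPLATE, REVIEWS)
    attacked_template, attacked = worst_case(brittle_model, TEMPLATE, REVIEWS)
    print(attacked_template, clean, attacked, relative_drop(clean, attacked))
```

Reporting the relative drop rather than raw accuracy makes results comparable across models whose clean accuracy differs.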

Related repositories

Similar repositories that may be relevant next.

LangTest: A Comprehensive Library for Safe & Effective Language Models

June 30, 2026

LangTest is an open-source Python library dedicated to ensuring the safety and effectiveness of language models. It offers a comprehensive framework for testing model quality, covering robustness, bias, fairness, and accuracy across various NLP tasks and LLM providers. With LangTest, developers can generate and execute over 60 distinct test types with just one line of code, promoting responsible AI development.

Tags: ai-safety, ai-testing, large-language-models

EvalPlus: Rigorous Evaluation for LLM-Synthesized Code

June 30, 2026

EvalPlus is a robust framework designed for the rigorous evaluation of code generated by Large Language Models (LLMs). It extends standard benchmarks like HumanEval and MBPP with significantly more tests, offering precise assessment of code correctness and efficiency. This tool is crucial for developers and researchers aiming to thoroughly validate LLM-synthesized code.

Tags: benchmark, large-language-models, program-synthesis

XGrammar: Fast, Flexible, and Portable Structured Generation for LLMs

June 27, 2026

XGrammar is an open-source library for efficient, flexible, and portable structured generation, developed by mlc-ai. It leverages constrained decoding to guarantee 100% structural correctness for outputs like JSON and regex. Optimized for near-zero overhead, XGrammar offers universal deployment across various platforms, hardware, and programming languages, making it a leading solution for structured output from large language models.

Tags: large-language-models, structured-generation, C++

LLM Guard: The Security Toolkit for LLM Interactions

June 26, 2026

LLM Guard is an open-source security toolkit developed by Protect AI, designed to fortify the safety of Large Language Models. It offers comprehensive protection against various threats, including prompt injection, data leakage, and harmful language, ensuring secure and reliable LLM interactions.

Tags: llm-security, prompt-injection, large-language-models
