PromptBench: A Unified Framework for LLM Evaluation and Robustness

Summary

PromptBench is a comprehensive Python library designed for the evaluation and understanding of Large Language Models (LLMs). It provides a unified framework for assessing model performance, exploring various prompt engineering techniques, and evaluating robustness against adversarial attacks. This tool empowers researchers to conduct in-depth analyses of LLMs across diverse datasets and models.

Introduction

PromptBench is a PyTorch-based Python package developed by Microsoft for evaluating and understanding Large Language Models (LLMs). As a unified evaluation framework, it gives researchers user-friendly APIs for four main tasks: quickly assessing model performance, studying prompt engineering techniques, evaluating robustness against adversarial prompts, and running dynamic evaluation to mitigate data contamination. It also supports efficient multi-prompt evaluation and a wide range of language and multi-modal datasets and models. For more detail, see the project's technical report.

Installation

Getting started with PromptBench is straightforward.

Install via `pip`:

For a quick setup, you can install the package directly using pip:

pip install promptbench

Note that the release installed by pip may lag slightly behind the latest changes in the repository.

Install via GitHub:

For the most recent features or development, clone the repository and install from source:

git clone https://github.com/microsoftarchive/promptbench.git
cd promptbench

Then, create a conda environment and install the required packages:

conda create --name promptbench python=3.9
conda activate promptbench
pip install -r requirements.txt

For prompt attacks, you will also need to install TextAttack.

Examples

PromptBench is designed to be easy to use and extend. After installation, you can import it:

import promptbench as pb

The repository provides several tutorials to help you get familiar with its functionalities:

Evaluate models on existing benchmarks: Refer to examples/basic.ipynb for constructing your evaluation pipeline, and examples/multimodal.ipynb for multi-modal evaluations.
Test the effects of different prompting techniques.
Examine robustness for prompt attacks: See examples/prompt_attack.ipynb for constructing attacks.
Use DyVal for evaluation: Refer to examples/dyval.ipynb for constructing DyVal datasets.
Efficient multi-prompt evaluation using PromptEval: Check examples/efficient_multi_prompt_eval.ipynb.
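
The evaluation pipeline described in the first tutorial comes down to a loop: fill a prompt template with each sample, query the model, map the raw output to a label, and score the result. The sketch below illustrates that flow in plain Python without importing PromptBench. The names `toy_model`, `project_label`, and `evaluate`, along with the three-sample dataset, are hypothetical stand-ins chosen for illustration and are not the library's API; the notebooks remain the reference for the real calls.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in data; a real run would load a benchmark dataset.
DATASET: List[Dict[str, object]] = [
    {"content": "a wonderful, moving film", "label": 1},
    {"content": "dull and far too long", "label": 0},
    {"content": "great acting and a great script", "label": 1},
]

# Several prompt templates are evaluated against the same data.
PROMPTS: List[str] = [
    "Classify the sentence as positive or negative: {content}",
    "Is the sentiment of this review positive or negative? {content}",
]


def toy_model(text: str) -> str:
    """Hypothetical model stub: a keyword rule in place of an LLM call."""
    lowered = text.lower()
    if "wonderful" in lowered or "great" in lowered:
        return " Positive."  # deliberately noisy, like real model output
    return "Negative\n"


def project_label(raw_output: str) -> int:
    """Map free-form output to a class id; -1 marks an unparseable answer."""
    cleaned = raw_output.strip().strip(".").lower()
    return {"positive": 1, "negative": 0}.get(cleaned, -1)


def evaluate(prompt: str, model: Callable[[str], str]) -> float:
    """Return the accuracy of `model` on DATASET for one prompt template."""
    correct = 0
    for sample in DATASET:
        input_text = prompt.format(content=sample["content"])
        prediction = project_label(model(input_text))
        correct += int(prediction == sample["label"])
    return correct / len(DATASET)


scores = {prompt: evaluate(prompt, toy_model) for prompt in PROMPTS}
for prompt, accuracy in scores.items():
    print(f"{accuracy:.2f}  {prompt}")
```

Keeping the output projection separate from the model call means the same scoring code can be reused when the model or the prompt set changes.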

Why Use PromptBench

PromptBench offers several advantages for work on LLM evaluation:

Unified Framework: It offers a single, consistent API for various evaluation tasks, simplifying research workflows.
Comprehensive Evaluation: Supports standard, dynamic (DyVal), and semantic evaluation protocols, along with benchmark results and visualization analysis.
Advanced Prompt Engineering: Integrates popular techniques like Chain-of-Thought, EmotionPrompt, and Expert Prompting, allowing for in-depth analysis of their effects.
Robustness Assessment: Provides tools to simulate and evaluate black-box adversarial prompt attacks, crucial for understanding model vulnerabilities.
Broad Model and Dataset Support: Compatible with a wide range of open-source and proprietary language and multi-modal models, as well as numerous datasets including GLUE, MMLU, Big-Bench Hard, VQAv2, and MMMU.
Efficiency: Includes methods like PromptEval for efficient multi-prompt evaluation, significantly reducing the data required for accurate performance prediction.
Active Development: Updated with support for new models (e.g., GPT-4o, Gemini, Mistral) and datasets.
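
To make the robustness item concrete: a black-box prompt attack perturbs the instruction text without access to model internals, and its effect is measured by comparing accuracy before and after the perturbation. The sketch below uses a simple adjacent-character swap and a relative accuracy drop. It is an illustration under stated assumptions, not PromptBench's attack implementation: `swap_adjacent_chars`, `performance_drop`, and the deliberately brittle `toy_model` are hypothetical names, and the real attacks are covered in examples/prompt_attack.ipynb.

```python
import random
from typing import List, Tuple

# Hypothetical labelled samples (1 = positive, 0 = negative).
SAMPLES: List[Tuple[str, int]] = [
    ("a wonderful, moving film", 1),
    ("dull and far too long", 0),
    ("great acting and a great script", 1),
    ("a tedious mess", 0),
]


def swap_adjacent_chars(prompt: str, n_swaps: int, rng: random.Random) -> str:
    """Character-level perturbation: swap up to n_swaps adjacent letter pairs."""
    chars = list(prompt)
    candidates = [
        i for i in range(len(chars) - 1)
        if chars[i].isalpha() and chars[i + 1].isalpha()
    ]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def toy_model(instruction: str, text: str) -> int:
    """Hypothetical brittle model: it only follows the task when the keyword
    'sentiment' survives intact; otherwise it always answers 0."""
    if "sentiment" not in instruction.lower():
        return 0
    return int(any(word in text for word in ("wonderful", "great")))


def accuracy(instruction: str) -> float:
    """Accuracy of toy_model on SAMPLES under the given instruction."""
    hits = sum(toy_model(instruction, text) == label for text, label in SAMPLES)
    return hits / len(SAMPLES)


def performance_drop(clean: float, attacked: float) -> float:
    """Relative accuracy loss caused by the perturbation (0.0 means no loss)."""
    return 0.0 if clean == 0 else (clean - attacked) / clean


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
clean_prompt = "Classify the sentiment of the review as positive or negative."
attacked_prompt = swap_adjacent_chars(clean_prompt, n_swaps=5, rng=rng)

clean_acc = accuracy(clean_prompt)
attacked_acc = accuracy(attacked_prompt)
print(attacked_prompt)
print(f"clean={clean_acc:.2f} attacked={attacked_acc:.2f} "
      f"drop={performance_drop(clean_acc, attacked_acc):.2f}")
```

Only the instruction is perturbed; the sample text is passed separately so that the attack measures sensitivity to the prompt rather than to corrupted inputs.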

Related repositories

LangTest: A Comprehensive Library for Safe & Effective Language Models

EvalPlus: Rigorous Evaluation for LLM-Synthesized Code

XGrammar: Fast, Flexible, and Portable Structured Generation for LLMs

LLM Guard: The Security Toolkit for LLM Interactions
