EasyJailbreak: A Python Framework for Adversarial LLM Jailbreak Prompts

Summary

EasyJailbreak is a Python framework for generating adversarial jailbreak prompts for Large Language Models (LLMs). It decomposes the jailbreaking process into iterative steps and provides components for mutation, attack, and evaluation. It is aimed at researchers and developers who work on LLM security and want to understand model vulnerabilities.

Introduction

EasyJailbreak is an easy-to-use Python framework for researchers and developers working on Large Language Model (LLM) security. It generates adversarial jailbreak prompts by assembling interchangeable methods. The framework decomposes the mainstream jailbreaking process into several iterative steps: initializing mutation seeds, selecting suitable seeds, adding constraints, mutating, attacking, and evaluating. This modular design makes it a flexible playground for further research on LLM safety and vulnerabilities.
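
The six steps above can be read as a plain search loop. The following is a minimal, self-contained sketch of that control flow only. It does not use EasyJailbreak's classes, and every name in it (`run_pipeline`, `toy_target`, `toy_evaluate`, the mutators) is a hypothetical stand-in. The "target" is a stub function rather than a real model, and the placement of the constraint check after mutation is one possible ordering, not necessarily the one a given recipe uses.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical type aliases for illustration; not part of EasyJailbreak.
Mutator = Callable[[str], str]
Scored = Tuple[str, float]


def run_pipeline(seeds: List[str],
                 mutators: List[Mutator],
                 constraint: Callable[[str], bool],
                 target: Callable[[str], str],
                 evaluate: Callable[[str], float],
                 iterations: int = 5,
                 rng_seed: int = 0) -> Scored:
    """Run a select -> mutate -> constrain -> query -> evaluate loop."""
    rng = random.Random(rng_seed)
    # 1. Initialize the seed pool, scoring each seed once.
    pool: List[Scored] = [(s, evaluate(target(s))) for s in seeds]
    for _ in range(iterations):
        # 2. Select the best-scoring entry so far.
        parent, _parent_score = max(pool, key=lambda item: item[1])
        # 3. Mutate it with a randomly chosen mutator.
        candidate = rng.choice(mutators)(parent)
        # 4. Drop candidates that violate the constraint.
        if not constraint(candidate):
            continue
        # 5. Query the target, then 6. evaluate its response.
        response = target(candidate)
        pool.append((candidate, evaluate(response)))
    return max(pool, key=lambda item: item[1])


# Toy components: a stub "model" and a score that just counts "!" characters.
def toy_target(prompt: str) -> str:
    return prompt.upper()


def toy_evaluate(response: str) -> float:
    return float(response.count("!"))


toy_mutators: List[Mutator] = [lambda p: p + "!", lambda p: p.strip()]

best_prompt, best_score = run_pipeline(
    seeds=["hello"],
    mutators=toy_mutators,
    constraint=lambda p: len(p) <= 10,  # length budget as a stand-in constraint
    target=toy_target,
    evaluate=toy_evaluate,
    iterations=20,
)
print(best_prompt, best_score)
```

Passing each step in as a callable mirrors the modular idea described above: any one step can be swapped without touching the loop.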

For more detail, refer to the official paper, explore jailbreak results for different LLMs on the EasyJailbreak website, and consult the documentation for API and parameter explanations.

Installation

To get started with EasyJailbreak, make sure Python 3.9 or later is installed. There are two installation methods:

For users who only require the collected approaches (recipes):
```
pip install easyjailbreak
```

For users interested in adding new components (e.g., new mutate or evaluate methods):

```
git clone https://github.com/EasyJailbreak/EasyJailbreak.git
cd EasyJailbreak
pip install -e .
```
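
After either install method, a quick sanity check can confirm the interpreter version and that the package is importable. This is a small sketch using only the standard library. The helper name `check_environment` is made up for illustration, and the import name `easyjailbreak` is taken from the example imports further down this page.

```python
import importlib.util
import sys
from typing import Dict, Tuple


def check_environment(min_version: Tuple[int, int] = (3, 9),
                      package: str = "easyjailbreak") -> Dict[str, bool]:
    """Report whether Python is new enough and the package can be found."""
    return {
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when a top-level package is not installed.
        "package_found": importlib.util.find_spec(package) is not None,
    }


status = check_environment()
print(status)
```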

Examples

EasyJailbreak provides a straightforward API for running its pre-implemented attack "recipes" against various models. The following example uses the PAIR recipe:

```python
from easyjailbreak.attacker.PAIR_chao_2023 import PAIR
from easyjailbreak.datasets import JailbreakDataset
from easyjailbreak.models.huggingface_model import from_pretrained
from easyjailbreak.models.openai_model import OpenaiModel

# First, prepare models and datasets.
attack_model = from_pretrained(model_name_or_path='lmsys/vicuna-13b-v1.5',
                               model_name='vicuna_v1.1')
target_model = OpenaiModel(model_name='gpt-4',
                           api_keys='INPUT YOUR KEY HERE!!!')
eval_model = OpenaiModel(model_name='gpt-4',
                         api_keys='INPUT YOUR KEY HERE!!!')
dataset = JailbreakDataset('AdvBench')

# Then instantiate the recipe.
attacker = PAIR(attack_model=attack_model,
                target_model=target_model,
                eval_model=eval_model,
                jailbreak_datasets=dataset)

# Finally, start jailbreaking.
attacker.attack(save_path='vicuna-13b-v1.5_gpt4_gpt4_AdvBench_result.jsonl')
```
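
The `save_path` above ends in `.jsonl`, which suggests one JSON object per line. Assuming that layout, the results can be inspected with the standard library alone. The sketch below is illustrative: `read_jsonl` is a hypothetical helper, the record fields are not documented here, and the demo file with its `id` field is invented so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Demo on a throwaway file; replace `demo` with the real save_path in practice.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_result.jsonl"
    demo.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    records = list(read_jsonl(demo))

print(len(records), "records; keys of first record:", sorted(records[0]))
```

Listing the keys of the first record is a quick way to discover which fields a given recipe actually writes before building further analysis on them.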

For more advanced customization, such as loading models and datasets, initializing seeds, and instantiating individual components (Selectors, Mutators, Constraints, Evaluators), refer to the documentation.

Why Use EasyJailbreak?

EasyJailbreak stands out as a valuable tool for several reasons:

Ease of Use: It offers an intuitive Python framework, simplifying the complex process of generating adversarial prompts.
Modular Design: The framework's decomposition into distinct, iterable steps allows for flexible experimentation and the development of custom attack methods.
Comprehensive Recipes: It collects and implements numerous attack recipes from relevant papers, providing a ready-to-use toolkit for evaluating LLM vulnerabilities.
LLM Security Focus: Designed specifically for LLM security research, it helps identify and understand potential weaknesses in large language models.
Extensibility: Researchers can easily integrate new components, such as novel mutation techniques or evaluation metrics, to push the boundaries of LLM safety research.