PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning

Summary
PEFT (Parameter-Efficient Fine-Tuning) is a library from Hugging Face for efficiently adapting large pretrained models to various downstream applications. It dramatically reduces computational and storage costs by fine-tuning only a small subset of model parameters while achieving performance comparable to fully fine-tuned models, making advanced AI accessible on more modest hardware.
Introduction
PEFT, or Parameter-Efficient Fine-Tuning, is a state-of-the-art library developed by Hugging Face that provides methods for efficiently adapting large pretrained models to various downstream applications. Fine-tuning massive models is often prohibitively costly due to their scale, requiring significant computational and storage resources. PEFT addresses this challenge by enabling the adaptation of these models through fine-tuning only a small number of (extra) model parameters, rather than all of them. This approach significantly decreases computational and storage costs, making advanced AI techniques more accessible. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models.
PEFT is seamlessly integrated with popular libraries like Transformers for easy model training and inference, Diffusers for conveniently managing different adapters, and Accelerate for distributed training and inference, even for very large models.
Installation
To get started with PEFT, you can easily install it using pip:
pip install peft
Examples
Here are quick examples demonstrating how to prepare a model for training with a PEFT method like LoRA, and how to load a PEFT model for inference.
Preparing a Model for Training
This example shows how to wrap a base model and a PEFT configuration with get_peft_model. For the Qwen/Qwen2.5-3B-Instruct model, only about 0.12% of the parameters are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model
import torch
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type=TaskType.CAUSAL_LM,
    # target_modules=["q_proj", "v_proj", ...] # optionally indicate target modules
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# prints: trainable params: 3,686,400 || all params: 3,089,625,088 || trainable%: 0.1193
# now perform training on your dataset, e.g. using transformers Trainer, then save the model
model.save_pretrained("qwen2.5-3b-lora")
Loading a PEFT Model for Inference
To load a PEFT model for inference, you can use PeftModel.from_pretrained:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "qwen2.5-3b-lora")
inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")
outputs = model.generate(**inputs.to(device), max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Why Use PEFT?
PEFT offers numerous benefits, primarily significant savings in compute and storage, making it applicable to a wide range of use cases.
High Performance on Consumer Hardware
PEFT methods like LoRA enable fine-tuning large models that would otherwise be impossible on consumer-grade GPUs due to memory constraints. For instance, a 12B-parameter model that causes an out-of-memory error on an 80GB GPU when fully fine-tuned can instead be fine-tuned with LoRA in roughly 56GB of GPU memory. Furthermore, PEFT models often achieve performance comparable to fully fine-tuned models at a fraction of the GPU memory.
Quantization
Quantization is another technique for reducing model memory requirements by representing weights in lower precision. PEFT methods can be combined with quantization to further reduce the memory needed to train and serve Large Language Models (LLMs), even on hardware with limited resources.
Save Compute and Storage
By fine-tuning only a small fraction of a model's parameters, PEFT helps save substantial storage. Each PEFT adapter checkpoint is typically only a few megabytes in size, compared to gigabytes for fully fine-tuned models. These smaller adapters demonstrate performance comparable to their fully fine-tuned counterparts, allowing for efficient adaptation across many datasets without concerns about catastrophic forgetting or overfitting the base model.
PEFT Integrations
PEFT is widely supported across the Hugging Face ecosystem due to its efficiency benefits:
- Diffusers: Reduces memory requirements for training iterative diffusion processes, such as Stable Diffusion models with LoRA, resulting in significantly smaller checkpoints.
- Transformers: Directly integrated, allowing users to easily add, load, and switch between different PEFT adapters on Transformers models.
- Accelerate: Works out-of-the-box with Accelerate, simplifying distributed training and inference for very large models across various hardware setups.
- TRL: Can be applied to training LLMs with Reinforcement Learning from Human Feedback (RLHF) components, including rankers and policies, enabling advanced fine-tuning techniques.
Links
- GitHub Repository: https://github.com/huggingface/peft
- PEFT Documentation: https://huggingface.co/docs/peft/en/index
- Hugging Face PEFT Organization: https://huggingface.co/PEFT