PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning

Summary
PEFT (Parameter-Efficient Fine-Tuning) is a library from Hugging Face for efficiently adapting large pretrained models to various downstream applications. It dramatically reduces computational and storage costs by fine-tuning only a small subset of model parameters while achieving performance comparable to fully fine-tuned models, making advanced AI accessible on more modest hardware.
Introduction
PEFT, or Parameter-Efficient Fine-Tuning, is a state-of-the-art library developed by Hugging Face that provides methods for efficiently adapting large pretrained models to various downstream applications. Fine-tuning massive models is often prohibitively costly due to their scale, requiring significant computational and storage resources. PEFT addresses this challenge by enabling the adaptation of these models through fine-tuning only a small number of (extra) model parameters, rather than all of them. This approach significantly decreases computational and storage costs, making advanced AI techniques more accessible. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models.
PEFT is seamlessly integrated with popular libraries like Transformers for easy model training and inference, Diffusers for conveniently managing different adapters, and Accelerate for distributed training and inference, even for very large models.
Installation
To get started with PEFT, you can easily install it using pip:
pip install peft
Examples
Here are quick examples demonstrating how to prepare a model for training with a PEFT method like LoRA, and how to load a PEFT model for inference.
Preparing a Model for Training
This example shows how to wrap a base model and a PEFT configuration with get_peft_model. For the Qwen/Qwen2.5-3B-Instruct model, only about 0.12% of the parameters are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model
import torch
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type=TaskType.CAUSAL_LM,
    # target_modules=["q_proj", "v_proj", ...] # optionally indicate target modules
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# prints: trainable params: 3,686,400 || all params: 3,089,625,088 || trainable%: 0.1193
# now perform training on your dataset, e.g. using transformers Trainer, then save the model
model.save_pretrained("qwen2.5-3b-lora")
Loading a PEFT Model for Inference
To load a PEFT model for inference, you can use PeftModel.from_pretrained:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "qwen2.5-3b-lora")
inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")
outputs = model.generate(**inputs.to(device), max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Why Use PEFT?
PEFT offers numerous benefits, primarily significant savings in compute and storage, making it applicable to a wide range of use cases.
High Performance on Consumer Hardware
PEFT methods like LoRA enable fine-tuning large models that would otherwise be impossible on consumer-grade GPUs due to memory constraints. For instance, a 12B-parameter model that causes an out-of-memory error on an 80GB GPU when fully fine-tuned can instead be fine-tuned with LoRA in roughly 56GB of GPU memory. Furthermore, PEFT models often achieve performance comparable to fully fine-tuned models at a fraction of the GPU memory.
Quantization
Quantization is another technique for reducing model memory requirements by representing weights in lower precision. PEFT methods can be combined with quantization to further reduce the memory needed to train and serve Large Language Models (LLMs), even on hardware with limited resources.
Save Compute and Storage
By fine-tuning only a small fraction of a model's parameters, PEFT helps save substantial storage. Each PEFT adapter checkpoint is typically only a few megabytes in size, compared to gigabytes for fully fine-tuned models. These smaller adapters demonstrate performance comparable to their fully fine-tuned counterparts, allowing for efficient adaptation across many datasets without concerns about catastrophic forgetting or overfitting the base model.
PEFT Integrations
PEFT is widely supported across the Hugging Face ecosystem due to its efficiency benefits:
- Diffusers: Reduces memory requirements for training iterative diffusion processes, such as Stable Diffusion models with LoRA, resulting in significantly smaller checkpoints.
- Transformers: Directly integrated, allowing users to easily add, load, and switch between different PEFT adapters on Transformers models.
- Accelerate: Works out-of-the-box with Accelerate, simplifying distributed training and inference for very large models across various hardware setups.
- TRL: Can be applied to training LLMs with Reinforcement Learning from Human Feedback (RLHF) components, including rankers and policies, enabling advanced fine-tuning techniques.
Links
- GitHub Repository: https://github.com/huggingface/peft
- PEFT Documentation: https://huggingface.co/docs/peft/en/index
- Hugging Face PEFT Organization: https://huggingface.co/PEFT