LLM Compressor: Optimize LLMs for Deployment with vLLM

Introduction

LLM Compressor is a Transformers-compatible Python library developed by the vLLM Project. It applies compression algorithms to Large Language Models (LLMs) so that they can be deployed efficiently, particularly with vLLM. The library provides quantization algorithms and transforms for weights, activations, the KV cache, and attention.

Key features include integration with Hugging Face models and repositories, saving models in the compressed-tensors format that vLLM can load, and support for DDP and disk offloading when compressing very large models. For more background, see the official announcement blog post.

Installation

Getting started with LLM Compressor is straightforward. You can install it using pip:

pip install llmcompressor
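
To confirm the install landed in the environment you expect, one option is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration and is not part of LLM Compressor.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise a hint.
version = installed_version("llmcompressor")
print(version if version is not None else "llmcompressor is not installed")
```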

Examples

LLM Compressor provides extensive documentation and examples to guide users through the compression process. You can refer to the step-by-step compression guide and User Guides for detailed information.

The following quick tour quantizes a model, Qwen3-30B-A3B, to FP8 weights and activations using the Round-to-Nearest (RTN) algorithm:

Apply Quantization

from compressed_tensors.offload import dispatch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8 using RTN with block_size 128
#   * quantize the activations dynamically to FP8 during inference
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_model(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("===========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-BLOCK"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
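
The ignore list in the recipe mixes a plain module name ("lm_head") with a "re:"-prefixed regular expression. The sketch below illustrates how such a list could select modules. It assumes that "re:" entries are regular expressions matched from the start of the module name and that all other entries must equal the name exactly; the helper is_ignored and the example module names are hypothetical, and the library's real matching logic may differ in detail.

```python
import re

IGNORE = ["lm_head", "re:.*mlp.gate$"]


def is_ignored(module_name: str, ignore: list) -> bool:
    """Return True if `module_name` matches any entry of the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: the text after "re:" is a regex applied from the
            # start of the module name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            # Assumption: plain entries must equal the module name exactly.
            return True
    return False


# Hypothetical module names in the style of a Mixture-of-Experts model.
for name in [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.self_attn.q_proj",
]:
    print(f"{name}: {'ignored' if is_ignored(name, IGNORE) else 'quantized'}")
```

Under these assumptions the pattern skips the MoE router ("mlp.gate") while the per-expert gate_proj layers are still quantized, because the trailing "$" requires the name to end in "gate".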

Inference with vLLM

Checkpoints created by llmcompressor can be seamlessly loaded and run in vLLM:

Install vLLM:

pip install vllm

Run inference:

from vllm import LLM

# Point at the directory written by save_pretrained above (SAVE_DIR).
model = LLM("Qwen3-30B-A3B-FP8-BLOCK")
output = model.generate("My name is")
print(output[0].outputs[0].text)
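
Before handing a checkpoint to vLLM, it can be useful to check that the saved directory actually records quantization settings. The sketch below assumes the checkpoint follows the usual Hugging Face layout, with a config.json whose "quantization_config" entry describes the compression; the helper read_quantization_config is hypothetical and uses only the standard library.

```python
import json
from pathlib import Path


def read_quantization_config(checkpoint_dir: str):
    """Return the "quantization_config" entry of config.json, or None if missing."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: compressed checkpoints store their settings under this key.
    return config.get("quantization_config")


# Directory name taken from SAVE_DIR in the quantization script.
quant_config = read_quantization_config("Qwen3-30B-A3B-FP8-BLOCK")
if quant_config is None:
    print("No quantization_config found (missing directory or uncompressed model).")
else:
    print(json.dumps(quant_config, indent=2))
```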

The library supports a wide array of quantization types and algorithms, including:

Weight and Activation Quantization: Examples for int8, fp8, MXFP8, fp4 (NVFP4, MXFP4), and fp8 with int4 weights.
Weight Only Quantization: Examples for fp4 (NVFP4, MXFP4), and int4 using GPTQ, AWQ, or AutoRound.
Attention and KV Cache Quantization: Examples for fp8 and NVFP4.
Architecture-Specific Quantization: Guides for MoE LLMs, Vision-Language Models, and Audio-Language Models.
Big Model Quantization Support: Techniques like sequential onloading and disk offloading for very large models.

Why Use LLM Compressor?

LLM Compressor offers significant advantages for anyone working with large language models:

Optimized Deployment: Achieve faster inference and reduced memory footprint for LLMs, crucial for efficient deployment.
Comprehensive Algorithms: Access a rich set of quantization algorithms, including Simple PTQ, GPTQ, AWQ, SmoothQuant, AutoRound, and Rotation-based methods, allowing flexibility to choose the best approach for your model.
Hugging Face Integration: Seamlessly work with models from the Hugging Face ecosystem, simplifying the compression workflow.
vLLM Compatibility: Generate checkpoints directly compatible with vLLM, ensuring smooth integration into high-performance inference pipelines.
Support for Diverse Models: Quantize various model architectures, including Mixture-of-Experts (MoE), Vision-Language, and Audio-Language models, along with support for very large models through advanced techniques.
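
To make the memory-footprint point concrete, here is a back-of-the-envelope estimate of weight storage at different bit widths. It is a rough sketch under stated assumptions: the parameter count is read off the "30B" in the model name, every weight is treated as quantized, and quantization scales, ignored layers, activations, and the KV cache are left out, so real checkpoints will differ.

```python
GIB = 1024**3


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold `num_params` weights at the given bit width, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# Assumption: roughly 30 billion parameters, taken from the model name.
NUM_PARAMS = 30e9

for label, bits in [("16-bit", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the bit width halves the weight storage, which is why FP8 and 4-bit formats matter most for models that otherwise do not fit on the available accelerators.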
