LLM Compressor: Optimize LLMs for Deployment with vLLM
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
LLM Compressor is a Transformers-compatible Python library that applies compression algorithms to Large Language Models (LLMs). It targets optimized deployment, especially with vLLM, through quantization of weights, activations, and the KV cache, and it works directly with Hugging Face models.
Introduction
LLM Compressor is a Transformers-compatible Python library developed by the vLLM Project. It applies compression algorithms to Large Language Models (LLMs) so that they can be deployed efficiently, particularly with vLLM. The library offers a suite of quantization algorithms and transforms for weights, activations, KV Cache, and attention mechanisms.
Key features include integration with Hugging Face models and repositories, saving models in the compressed-tensors format that vLLM can load, and support for DDP and disk offloading so that very large models can be compressed. For a deeper dive, read the official announcement blog listed under Links.
Installation
Getting started with LLM Compressor is straightforward. You can install it using pip:
pip install llmcompressor
Examples
LLM Compressor provides extensive documentation and examples to guide users through the compression process. You can refer to the step-by-step compression guide and User Guides for detailed information.
Here's a quick tour that quantizes a model, in this case Qwen3-30B-A3B, to FP8 weights and activations using the Round-to-Nearest (RTN) algorithm:
Apply Quantization
from compressed_tensors.offload import dispatch_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
MODEL_ID = "Qwen/Qwen3-30B-A3B"
# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to FP8 using RTN with block_size 128
# * quantize the activations dynamically to FP8 during inference
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head", "re:.*mlp.gate$"],
)
# Apply quantization.
oneshot(model=model, recipe=recipe)
# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_model(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("===========================================")
# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-BLOCK"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
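The ignore list in the recipe mixes a plain module name (lm_head) with a "re:"-prefixed regular expression. As a rough illustration of what such a list selects, the sketch below uses a hypothetical helper, is_ignored, and made-up module names. It assumes that plain entries match a module's full name or its last dotted component, and that "re:" entries are regular expressions matched from the start of the full name. The library's own matching rules may differ, so treat this as an approximation rather than its implementation.

```python
import re


# Hypothetical helper, not part of llmcompressor: approximates how an
# `ignore` list could be applied to dotted module names.
def is_ignored(module_name: str, ignore: list[str]) -> bool:
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: "re:" entries are regexes matched from the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry or module_name.endswith("." + entry):
            # Assumption: plain entries match the full name or its last component.
            return True
    return False


IGNORE = ["lm_head", "re:.*mlp.gate$"]

# Illustrative module names only; real names depend on the model architecture.
names = [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.self_attn.q_proj",
]

skipped = [n for n in names if is_ignored(n, IGNORE)]
quantized = [n for n in names if not is_ignored(n, IGNORE)]
print("skipped:  ", skipped)
print("quantized:", quantized)
```

Under these assumptions, the trailing $ in the pattern is what keeps a name such as mlp.gate_proj from being skipped along with mlp.gate.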
Inference with vLLM
Checkpoints created by llmcompressor can be seamlessly loaded and run in vLLM:
Install vLLM:
pip install vllm
Run inference:
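from vllm import LLM
# Load the checkpoint from the local directory written by save_pretrained above.
model = LLM("Qwen3-30B-A3B-FP8-BLOCK")
output = model.generate("My name is")
print(output[0].outputs[0].text)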
The library supports a wide array of quantization types and algorithms, including:
- Weight and Activation Quantization: Examples for int8, fp8, MXFP8, fp4 (NVFP4, MXFP4), and fp8 with int4 weights.
- Weight Only Quantization: Examples for fp4 (NVFP4, MXFP4), and int4 using GPTQ, AWQ, or AutoRound.
- Attention and KV Cache Quantization: Examples for fp8 and NVFP4.
- Architecture-Specific Quantization: Guides for MoE LLMs, Vision-Language Models, and Audio-Language Models.
- Big Model Quantization Support: Techniques like sequential onloading and disk offloading for very large models.
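To make the idea of round-to-nearest quantization with per-block scales more concrete, here is a minimal pure-Python sketch. It is not the library's implementation. For readability it swaps in a symmetric 8-bit integer grid instead of an FP8 floating-point grid, uses one-dimensional blocks of 4 values instead of blocks of 128, and all function names are hypothetical.

```python
def rtn_quantize_block(block, qmax=127):
    """Round each value to the nearest point on a symmetric integer grid.

    One scale is shared by the whole block (illustrative stand-in for FP8).
    """
    max_abs = max(abs(x) for x in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return q, scale


def rtn_quantize(weights, block_size, qmax=127):
    """Split a flat weight list into blocks and quantize each one independently."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [rtn_quantize_block(b, qmax) for b in blocks]


def dequantize(quantized):
    """Map integer codes back to floats using each block's scale."""
    return [q * scale for codes, scale in quantized for q in codes]


def max_abs_error(original, restored):
    return max(abs(o - r) for o, r in zip(original, restored))


# Made-up weights: one block of small values, one block with a large outlier.
weights = [0.02, -0.01, 0.03, 0.015, 4.0, -2.5, 1.0, 0.5]

blockwise = dequantize(rtn_quantize(weights, block_size=4))
per_tensor = dequantize(rtn_quantize(weights, block_size=len(weights)))

# Compare the error on the small-valued first block under both scalings.
block_err = max_abs_error(weights[:4], blockwise[:4])
tensor_err = max_abs_error(weights[:4], per_tensor[:4])
print("block-wise error on small values:", block_err)
print("per-tensor error on small values:", tensor_err)
```

In this toy example the outlier in the second block no longer dictates the scale of the first block, which is the usual motivation for block-wise or group-wise scales.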
Why Use LLM Compressor?
LLM Compressor offers significant advantages for anyone working with large language models:
- Optimized Deployment: Achieve faster inference and reduced memory footprint for LLMs, crucial for efficient deployment.
- Comprehensive Algorithms: Access a rich set of quantization algorithms, including Simple PTQ, GPTQ, AWQ, SmoothQuant, AutoRound, and Rotation-based methods, allowing flexibility to choose the best approach for your model.
- Hugging Face Integration: Seamlessly work with models from the Hugging Face ecosystem, simplifying the compression workflow.
- vLLM Compatibility: Generate checkpoints directly compatible with vLLM, ensuring smooth integration into high-performance inference pipelines.
- Support for Diverse Models: Quantize various model architectures, including Mixture-of-Experts (MoE), Vision-Language, and Audio-Language models, along with support for very large models through advanced techniques.
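The memory claim above can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it assumes roughly 30 billion parameters (suggested by the model name Qwen3-30B-A3B), a 16-bit original checkpoint, and one 16-bit scale per group of 128 quantized weights. It ignores layers left unquantized, activations, and the KV cache, and the helper name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight, group_size=0, scale_bits=0):
    """Approximate weight storage in GiB, with optional per-group scale overhead."""
    total_bits = num_params * bits_per_weight
    if group_size:
        # One scale of `scale_bits` bits for every `group_size` weights (assumption).
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 2**30


NUM_PARAMS = 30e9  # assumption: rough parameter count, not an exact figure

bf16_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=16)
fp8_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=8, group_size=128, scale_bits=16)

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"8-bit weights:  {fp8_gib:.1f} GiB")
print(f"ratio:          {fp8_gib / bf16_gib:.3f}")
```

Under these assumptions the weights shrink to slightly more than half their 16-bit size, with the per-group scales accounting for the small remainder.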
Links
- GitHub Repository: vllm-project/llm-compressor
- Official Documentation: LLM Compressor Docs
- Announcement Blog: LLM Compressor is Here! Faster Inference with vLLM
- vLLM Community Slack: Join vLLM Developers Slack