LLM Compressor: Optimize LLMs for Deployment with vLLM

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

LLM Compressor is a Transformers-compatible Python library that applies compression algorithms to Large Language Models (LLMs). It targets optimized deployment, especially with vLLM, and provides quantization techniques for weights, activations, and the KV cache. It works directly with Hugging Face models, so existing checkpoints can be compressed without leaving that ecosystem.

Repository Information

Analyzed by OSRepos on July 4, 2026

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

LLM Compressor is a Transformers-compatible Python library developed by the vLLM Project. It applies compression algorithms to Large Language Models (LLMs) so they can be deployed efficiently, particularly with vLLM. The library offers quantization algorithms and transforms for weights, activations, the KV cache, and attention.

Key features include integration with Hugging Face models and repositories, saving models in the compressed-tensors format that vLLM can load, and support for DDP and disk offloading so that very large models can be compressed. For a deeper dive, see the official announcement blog post.

Installation

Getting started with LLM Compressor is straightforward. You can install it using pip:

pip install llmcompressor
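
After installing, it can be useful to confirm that the package is visible to the interpreter you plan to use. The sketch below relies only on the standard library; the helper name `installed_version` is hypothetical and is not part of LLM Compressor.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("llmcompressor")
    print(f"llmcompressor: {found or 'not installed'}")
```

If this reports "not installed", the pip command above most likely ran against a different Python environment.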

Examples

LLM Compressor provides extensive documentation and examples to guide users through the compression process. You can refer to the step-by-step compression guide and User Guides for detailed information.

The quick tour below quantizes a model, Qwen3-30B-A3B in this case, to FP8 weights and activations using the Round-to-Nearest (RTN) algorithm:

Apply Quantization

from compressed_tensors.offload import dispatch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8 using RTN with block_size 128
#   * quantize the activations dynamically to FP8 during inference
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_model(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("===========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-BLOCK"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
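
The recipe above relies on round-to-nearest with one scale per block of weights. As a rough illustration of that idea only, the sketch below quantizes a flat list of numbers block by block. It makes two simplifying assumptions that differ from the recipe: it uses an int8 grid as a stand-in for FP8, and it uses 1-D blocks of 4 values rather than blocks of size 128. The function names are hypothetical and this is not LLM Compressor's implementation.

```python
from typing import List, Tuple


def quantize_block_rtn(
    weights: List[float], block_size: int = 4, qmax: int = 127
) -> Tuple[List[int], List[float]]:
    """Quantize `weights` block by block with round-to-nearest.

    Each block gets its own scale so that the block's largest
    magnitude maps to `qmax` on the integer grid.
    """
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start : start + block_size]
        max_abs = max(abs(w) for w in block)
        # An all-zero block would give a zero scale; fall back to 1.0.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Python's round() rounds to the nearest integer (ties to even).
        quantized.extend(round(w / scale) for w in block)
    return quantized, scales


def dequantize_blocks(
    quantized: List[int], scales: List[float], block_size: int = 4
) -> List[float]:
    """Map integers back to floats using the scale of their block."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


if __name__ == "__main__":
    weights = [0.1, -0.2, 0.4, 0.05, 10.0, -5.0, 2.5, 0.0]
    q, s = quantize_block_rtn(weights)
    restored = dequantize_blocks(q, s)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("scales:", s)
    print("largest reconstruction error:", worst)
```

Because each block has its own scale, the small values in the first block are not crushed by the large values in the second one. That is the main motivation for block-wise schemes over a single per-tensor scale.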

Inference with vLLM

Checkpoints created by llmcompressor can be loaded and run directly in vLLM:

Install vLLM:

pip install vllm

Run inference:

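from vllm import LLM

# Point vLLM at the directory written by the quantization step (SAVE_DIR).
model = LLM("Qwen3-30B-A3B-FP8-BLOCK")
output = model.generate("My name is")
print(output[0].outputs[0].text)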
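
Before handing a directory to vLLM, it can help to check that the saved checkpoint actually carries quantization metadata. The sketch below assumes that the checkpoint's `config.json` contains a `quantization_config` entry with a `quant_method` field, which is how Hugging Face checkpoints commonly record this. The helper name `read_quant_method` is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def read_quant_method(checkpoint_dir: str) -> Optional[str]:
    """Return `quantization_config.quant_method` from config.json, if present."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return None
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Treat a missing or null entry the same way: no quantization metadata.
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")


if __name__ == "__main__":
    # Directory name taken from SAVE_DIR in the quantization example.
    print(read_quant_method("Qwen3-30B-A3B-FP8-BLOCK"))
```

A result of `None` suggests the directory is either missing or holds an uncompressed model.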

The library supports a wide array of quantization types and algorithms, including:

  • Weight and Activation Quantization: Examples for int8, fp8, MXFP8, fp4 (NVFP4, MXFP4), and int4 weights with fp8 activations.
  • Weight Only Quantization: Examples for fp4 (NVFP4, MXFP4), and int4 using GPTQ, AWQ, or AutoRound.
  • Attention and KV Cache Quantization: Examples for fp8 and NVFP4.
  • Architecture-Specific Quantization: Guides for MoE LLMs, Vision-Language Models, and Audio-Language Models.
  • Big Model Quantization Support: Techniques like sequential onloading and disk offloading for very large models.
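
To get a feel for what these bit widths mean in practice, the weight memory of a model can be estimated as parameters × bits ÷ 8. The sketch below is a back-of-the-envelope calculation only: the 30 billion parameter count is an assumed round number, and the estimate ignores quantization scales, unquantized layers, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3


if __name__ == "__main__":
    NUM_PARAMS = 30e9  # assumed round number, for illustration only
    for label, bits in [("16-bit", 16), ("fp8 / int8", 8), ("fp4 / int4", 4)]:
        print(f"{label:>10}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Halving the bit width halves the estimate, which is why 8-bit and 4-bit schemes are attractive for large models on memory-constrained hardware.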

Why Use LLM Compressor?

LLM Compressor offers significant advantages for anyone working with large language models:

  • Optimized Deployment: Achieve faster inference and reduced memory footprint for LLMs, crucial for efficient deployment.
  • Comprehensive Algorithms: Access a rich set of quantization algorithms, including Simple PTQ, GPTQ, AWQ, SmoothQuant, AutoRound, and Rotation-based methods, allowing flexibility to choose the best approach for your model.
  • Hugging Face Integration: Seamlessly work with models from the Hugging Face ecosystem, simplifying the compression workflow.
  • vLLM Compatibility: Generate checkpoints directly compatible with vLLM, ensuring smooth integration into high-performance inference pipelines.
  • Support for Diverse Models: Quantize various model architectures, including Mixture-of-Experts (MoE), Vision-Language, and Audio-Language models, along with support for very large models through advanced techniques.
