{"name":"LLM Compressor: Optimize LLMs for Deployment with vLLM","description":"LLM Compressor is a Transformers-compatible Python library designed to apply various compression algorithms to Large Language Models (LLMs). It enables optimized deployment, especially with vLLM, by offering a comprehensive set of quantization techniques for weights, activations, and KV Cache. This tool seamlessly integrates with Hugging Face models, making LLM optimization accessible and efficient.","github":"https://github.com/vllm-project/llm-compressor","url":"https://osrepos.com/repo/vllm-project-llm-compressor","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/vllm-project-llm-compressor","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/vllm-project-llm-compressor.md","json":"https://osrepos.com/repo/vllm-project-llm-compressor.json","topics":["compression","quantization","Python","LLM","AI","Machine Learning","Deep Learning","vLLM"],"keywords":["compression","quantization","Python","LLM","AI","Machine Learning","Deep Learning","vLLM"],"stars":null,"summary":"LLM Compressor is a Transformers-compatible Python library designed to apply various compression algorithms to Large Language Models (LLMs). It enables optimized deployment, especially with vLLM, by offering a comprehensive set of quantization techniques for weights, activations, and KV Cache. This tool seamlessly integrates with Hugging Face models, making LLM optimization accessible and efficient.","content":"## Introduction\n\nLLM Compressor is a powerful, Transformers-compatible Python library developed by the vLLM Project. It is designed to apply various compression algorithms to Large Language Models (LLMs), enabling their optimized deployment, particularly with vLLM. 
This library offers a comprehensive suite of quantization algorithms and transforms for weights, activations, KV Cache, and attention mechanisms.\n\nKey features include seamless integration with Hugging Face models and repositories, saving models in the `compressed-tensors` format compatible with vLLM, and robust support for DDP and disk offloading to compress very large models efficiently. For a deeper dive, read the official announcement blog [here](https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/).\n\n## Installation\n\nGetting started with LLM Compressor is straightforward. You can install it using pip:\n\n```bash\npip install llmcompressor\n```\n\n## Examples\n\nLLM Compressor provides extensive documentation and examples to guide users through the compression process. You can refer to the [step-by-step compression guide](https://docs.vllm.ai/projects/llm-compressor/en/latest/steps/choosing-model/) and [User Guides](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/entrypoints/) for detailed information.\n\nHere's a quick tour demonstrating how to quantize a model, for instance, `Qwen3-30B-A3B`, with FP8 weights and activations using the `Round-to-Nearest` algorithm:\n\n### Apply Quantization\n\n```python\nfrom compressed_tensors.offload import dispatch_model\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nfrom llmcompressor import oneshot\nfrom llmcompressor.modifiers.quantization import QuantizationModifier\n\nMODEL_ID = \"Qwen/Qwen3-30B-A3B\"\n\n# Load model.\nmodel = AutoModelForCausalLM.from_pretrained(MODEL_ID)\ntokenizer = AutoTokenizer.from_pretrained(MODEL_ID)\n\n# Configure the quantization algorithm and scheme.\n# In this case, we:\n#   * quantize the weights to FP8 using RTN with block_size 128\n#   * quantize the activations dynamically to FP8 during inference\nrecipe = QuantizationModifier(\n    targets=\"Linear\",\n    scheme=\"FP8_BLOCK\",\n    ignore=[\"lm_head\", \"re:.*mlp.gate$\"],\n)\n\n# 
Apply quantization.\noneshot(model=model, recipe=recipe)\n\n# Confirm generations of the quantized model look sane.\nprint(\"========== SAMPLE GENERATION ==============\")\ndispatch_model(model)\ninput_ids = tokenizer(\"Hello my name is\", return_tensors=\"pt\").input_ids.to(\n    model.device\n)\noutput = model.generate(input_ids, max_new_tokens=20)\nprint(tokenizer.decode(output[0]))\nprint(\"===========================================\")\n\n# Save to disk in compressed-tensors format.\nSAVE_DIR = MODEL_ID.split(\"/\")[1] + \"-FP8-BLOCK\"\nmodel.save_pretrained(SAVE_DIR)\ntokenizer.save_pretrained(SAVE_DIR)\n```\n\n### Inference with vLLM\n\nCheckpoints created by `llmcompressor` can be seamlessly loaded and run in `vLLM`:\n\nInstall `vLLM`:\n\n```bash\npip install vllm\n```\n\nRun inference:\n\n```python\nfrom vllm import LLM\n\n# Load the checkpoint directory saved in the previous step (SAVE_DIR).\nmodel = LLM(\"Qwen3-30B-A3B-FP8-BLOCK\")\noutput = model.generate(\"My name is\")\n```\n\nThe library supports a wide array of quantization types and algorithms, including:\n*   **Weight and Activation Quantization**: Examples for `int8`, `fp8`, `MXFP8`, `fp4` (NVFP4, MXFP4), and `fp8` with `int4` weights.\n*   **Weight Only Quantization**: Examples for `fp4` (NVFP4, MXFP4), and `int4` using GPTQ, AWQ, or AutoRound.\n*   **Attention and KV Cache Quantization**: Examples for `fp8` and `NVFP4`.\n*   **Architecture-Specific Quantization**: Guides for MoE LLMs, Vision-Language Models, and Audio-Language Models.\n*   **Big Model Quantization Support**: Techniques like sequential onloading and disk offloading for very large models.\n\n## Why Use LLM Compressor?\n\nLLM Compressor offers significant advantages for anyone working with large language models:\n*   **Optimized Deployment**: Achieve faster inference and reduced memory footprint for LLMs, crucial for efficient deployment.\n*   **Comprehensive Algorithms**: Access a rich set of quantization algorithms, including Simple PTQ, GPTQ, AWQ, SmoothQuant, AutoRound, and Rotation-based methods, allowing 
flexibility to choose the best approach for your model.\n*   **Hugging Face Integration**: Seamlessly work with models from the Hugging Face ecosystem, simplifying the compression workflow.\n*   **vLLM Compatibility**: Generate checkpoints directly compatible with vLLM, ensuring smooth integration into high-performance inference pipelines.\n*   **Support for Diverse Models**: Quantize various model architectures, including Mixture-of-Experts (MoE), Vision-Language, and Audio-Language models, along with support for very large models through advanced techniques.\n\n## Links\n\n*   **GitHub Repository**: [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)\n*   **Official Documentation**: [LLM Compressor Docs](https://docs.vllm.ai/projects/llm-compressor/en/latest/)\n*   **Announcement Blog**: [LLM Compressor is Here! Faster Inference with vLLM](https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/)\n*   **vLLM Community Slack**: [Join vLLM Developers Slack](https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack)","metrics":{"detailViews":1,"githubClicks":1},"dates":{"published":null,"modified":"2026-07-04T08:27:58.000Z"}}