vLLM CLI: A Powerful Command-Line Interface for Serving LLMs with vLLM

Summary

vLLM CLI is an intuitive command-line interface tool designed to simplify serving Large Language Models with vLLM. It offers both an interactive terminal mode and direct CLI commands, enabling efficient model management, real-time server monitoring, and advanced configuration. The tool streamlines LLM deployment and management, from quick local experiments to automated, scripted workflows.

Repository Info

Updated on January 20, 2026

Introduction

vLLM CLI is a powerful and intuitive command-line interface tool designed to simplify the process of serving Large Language Models (LLMs) using the vLLM library. It provides a comprehensive suite of features for managing, configuring, and monitoring your LLM inference servers, catering to both interactive use and automated scripting.

Key features include a rich interactive terminal mode, direct CLI commands for automation, automatic discovery of local models with HuggingFace and Ollama support, and flexible configuration profiles. Recent updates have introduced an experimental Multi-Model Proxy server for unified API access to multiple LLMs, hardware-optimized profiles for GPT-OSS models on NVIDIA GPUs, and a convenient shortcuts system for quick launches.
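
For example, a configuration profile can be applied directly when launching a model. The --profile flag and the low_memory profile name below are illustrative assumptions; check your installed version (for instance via vllm-cli serve --help) for the exact flags and built-in profile names available.

# Serve a model with a named configuration profile (flag and profile name are assumptions)
vllm-cli serve --model openai/gpt-oss-20b --profile low_memory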

Installation

Important: vLLM Installation Notes
vLLM ships pre-compiled CUDA kernels that must match your installed PyTorch version exactly; mismatched versions lead to import or runtime errors. For this reason, vLLM CLI does not install vLLM or PyTorch by default.

Option 1: Install vLLM separately and then install vLLM CLI (Recommended)

# Install vLLM -- Skip this step if you have vLLM installed in your environment
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
# Or specify a backend: uv pip install vllm --torch-backend=cu128

# Install vLLM CLI
uv pip install --upgrade vllm-cli
uv run vllm-cli

# If you are using conda:
# Activate the environment you have vLLM installed in
pip install vllm-cli
vllm-cli

Option 2: Install vLLM CLI + vLLM

pip install vllm-cli[vllm]
vllm-cli
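
Whichever option you use, it is worth confirming afterwards that vLLM imports cleanly against the installed PyTorch build. The check below relies only on standard version attributes of both packages:

# Verify that the installed vLLM and PyTorch builds load together
python -c "import torch, vllm; print('torch', torch.__version__, 'cuda', torch.version.cuda); print('vllm', vllm.__version__)"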

Prerequisites:

  • Python 3.9+
  • CUDA-compatible GPU (recommended; see the quick check after this list)
  • vLLM package installed
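
If you are unsure whether a CUDA-capable GPU is actually visible to PyTorch, the following quick check can save a failed launch; it uses only PyTorch's standard CUDA helpers:

# Confirm PyTorch can see at least one CUDA device before serving
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"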

Examples

Interactive Mode
Launch the menu-driven interface for easy navigation and management.

vllm-cli

Serve a Model
Quickly serve a specific model using a direct command.

vllm-cli serve --model openai/gpt-oss-20b

Use a Shortcut
Launch pre-configured model and profile combinations with a simple shortcut.

vllm-cli serve --shortcut my-model
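
Check Server Status
View servers started through the CLI while they run. The status subcommand shown here reflects the tool's server-monitoring features, but treat the exact name as an assumption and run vllm-cli --help to confirm the commands available in your version.

vllm-cli status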

Why Use vLLM CLI?

vLLM CLI streamlines the often complex process of deploying and managing LLMs with vLLM. It offers a user-friendly interactive terminal for configuration and monitoring, alongside direct command-line options for automation. With automatic model discovery, real-time server monitoring, and optimized configuration profiles, it lets you serve models from HuggingFace and Ollama efficiently while making good use of available hardware.
