Text Generation Inference: High-Performance LLM Serving by Hugging Face

Summary

Text Generation Inference (TGI) is a robust toolkit from Hugging Face designed for deploying and serving Large Language Models (LLMs) with high performance. It powers Hugging Face's production services, including Hugging Chat and their Inference API. TGI offers optimized text generation, supporting popular open-source LLMs and implementing advanced features for efficient and scalable inference.

Introduction

Text Generation Inference (TGI) is an open-source toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs) efficiently. It is engineered to deliver high-performance text generation for a wide array of popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX. TGI is not just a research project; it is a production-ready system, actively used by Hugging Face to power critical services like Hugging Chat, the Inference API, and Inference Endpoints.

Installation

The easiest way to get started with Text Generation Inference is by using its official Docker container. This method simplifies dependency management and ensures a consistent environment.

To run TGI with a model like HuggingFaceH4/zephyr-7b-beta using Docker and NVIDIA GPUs, execute the following commands:

model=HuggingFaceH4/zephyr-7b-beta   # model ID on the Hugging Face Hub
volume=$PWD/data                     # share a volume with the container to avoid re-downloading weights
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 --model-id $model

For detailed instructions on local installation, including Rust and Python virtual environments, please refer to the official GitHub repository's documentation.
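
Once the server is running, whether via Docker or a local install, it is worth confirming that the model has loaded before sending generation requests. A minimal check, assuming the port mapping used above (route names are listed in the server's OpenAPI documentation, reachable via the /docs route):

# liveness check: succeeds once the server is ready to accept requests
curl 127.0.0.1:8080/health

# model metadata: model ID, dtype, token limits, etc.
curl 127.0.0.1:8080/info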

Examples

Once TGI is running, you can interact with it via its REST API. Here are examples using curl to generate text.

Basic Text Generation (streamed via the /generate_stream route):

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
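
The /generate_stream route above returns tokens incrementally as server-sent events. If you prefer a single JSON response containing the full generated text, the same payload can be posted to the /generate route instead; a minimal sketch, assuming the default port mapping from the installation step:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'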

Using the Messages API (OpenAI Chat Completion compatible):

curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
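
Because "stream" is set to true, the server replies with a stream of OpenAI-style chat completion chunks. Setting "stream" to false should instead return a single chat completion object, which can be easier to inspect when testing; a sketch under the same assumptions as above:

curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"model":"tgi","messages":[{"role":"user","content":"What is deep learning?"}],"stream":false,"max_tokens":20}' \
    -H 'Content-Type: application/json'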

The OpenAPI documentation for the REST API is available via the /docs route or at the Swagger UI link provided in the Links section.

Why Use It?

Text Generation Inference stands out for its comprehensive set of features designed for efficient and scalable LLM deployment:

  • High Performance: TGI uses Tensor Parallelism for faster inference across multiple GPUs, continuous batching for higher throughput, and optimized transformer code with Flash Attention and Paged Attention (a sample multi-GPU launch command follows this list).
  • Production Readiness: It includes distributed tracing with OpenTelemetry and Prometheus metrics, making it suitable for production environments.
  • Broad Model Support: It supports a wide range of popular open-source LLMs and offers simple launchers for easy deployment.
  • Quantization: TGI supports several quantization techniques, including bitsandbytes, GPTQ, AWQ, and FP8, to reduce VRAM requirements and improve inference speed.
  • Flexible API: It provides a simple REST API for text generation and a Messages API compatible with the OpenAI Chat Completion API.
  • Hardware Agnostic: TGI offers support for a variety of hardware, including Nvidia, AMD, Inferentia, Intel GPU, Gaudi, and Google TPU.
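
As a concrete illustration of the tensor parallelism and quantization features above, the launcher exposes flags such as --num-shard and --quantize. The values below are illustrative rather than prescriptive; supported options vary by TGI version and hardware, so check text-generation-launcher --help for your image:

model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data

# shard the model across 2 GPUs and quantize weights on the fly with bitsandbytes NF4
# (AWQ and GPTQ instead expect a checkpoint that was quantized ahead of time)
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 \
    --model-id $model \
    --num-shard 2 \
    --quantize bitsandbytes-nf4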

Links

For more information and to contribute to the project, please visit the official resources: