{"name":"Text Generation Inference: High-Performance LLM Serving by Hugging Face","description":"Text Generation Inference (TGI) is a robust toolkit from Hugging Face designed for deploying and serving Large Language Models (LLMs) with high performance. It powers Hugging Face's production services, including Hugging Chat and their Inference API. TGI offers optimized text generation, supporting popular open-source LLMs and implementing advanced features for efficient and scalable inference.","github":"https://github.com/huggingface/text-generation-inference","url":"https://osrepos.com/repo/huggingface-text-generation-inference","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/huggingface-text-generation-inference","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/huggingface-text-generation-inference.md","json":"https://osrepos.com/repo/huggingface-text-generation-inference.json","topics":["deep-learning","inference","nlp","pytorch","transformer","Python","LLM","AI"],"keywords":["deep-learning","inference","nlp","pytorch","transformer","Python","LLM","AI"],"stars":null,"summary":"Text Generation Inference (TGI) is a robust toolkit from Hugging Face designed for deploying and serving Large Language Models (LLMs) with high performance. It powers Hugging Face's production services, including Hugging Chat and their Inference API. TGI offers optimized text generation, supporting popular open-source LLMs and implementing advanced features for efficient and scalable inference.","content":"## Introduction\n\nText Generation Inference (TGI) is an open-source toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs) efficiently. 
It provides high-performance text generation for popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX. TGI is not just a research project; it is a production-ready system that Hugging Face uses to power services such as Hugging Chat, the Inference API, and Inference Endpoints.\n\n## Installation\n\nThe easiest way to get started with Text Generation Inference is to use its official Docker container, which simplifies dependency management and ensures a consistent environment.\n\nTo run TGI with a model like `HuggingFaceH4/zephyr-7b-beta` using Docker and NVIDIA GPUs, execute the following commands:\n\n```shell\nmodel=HuggingFaceH4/zephyr-7b-beta\nvolume=$PWD/data\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \\\n    ghcr.io/huggingface/text-generation-inference:3.3.5 --model-id $model\n```\n\nFor detailed instructions on local installation, including Rust and Python virtual environments, refer to the documentation in the official GitHub repository.\n\n## Examples\n\nOnce TGI is running, you can interact with it via its REST API. 
Here are examples using `curl` to generate text.\n\n**Basic Text Generation (streaming endpoint):**\n```bash\ncurl 127.0.0.1:8080/generate_stream \\\n    -X POST \\\n    -d '{\"inputs\":\"What is Deep Learning?\",\"parameters\":{\"max_new_tokens\":20}}' \\\n    -H 'Content-Type: application/json'\n```\n\n**Using the Messages API (OpenAI Chat Completion compatible):**\n```bash\ncurl localhost:8080/v1/chat/completions \\\n    -X POST \\\n    -d '{\n  \"model\": \"tgi\",\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"You are a helpful assistant.\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"What is deep learning?\"\n    }\n  ],\n  \"stream\": true,\n  \"max_tokens\": 20\n}' \\\n    -H 'Content-Type: application/json'\n```\n\nThe OpenAPI documentation for the REST API is available via the `/docs` route or at the Swagger UI link provided in the Links section.\n\n## Why Use It?\n\nText Generation Inference offers a set of features aimed at efficient and scalable LLM deployment:\n\n*   **High Performance:** TGI uses tensor parallelism for faster inference across multiple GPUs, continuous batching for higher throughput, and optimized transformer code with Flash Attention and Paged Attention.\n*   **Production Readiness:** It includes distributed tracing with OpenTelemetry and Prometheus metrics, making it suitable for production environments.\n*   **Broad Model Support:** It supports a wide range of popular open-source LLMs and offers simple launchers for deployment.\n*   **Quantization:** TGI supports several quantization techniques, including bitsandbytes, GPTQ, AWQ, and fp8, which reduce VRAM requirements and can improve inference speed.\n*   **Flexible API:** It provides a simple REST API for text generation and a Messages API compatible with the OpenAI Chat Completion API.\n*   **Hardware Agnostic:** TGI supports a variety of hardware, including NVIDIA, AMD, Inferentia, Intel GPU, Gaudi, and Google TPU.\n\n## 
Links\n\nFor more information and to contribute to the project, please visit the official resources:\n\n*   [GitHub Repository](https://github.com/huggingface/text-generation-inference)\n*   [Swagger API Documentation](https://huggingface.github.io/text-generation-inference)\n*   [Hugging Face TGI Documentation](https://huggingface.co/docs/text-generation-inference)\n*   [LLM inference at scale with TGI (Adyen Blog Post)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)","metrics":{"detailViews":4,"githubClicks":6},"dates":{"published":null,"modified":"2025-11-04T12:01:22.000Z"}}