torchchat: Run PyTorch LLMs Locally on Servers, Desktop, and Mobile

Introduction

torchchat is a PyTorch-native codebase that demonstrates how to run large language models (LLMs) locally. It covers three deployment paths: Python environments on servers and desktops, integration into C/C++ applications, and on-device execution on iOS and Android. This makes it a useful reference for developers who need to deploy the same model across different targets.

torchchat is no longer under active development, but it remains a working showcase for running LLMs across these targets. Its most recent updates added support for DeepSeek R1 Distill 8B and multimodal support for Llama3.2 11B.

Installation

To get started with torchchat, you'll need Python 3.10 installed. It's highly recommended to use a virtual environment to manage dependencies.
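
Before creating the virtual environment, a quick check like the sketch below can confirm that the interpreter is recent enough. The helper name `meets_minimum` is hypothetical, and treating 3.10 as a lower bound rather than an exact pin is an assumption.

```python
import sys

# Assumption: 3.10 is treated as a minimum version, not an exact pin.
MIN_VERSION = (3, 10)


def meets_minimum(version=None, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum():
        print(f"Python {found} is new enough for this guide.")
    else:
        print(f"Python {found} found; this guide expects 3.10 or newer.")
```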

1. Clone the repository and set up a virtual environment:

git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh
mkdir exportedModels

2. Log into Hugging Face and download a model:

Most models are distributed via Hugging Face. You'll need an account and a user access token with the write role.

huggingface-cli login

Then, list available models and download one, for example, llama3.1:

python3 torchchat.py list
python3 torchchat.py download llama3.1

Note: Some models may require requesting access via Hugging Face before downloading.

Examples

torchchat provides various commands for interacting with LLMs, from interactive chat to generating text and serving models via a REST API.

Chat

Engage in an interactive conversation with a downloaded LLM:

python3 torchchat.py chat llama3.1

Generate

Generate text based on a specific prompt:

python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy and his bear"

Server

Host a local REST API server for model interaction, following the OpenAI API specification for chat completions. You'll need two terminals: one to start the server and another to query it.

Terminal 1 (Start Server):

python3 torchchat.py server llama3.1

Terminal 2 (Query Server):

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
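
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server from Terminal 1 is listening on 127.0.0.1:5000. It also assumes that streamed responses arrive as OpenAI-style server-sent events (`data: {...}` lines, ending with `data: [DONE]`); that format is inferred from the OpenAI specification and is not confirmed here. The helper names are hypothetical.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
URL = "http://127.0.0.1:5000/v1/chat/completions"


def build_request(user_message, model="llama3.1", max_tokens=200, url=URL):
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "model": model,
        "stream": "true",  # kept as a string to mirror the curl example
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_stream_line(raw_line):
    """Extract the text delta from one streamed line, or None if there is none.

    Assumption: each chunk looks like `data: {"choices": [{"delta": {...}}]}`.
    """
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if not body or body == "[DONE]":
        return None
    chunk = json.loads(body)
    choices = chunk.get("choices") or [{}]
    delta = choices[0].get("delta") or {}
    return delta.get("content")


def stream_chat(user_message):
    """Send the request and print text as it arrives (needs a running server)."""
    with urllib.request.urlopen(build_request(user_message)) as response:
        for raw_line in response:
            text = parse_stream_line(raw_line)
            if text:
                print(text, end="", flush=True)
    print()
```

With the server running, calling `stream_chat("Hello!")` should print the reply incrementally. Keeping request construction and line parsing in separate functions lets both be checked without a live server.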

Browser

Launch a basic browser interface for local chat, which queries a local server. First, start the server as shown above, then in another terminal:

streamlit run torchchat/usages/browser.py

Desktop/Server Execution with AOT Inductor

For faster inference, you can compile a model ahead of time with AOT Inductor (AOTI). The export produces a zipped PT2 package that can be loaded from both Python and C++ environments.

Export the model:

python3 torchchat.py export llama3.1 --output-aoti-package-path exportedModels/llama3_1_artifacts.pt2

Run in Python:

python3 torchchat.py generate llama3.1 --aoti-package-path exportedModels/llama3_1_artifacts.pt2 --prompt "Hello my name is"
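
To check whether the compiled package is actually faster on a given machine, the eager and AOTI runs can be timed from a small script. This is a rough sketch: it only assembles the CLI invocations shown above and measures them end to end, so the timings include process start-up and model loading, not just token generation. The helper names are hypothetical, and the script is assumed to run from the repository root after the export step has completed.

```python
import subprocess
import sys
import time

PACKAGE = "exportedModels/llama3_1_artifacts.pt2"  # path from the export step


def generate_command(model, prompt, aoti_package=None):
    """Return the argv list for `torchchat.py generate`, eager or AOTI."""
    command = [sys.executable, "torchchat.py", "generate", model]
    if aoti_package is not None:
        command += ["--aoti-package-path", aoti_package]
    command += ["--prompt", prompt]
    return command


def timed_run(command):
    """Run a command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def compare(model="llama3.1", prompt="Hello my name is"):
    """Time eager and AOTI generation (run from the torchchat repo root)."""
    eager = timed_run(generate_command(model, prompt))
    compiled = timed_run(generate_command(model, prompt, PACKAGE))
    print(f"eager: {eager:.1f}s  aoti: {compiled:.1f}s")
```

Using `sys.executable` instead of a hard-coded `python3` keeps the subprocess inside the active virtual environment.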

Mobile Execution with ExecuTorch

ExecuTorch optimizes models for execution on mobile or embedded devices. After setting up ExecuTorch (refer to the official repository for detailed steps), you can export and run models.

Export for mobile:

python3 torchchat.py export llama3.1 --quantize torchchat/quant_config/mobile.json --output-pte-path llama3.1.pte

This creates a .pte artifact that can be deployed on iOS or Android devices.

Why Use It

torchchat follows PyTorch's design philosophy, prioritizing usability and native integration. It offers:

Local LLM Execution: Run language models directly on your own hardware, so prompts and outputs stay on the device and no network round trip is needed.
Cross-Platform Compatibility: Deploy models on Linux, macOS (M1/M2/M3), Android, and iOS, covering a broad spectrum of devices.
PyTorch-Native Performance: Leverages PyTorch's capabilities for efficient execution, including eager mode, AOT Inductor, and ExecuTorch for optimized inference.
Flexibility: Supports multiple data types (float32, float16, bfloat16) and various quantization schemes to balance performance and model size.
Simplicity and Extensibility: Designed with modular building blocks, favoring composition and clarity, making it easy to understand, use, and extend for custom applications.
Rich Model Support: Compatible with popular LLMs like Llama 3, Llama 2, Mistral, CodeLlama, and more, including multimodal variants.
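
The trade-off between data type and model size mentioned under Flexibility can be estimated with simple arithmetic: weight storage is roughly the parameter count times the bytes per parameter. The sketch below assumes an 8-billion-parameter model as an illustrative figure and ignores activations, the KV cache, and quantization metadata such as scales. The 4-bit row is a hypothetical example of a quantization scheme, not a statement about which schemes torchchat provides.

```python
# Bits used to store one weight for each data type.
BITS_PER_PARAM = {
    "float32": 32,
    "float16": 16,
    "bfloat16": 16,
    "4-bit (illustrative)": 4,
}


def weight_gigabytes(num_params, dtype):
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return num_params * BITS_PER_PARAM[dtype] / 8 / 1e9


if __name__ == "__main__":
    params = 8_000_000_000  # assumed model size for illustration
    for dtype in BITS_PER_PARAM:
        print(f"{dtype:>22}: {weight_gigabytes(params, dtype):5.1f} GB")
```

Under these assumptions, moving from float32 to a 16-bit type halves the weight footprint, and 4-bit weights reduce it to one eighth.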
