torchchat: Run PyTorch LLMs Locally on Servers, Desktop, and Mobile
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
torchchat is a PyTorch-native codebase that showcases running large language models (LLMs) across servers, desktops, and mobile devices. It supports local execution from Python, from within C/C++ applications on desktop or server, and directly on iOS and Android devices. Although it is no longer under active development, it remains a useful reference for understanding and implementing local LLM deployment.
Introduction
torchchat is a PyTorch-native codebase that demonstrates how to run large language models (LLMs) locally. It covers several deployment scenarios: Python environments on servers and desktops, C/C++ applications, and mobile platforms such as iOS and Android. This range makes it a useful reference for developers deploying LLMs in different settings.
While torchchat is no longer under active development, it continues to serve as a comprehensive showcase for running LLMs everywhere. Recent updates included support for DeepSeek R1 Distill: 8B and multimodal capabilities for Llama3.2 11B, highlighting its advanced features and broad model compatibility.
Installation
To get started with torchchat, you'll need Python 3.10 installed. It's highly recommended to use a virtual environment to manage dependencies.
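Before creating the virtual environment, it can help to confirm which interpreter `python3` resolves to. The sketch below is only an illustration: it treats 3.10 as a minimum version, which is an assumption, since the text above names Python 3.10 and nothing else. The helper name `python_is_supported` is hypothetical and not part of torchchat.

```python
import sys

# Assumption: 3.10 is treated as a minimum here; the guide only names Python 3.10.
REQUIRED = (3, 10)


def python_is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True when the interpreter's (major, minor) is at least `required`."""
    return tuple(version_info[:2]) >= required


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {found}: {status}")
```

If the check reports an older interpreter, install a matching Python first and create the virtual environment with that interpreter.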
1. Clone the repository, set up a virtual environment, and install the dependencies:
git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh
mkdir exportedModels
2. Log into Hugging Face and download a model:
Most models are distributed via Hugging Face. You'll need an account and a user access token with the write role.
huggingface-cli login
Then, list available models and download one, for example, llama3.1:
python3 torchchat.py list
python3 torchchat.py download llama3.1
Note: Some models may require requesting access via Hugging Face before downloading.
Examples
torchchat provides various commands for interacting with LLMs, from interactive chat to generating text and serving models via a REST API.
Chat
Engage in an interactive conversation with a downloaded LLM:
python3 torchchat.py chat llama3.1
Generate
Generate text based on a specific prompt:
python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy and his bear"
Server
Host a local REST API server for model interaction, following the OpenAI API specification for chat completions. You'll need two terminals: one to start the server and another to query it.
Terminal 1 (Start Server):
python3 torchchat.py server llama3.1
Terminal 2 (Query Server):
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
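The same request can be sent from Python using only the standard library. This is a minimal sketch, not torchchat's own client: the URL and JSON body are copied from the curl example, while the response parsing assumes the server emits OpenAI-style streaming lines of the form `data: {...}` with text under `choices[0].delta.content`. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: host, port, and path are copied from the curl example above.
URL = "http://127.0.0.1:5000/v1/chat/completions"


def build_payload(user_text, model="llama3.1", max_tokens=200):
    """Mirror the JSON body of the curl example, including the string "true"."""
    return {
        "model": model,
        "stream": "true",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }


def iter_deltas(lines):
    """Yield text fragments from OpenAI-style "data: {...}" stream lines.

    Assumption: each chunk carries choices[0].delta.content, as in the
    OpenAI streaming format. Other lines are ignored, and parsing stops
    at a "[DONE]" marker if one is present.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(user_text, url=URL):
    """POST the payload and print fragments as they arrive (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for text in iter_deltas(response):
            print(text, end="", flush=True)
    print()
```

With the server from Terminal 1 running, calling `stream_chat("Hello!")` should print the reply incrementally. If the server's chunk layout differs from the assumed one, only `iter_deltas` needs adjusting.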
Browser
Launch a basic browser interface for local chat, which queries a local server. First, start the server as shown above, then in another terminal:
streamlit run torchchat/usages/browser.py
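Because the browser interface depends on the server already listening, a script that launches both may want to wait for the port first. The sketch below polls a TCP port with the standard library. The default host and port are taken from the curl example (127.0.0.1:5000), and `wait_for_port` is a hypothetical helper, not part of torchchat. An open port only shows that the server process is accepting connections, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=5000, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False at timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    if wait_for_port():
        print("Server is accepting connections.")
    else:
        print("Timed out waiting for the server.")
```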
Desktop/Server Execution with AOT Inductor
For faster inference, you can compile models using AOT Inductor (AOTI), which creates a zipped PT2 file. This can be run in both Python and C++ environments.
Export the model:
python3 torchchat.py export llama3.1 --output-aoti-package-path exportedModels/llama3_1_artifacts.pt2
Run in Python:
python3 torchchat.py generate llama3.1 --aoti-package-path exportedModels/llama3_1_artifacts.pt2 --prompt "Hello my name is"
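Since the export step only needs to run once per model, the two commands above can be scripted so that the export is skipped when the package already exists. This is a sketch under stated assumptions: it uses only the subcommands and flags shown above, expects to be run from the repository root, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def export_cmd(model, package_path):
    """argv for the AOTI export step shown above."""
    return ["python3", "torchchat.py", "export", model,
            "--output-aoti-package-path", str(package_path)]


def generate_cmd(model, package_path, prompt):
    """argv for generating with a previously exported package."""
    return ["python3", "torchchat.py", "generate", model,
            "--aoti-package-path", str(package_path), "--prompt", prompt]


def export_then_generate(model, package_path, prompt, run=subprocess.run):
    """Export only if the package file is missing, then generate.

    `run` is injectable so the flow can be exercised without torchchat installed.
    Assumption: the current working directory is the torchchat checkout.
    """
    package_path = Path(package_path)
    if not package_path.exists():
        run(export_cmd(model, package_path), check=True)
    return run(generate_cmd(model, package_path, prompt), check=True)


if __name__ == "__main__":
    export_then_generate(
        "llama3.1",
        "exportedModels/llama3_1_artifacts.pt2",
        "Hello my name is",
    )
```

Passing argument lists rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.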
Mobile Execution with ExecuTorch
ExecuTorch optimizes models for execution on mobile or embedded devices. After setting up ExecuTorch (refer to the official repository for detailed steps), you can export and run models.
Export for mobile:
python3 torchchat.py export llama3.1 --quantize torchchat/quant_config/mobile.json --output-pte-path llama3.1.pte
This creates a .pte artifact that can be deployed on iOS or Android devices.
Why Use It
torchchat stands out for its commitment to PyTorch's design philosophy, prioritizing usability and native integration. It offers:
- Local LLM Execution: Run powerful language models directly on your hardware, ensuring data privacy and reducing latency.
- Cross-Platform Compatibility: Deploy models on Linux, macOS (M1/M2/M3), Android, and iOS, covering a broad spectrum of devices.
- PyTorch-Native Performance: Leverages PyTorch's capabilities for efficient execution, including eager mode, AOT Inductor, and ExecuTorch for optimized inference.
- Flexibility: Supports multiple data types (float32, float16, bfloat16) and various quantization schemes to balance performance and model size.
- Simplicity and Extensibility: Designed with modular building blocks, favoring composition and clarity, making it easy to understand, use, and extend for custom applications.
- Rich Model Support: Compatible with popular LLMs like Llama 3, Llama 2, Mistral, CodeLlama, and more, including multimodal variants.
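To make the data type and model size trade-off from the list above concrete, the following back-of-the-envelope sketch estimates the memory taken by the weights alone. It assumes a nominal 8-billion-parameter model and an idealized 4-bit scheme with no scale or zero-point overhead; real quantized files are somewhat larger, and activations and the KV cache add to the total at run time.

```python
# Bytes per parameter for the data types named above.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0}


def weight_gib(num_params, bytes_per_param):
    """Approximate size of the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    params = 8_000_000_000  # assumption: a nominal 8B-parameter model
    for dtype, width in BYTES_PER_PARAM.items():
        print(f"{dtype:>9}: {weight_gib(params, width):5.1f} GiB")
    # Assumption: an idealized 4-bit scheme with no per-group metadata.
    print(f"{'4-bit':>9}: {weight_gib(params, 0.5):5.1f} GiB")
```

Under these assumptions the weights take roughly 30 GiB in float32 and about half that in float16 or bfloat16, which is why lower-precision types and quantization matter on desktop and mobile hardware.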
Links
- GitHub Repository: https://github.com/pytorch/torchchat
- Hugging Face Token Documentation: https://huggingface.co/docs/hub/en/security-tokens
- PyTorch AOT Inductor Blog: https://pytorch.org/blog/pytorch2-2/
- ExecuTorch GitHub: https://github.com/pytorch/executorch
- torchchat Discord: https://discord.gg/hm2Keduk3v