Picotron: Minimalistic 4D-Parallelism Framework for LLM Training Education

Summary

Picotron is a minimalistic and hackable distributed training framework designed for educational purposes. Inspired by NanoGPT, it focuses on pre-training Llama-like models using 4D Parallelism, making complex concepts accessible. Its simple and readable codebase, with core files under 300 lines, provides an excellent tool for learning and experimentation in distributed machine learning.

Introduction

Picotron, inspired by NanoGPT, is a minimalistic and highly hackable repository for pre-training Llama-like models. It implements 4D Parallelism (Data, Tensor, Pipeline, and Context parallelism) and is designed specifically for educational purposes, making it an excellent tool for learning and experimentation in distributed training. The codebase is simple and readable, with train.py, model.py, and the parallelism files all under 300 lines of code. While performance optimization is still under active development, the framework has already shown promising MFU (Model FLOPs Utilization) on LLaMA-2-7B and SmolLM-1.7B models.
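
To make the "4D" idea concrete: the total number of processes is factored into a dp × tp × pp × cp grid, and each rank is assigned one coordinate along every dimension. The sketch below illustrates this factorization; it is a toy example with invented names and an arbitrary dimension ordering, not Picotron's actual implementation.

# Illustrative sketch of a 4D process grid (not Picotron's actual code).
# Each rank gets one coordinate per parallel dimension: data (dp),
# tensor (tp), pipeline (pp), and context (cp).
def grid_coords(rank: int, dp: int, tp: int, pp: int, cp: int):
    assert 0 <= rank < dp * tp * pp * cp
    tp_rank = rank % tp
    cp_rank = (rank // tp) % cp
    pp_rank = (rank // (tp * cp)) % pp
    dp_rank = rank // (tp * cp * pp)
    return {"dp": dp_rank, "pp": pp_rank, "cp": cp_rank, "tp": tp_rank}

# Example: 16 GPUs factored as dp=4, tp=2, pp=2, cp=1.
for r in range(16):
    print(r, grid_coords(r, dp=4, tp=2, pp=2, cp=1))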

Installation

To get started with Picotron, clone the repository and install it in editable mode:

git clone https://github.com/huggingface/picotron.git
cd picotron
pip install -e .
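
A quick way to verify the environment is a short Python check; this assumes the editable install exposes the package as picotron, which the repository layout suggests:

# Post-install sanity check (assumes the package is importable as `picotron`).
import picotron  # noqa: F401 -- fails loudly if the install is broken
import torch
import torch.distributed as dist

print("torch:", torch.__version__)
print("distributed available:", dist.is_available())
print("CUDA available:", torch.cuda.is_available())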

Examples

Picotron provides quick start examples for both GPU and CPU environments, demonstrating how to configure and run training with different parallelism strategies.

First, obtain a Hugging Face token from your Hugging Face settings (https://huggingface.co/settings/tokens); it is needed to download model weights and tokenizers.
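
Rather than pasting the token into every command, you can also authenticate once through the huggingface_hub library, which handles downloads from the Hub; a minimal sketch:

# Log in once so Hub downloads work from the cached credential.
from huggingface_hub import login

login(token="hf_...")  # paste your token from https://huggingface.co/settings/tokens

Whether Picotron picks up the cached credential in place of --hf_token is an assumption here; the flag shown in the examples below always works.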

GPU Training

Create a configuration file and run training locally. The first example below uses plain Data Parallelism; the second combines Data, Tensor, and Pipeline Parallelism (3D Parallelism) and is submitted to Slurm:

# Create a config file (e.g., for Llama-1B with Data Parallelism)
python create_config.py --out_dir tmp --exp_name llama-1B --dp 8 --model_name HuggingFaceTB/SmolLM-1.7B --num_hidden_layers 15 --grad_acc_steps 32 --mbs 4 --seq_len 1024 --hf_token <HF_TOKEN>

# Run locally
torchrun --nproc_per_node 8 train.py --config tmp/llama-1B/config.json
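
With these flags, the effective global batch size follows the usual gbs = dp × mbs × grad_acc_steps accounting; a quick back-of-the-envelope check of what one optimizer step processes:

# Batch math for the DP-8 config above (standard gbs accounting).
dp, mbs, grad_acc_steps, seq_len = 8, 4, 32, 1024

global_batch = dp * mbs * grad_acc_steps   # 1024 sequences per optimizer step
tokens_per_step = global_batch * seq_len   # 1,048,576 tokens (~1M)

print(global_batch, tokens_per_step)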

# Example for 3D Parallelism (Data, Tensor, Pipeline)
python create_config.py --out_dir tmp --dp 4 --tp 2 --pp 2 --pp_engine 1f1b --exp_name llama-7B --model_name meta-llama/Llama-2-7b-hf --grad_acc_steps 32 --mbs 4 --seq_len 1024 --hf_token <HF_TOKEN>

# Submit to Slurm
python submit_slurm_jobs.py --inp_dir tmp/llama-7B --qos high --hf_token <HF_TOKEN>
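
Note that the parallelism degrees must multiply to the number of launched processes: the 3D example requests dp × tp × pp = 4 × 2 × 2 = 16 GPUs, i.e. two 8-GPU nodes once submitted to Slurm. A tiny helper (invented here for illustration) makes the constraint explicit:

# Check that the requested parallelism matches the hardware
# (illustrative helper, not part of Picotron's CLI).
def required_gpus(dp: int, tp: int, pp: int, cp: int = 1) -> int:
    return dp * tp * pp * cp

world_size = required_gpus(dp=4, tp=2, pp=2)  # 16 processes
print(world_size // 8, "nodes of 8 GPUs")     # 2 nodes of 8 GPUs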

CPU Training (Expect slower performance)

You can also run Picotron on a CPU for experimentation, though it will be significantly slower:

# Create a config file for CPU training with 3D Parallelism
python create_config.py --out_dir tmp --exp_name llama-1B-cpu --dp 2 --tp 2 --pp 2 --pp_engine 1f1b --model_name HuggingFaceTB/SmolLM-1.7B --num_hidden_layers 5 --grad_acc_steps 2 --mbs 4 --seq_len 128 --hf_token <HF_TOKEN> --use_cpu

# Run locally on CPU
torchrun --nproc_per_node 8 train.py --config tmp/llama-1B-cpu/config.json
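
On CPU there is no NCCL, so the collectives presumably fall back to PyTorch's gloo backend; the sketch below shows how such a backend switch is typically wired under torchrun (illustrative, Picotron's actual --use_cpu handling may differ):

# Sketch of backend selection for CPU vs. GPU runs under torchrun
# (illustrative; not Picotron's actual --use_cpu code path).
import os
import torch
import torch.distributed as dist

def init_distributed(use_cpu: bool) -> None:
    backend = "gloo" if use_cpu or not torch.cuda.is_available() else "nccl"
    dist.init_process_group(backend=backend)  # torchrun supplies rank/world size via env vars
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))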

Why Use Picotron

Picotron stands out as an exceptional resource for anyone looking to understand the intricacies of distributed training for large language models. Its primary focus on education, combined with a minimalist and hackable design, allows users to quickly grasp complex concepts like 4D Parallelism without being overwhelmed by excessive code. Unlike more production-oriented frameworks, Picotron prioritizes clarity and learning, making it an ideal starting point for researchers and students to experiment and build their own distributed training setups from scratch.

Links

  • GitHub Repository: huggingface/picotron
  • Picotron Tutorial (Playlist): YouTube
  • Picotron Tutorial (Codebase): GitHub
  • Citation:
    @misc{zhao2025picotron,
      author = {Haojun Zhao and Ferdinand Mom},
      title = {Picotron: Distributed training framework for education and research experimentation},
      year = {2025},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/huggingface/picotron}}
    }