big_vision: Google Research's Codebase for Large-Scale Vision Models

Summary
big_vision is Google Research's official codebase for training large-scale vision models using JAX/Flax. It has been instrumental in developing prominent architectures such as Vision Transformer, SigLIP, and MLP-Mixer. The repository offers researchers a robust starting point for scalable vision experiments on GPUs and Cloud TPUs, scaling seamlessly from a single device to large distributed setups.
Introduction
big_vision is the official codebase from Google Research, designed for training large-scale vision models. Built on the JAX and Flax libraries, it leverages tf.data and TensorFlow Datasets for scalable and reproducible input pipelines. The repository serves two primary purposes: publishing the code for research projects developed within Google, and providing a robust starting point for researchers to conduct large-scale vision experiments. It scales seamlessly from a single GPU to distributed setups with up to 2048 Cloud TPU cores. Notable architectures developed with big_vision include Vision Transformer (ViT), SigLIP, MLP-Mixer, and LiT. The codebase is dynamic, but its core functionality is maintained; it primarily supports Google's internal research, so external contributions generally require pre-approval.
Installation
To get started with big_vision on a GPU machine, follow these steps. It is highly recommended to use a virtual environment for dependency management.
Clone the Repository:
    git clone https://github.com/google-research/big_vision
    cd big_vision/

Install Python Dependencies:

    pip3 install --upgrade pip
    pip3 install -r big_vision/requirements.txt

Install JAX with CUDA Support:

    pip3 install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Note: You may need to adjust the JAX package based on your specific CUDA and cuDNN versions. Refer to the official JAX documentation for details.
Prepare TFDS Data:
big_vision uses tensorflow_datasets for unified and reproducible access to standard datasets. It's recommended to prepare datasets separately before running experiments. For example, to download and preprocess cifar100, oxford_iiit_pet, and imagenet_v2:

    cd big_vision/
    python3 -m big_vision.tools.download_tfds_datasets cifar100 oxford_iiit_pet imagenet_v2

Some datasets, like imagenet2012, require a manual download of the raw data files into $TFDS_DATA_DIR/downloads/manual/ before running the download_tfds_datasets command.
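Once a dataset is prepared, it can be loaded through the standard tensorflow_datasets API. A quick verification sketch (generic TFDS usage, not big_vision code):

    import tensorflow_datasets as tfds

    # Reads from $TFDS_DATA_DIR if set, otherwise ~/tensorflow_datasets
    ds = tfds.load("cifar100", split="train")
    example = next(iter(ds.take(1)))
    print(example["image"].shape, example["label"])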
Examples
Once installed and data is prepared, you can run training jobs. Here are examples for GPU machines:
Train ViT-S/16 on ImageNet:
    python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/`date '+%m-%d_%H%M'`

Train MLP-Mixer-B/16 (with GPU-specific batch size):

    python3 -m big_vision.train --config big_vision/configs/mlp_mixer_i1k.py:gpu8 --workdir workdirs/`date '+%m-%d_%H%M'`

The repository's README also provides detailed instructions and commands for running experiments on Google Cloud TPU VMs, including multi-host setups and FSDP training.
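The :gpu8 suffix in the Mixer command is an argument string passed through to the config file. big_vision configs are Python files built on ml_collections that expose a get_config() function; below is a minimal hypothetical sketch of that pattern (invented values, not the contents of the real configs):

    # my_config.py -- hypothetical config in the big_vision style
    import ml_collections

    def get_config(arg=None):
        """Returns a training config; arg carries variants such as 'gpu8'."""
        config = ml_collections.ConfigDict()
        config.total_epochs = 90
        config.input = ml_collections.ConfigDict()
        config.input.batch_size = 4096
        if arg == "gpu8":
            # Variant sized to fit a single 8-GPU machine
            config.input.batch_size = 512
        return config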
Why Use big_vision?
big_vision offers several compelling advantages for researchers and developers working with large-scale vision models:
Scalability: Designed for high-performance training, it scales effortlessly from single GPU machines to massive distributed setups on Cloud TPUs, supporting up to 2048 TPU cores.
Research Foundation: It is the foundational codebase for numerous cutting-edge research projects from Google, including Vision Transformer, MLP-Mixer, LiT, and SigLIP, providing battle-tested implementations.
Robustness: Training jobs are robust to interruptions, capable of seamlessly resuming from the last saved checkpoint, ensuring reliability for long-running experiments.
Powerful Configuration System: Features a flexible configuration system that allows for easy customization and extension of training parameters and modules.
Reproducibility: Utilizes tf.data and TensorFlow Datasets to ensure scalable and reproducible input pipelines, which is crucial for scientific research (see the sketch below).
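To illustrate the reproducibility point: with tf.data and TFDS, a fully deterministic input stream can be built from seeded primitives. A generic sketch (not big_vision's actual pipeline code):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Fixed seed and no file-order shuffling -> the same batch stream every run
    ds = tfds.load("cifar100", split="train", shuffle_files=False)
    ds = ds.shuffle(10_000, seed=0)
    ds = ds.batch(256).prefetch(tf.data.AUTOTUNE)

    for batch in ds.take(1):
        print(batch["image"].shape)  # (256, 32, 32, 3)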
Links
GitHub Repository: https://github.com/google-research/big_vision
ViT Baseline Paper: Better plain ViT baselines for ImageNet-1k
JAX Documentation: https://github.com/jax-ml/jax#pip-installation-gpu-cuda
TensorFlow Datasets Catalog: https://www.tensorflow.org/datasets/catalog/overview#all_datasets