big_vision: Google Research's Codebase for Large-Scale Vision Models

Summary
big_vision is Google Research's official codebase for training large-scale vision models using JAX/Flax. It has been instrumental in developing prominent architectures such as Vision Transformer, SigLIP, and MLP-Mixer. The repository offers researchers a robust starting point for scalable vision experiments on GPUs and Cloud TPUs, scaling seamlessly from a single device to large distributed setups.
Introduction
big_vision is the official codebase from Google Research, designed for training large-scale vision models. Built on the JAX and Flax libraries, it leverages tf.data and TensorFlow Datasets for scalable and reproducible input pipelines. The repository serves two primary purposes: publishing the code for research projects developed within Google, and providing a robust starting point for researchers to conduct large-scale vision experiments. It scales seamlessly from a single GPU to distributed setups with up to 2048 Cloud TPU cores. Notable architectures developed with big_vision include Vision Transformer (ViT), SigLIP, MLP-Mixer, and LiT. The codebase is dynamic, but its core functionality is maintained; it primarily supports Google's internal research, so external contributions generally require pre-approval.
Installation
To get started with big_vision on a GPU machine, follow these steps. It is highly recommended to use a virtual environment for dependency management.
Clone the Repository:
    git clone https://github.com/google-research/big_vision
    cd big_vision/

Install Python Dependencies:

    pip3 install --upgrade pip
    pip3 install -r big_vision/requirements.txt

Install JAX with CUDA Support:

    pip3 install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Note: You may need to adjust the JAX package based on your specific CUDA and cuDNN versions. Refer to the official JAX documentation for details.
Prepare TFDS Data:
big_vision uses tensorflow_datasets for unified and reproducible access to standard datasets. It's recommended to prepare datasets separately before running experiments. For example, to download and preprocess cifar100, oxford_iiit_pet, and imagenet_v2:

    cd big_vision/
    python3 -m big_vision.tools.download_tfds_datasets cifar100 oxford_iiit_pet imagenet_v2

Some datasets, like imagenet2012, require a manual download of the raw data files into $TFDS_DATA_DIR/downloads/manual/ before running the download_tfds_datasets command.
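Once a dataset is prepared, it can be loaded through the standard tensorflow_datasets API. A quick verification sketch (generic TFDS usage, not big_vision code):

    import tensorflow_datasets as tfds

    # Reads from $TFDS_DATA_DIR if set, otherwise ~/tensorflow_datasets
    ds = tfds.load("cifar100", split="train")
    example = next(iter(ds.take(1)))
    print(example["image"].shape, example["label"])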
Examples
Once installed and data is prepared, you can run training jobs. Here are examples for GPU machines:
Train ViT-S/16 on ImageNet:
    python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/`date '+%m-%d_%H%M'`

Train MLP-Mixer-B/16 (with GPU-specific batch size):

    python3 -m big_vision.train --config big_vision/configs/mlp_mixer_i1k.py:gpu8 --workdir workdirs/`date '+%m-%d_%H%M'`

The repository's README also provides detailed instructions and commands for running experiments on Google Cloud TPU VMs, including multi-host setups and FSDP training.
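The :gpu8 suffix in the Mixer command is an argument string passed through to the config file. big_vision configs are Python files built on ml_collections that expose a get_config() function; below is a minimal hypothetical sketch of that pattern (invented values, not the contents of the real configs):

    # my_config.py -- hypothetical config in the big_vision style
    import ml_collections

    def get_config(arg=None):
        """Returns a training config; arg carries variants such as 'gpu8'."""
        config = ml_collections.ConfigDict()
        config.total_epochs = 90
        config.input = ml_collections.ConfigDict()
        config.input.batch_size = 4096
        if arg == "gpu8":
            # Variant sized to fit a single 8-GPU machine
            config.input.batch_size = 512
        return config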
Why Use big_vision?
big_vision offers several compelling advantages for researchers and developers working with large-scale vision models:
Scalability: Designed for high-performance training, it scales effortlessly from single GPU machines to massive distributed setups on Cloud TPUs, supporting up to 2048 TPU cores.
Research Foundation: It is the foundational codebase for numerous cutting-edge research projects from Google, including Vision Transformer, MLP-Mixer, LiT, and SigLIP, providing battle-tested implementations.
Robustness: Training jobs are robust to interruptions, capable of seamlessly resuming from the last saved checkpoint, ensuring reliability for long-running experiments.
Powerful Configuration System: Features a flexible configuration system that allows for easy customization and extension of training parameters and modules.
Reproducibility: Utilizes tf.data and TensorFlow Datasets to ensure scalable and reproducible input pipelines, which is crucial for scientific research (see the sketch below).
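To illustrate the reproducibility point: with tf.data and TFDS, a fully deterministic input stream can be built from seeded primitives. A generic sketch (not big_vision's actual pipeline code):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Fixed seed and no file-order shuffling -> the same batch stream every run
    ds = tfds.load("cifar100", split="train", shuffle_files=False)
    ds = ds.shuffle(10_000, seed=0)
    ds = ds.batch(256).prefetch(tf.data.AUTOTUNE)

    for batch in ds.take(1):
        print(batch["image"].shape)  # (256, 32, 32, 3)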
Links
GitHub Repository: https://github.com/google-research/big_vision
ViT Baseline Paper: Better plain ViT baselines for ImageNet-1k
JAX Documentation: https://github.com/jax-ml/jax#pip-installation-gpu-cuda
TensorFlow Datasets Catalog: https://www.tensorflow.org/datasets/catalog/overview#all_datasets