big_vision: Google Research's Codebase for Large-Scale Vision Models

Summary

big_vision is Google Research's official codebase for training large-scale vision models with JAX/Flax. It has been instrumental in developing prominent architectures such as the Vision Transformer, SigLIP, and MLP-Mixer. The repository offers a robust starting point for scalable vision experiments, running on GPUs and Cloud TPUs and scaling from a single device to large distributed setups.

Introduction

big_vision is the official codebase from Google Research for training large-scale vision models. Built on JAX and Flax, it uses tf.data and TensorFlow Datasets (TFDS) for scalable and reproducible input pipelines. The repository serves two primary purposes: to publish the code of research projects developed within Google, and to provide a robust starting point for researchers running large-scale vision experiments. It scales seamlessly from a single device to distributed setups with up to 2048 Cloud TPU cores. Notable architectures developed with big_vision include the Vision Transformer (ViT), SigLIP, MLP-Mixer, and LiT. While the codebase evolves over time, its core functionality is maintained; it primarily supports Google's internal research, and external contributions generally require pre-approval.
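
To give a flavor of the input pipelines it builds on, the sketch below shows a generic tf.data + TFDS pipeline. It uses plain TensorFlow ops rather than big_vision's own preprocessing operators, and the dataset name and batch size are placeholders:

    # Generic tf.data + TFDS input-pipeline sketch (not big_vision's own preprocessing ops).
    import tensorflow as tf
    import tensorflow_datasets as tfds

    def make_pipeline(split="train", batch_size=256):
        ds = tfds.load("cifar100", split=split, shuffle_files=True)

        def preprocess(example):
            # Convert images to float32 in [0, 1]; real configs apply richer augmentations.
            image = tf.cast(example["image"], tf.float32) / 255.0
            return {"image": image, "label": example["label"]}

        ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
        return ds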

Installation

To get started with big_vision on a GPU machine, follow these steps. It is highly recommended to use a virtual environment for dependency management.
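
For example, a standard virtual environment can be created and activated as follows (the name bv_env is just a placeholder):

    python3 -m venv bv_env
    source bv_env/bin/activate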

  1. Clone the Repository:

    git clone https://github.com/google-research/big_vision
    cd big_vision/
  2. Install Python Dependencies:

    pip3 install --upgrade pip
    pip3 install -r big_vision/requirements.txt
  3. Install JAX with CUDA Support:

    pip3 install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

    Note: You may need to adjust the JAX package for your specific CUDA and cuDNN versions; refer to the official JAX documentation for details. A quick device check is shown after these steps.

  4. Prepare TFDS Data:

    big_vision uses tensorflow_datasets for unified and reproducible access to standard datasets. It's recommended to prepare datasets separately before running experiments. For example, to download and preprocess cifar100, oxford_iiit_pet, and imagenet_v2:

    cd big_vision/
    python3 -m big_vision.tools.download_tfds_datasets cifar100 oxford_iiit_pet imagenet_v2

    Some datasets, like imagenet2012, require manual download of raw data files into $TFDS_DATA_DIR/downloads/manual/ before running the download_tfds_datasets command.
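
As an optional sanity check after installation and data preparation (not part of big_vision itself), you can verify that JAX sees your accelerator and that a prepared dataset is readable:

    # Should list GPU (or TPU) devices, not just CPU.
    python3 -c "import jax; print(jax.devices())"

    # Inspect a dataset prepared above; reads metadata from the TFDS data directory.
    python3 -c "import tensorflow_datasets as tfds; print(tfds.builder('cifar100').info.splits)"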

Examples

Once installed and data is prepared, you can run training jobs. Here are examples for GPU machines:

  • Train ViT-S/16 on ImageNet:

    python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/`date '+%m-%d_%H%M'`
  • Train MLP-Mixer-B/16 (with GPU-specific batch size):

    python3 -m big_vision.train --config big_vision/configs/mlp_mixer_i1k.py:gpu8 --workdir workdirs/`date '+%m-%d_%H%M'`

The repository's README also provides detailed instructions and commands for running experiments on Google Cloud TPU VMs, including multi-host setups and FSDP training.
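
Training configs are Python files passed via --config. Because such flags are commonly backed by ml_collections' config_flags, individual fields can typically also be overridden on the command line; the field name below (total_epochs) is illustrative, so check the config file you are using for the actual names:

    # Hypothetical override of a single config field on the command line.
    python3 -m big_vision.train \
      --config big_vision/configs/vit_s16_i1k.py \
      --config.total_epochs=90 \
      --workdir workdirs/`date '+%m-%d_%H%M'`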

Why Use big_vision?

big_vision offers several compelling advantages for researchers and developers working with large-scale vision models:

  • Scalability: Designed for high-performance training, it scales effortlessly from single GPU machines to massive distributed setups on Cloud TPUs, supporting up to 2048 TPU cores.

  • Research Foundation: It is the foundational codebase for numerous cutting-edge research projects from Google, including Vision Transformer, MLP-Mixer, LiT, and SigLIP, providing battle-tested implementations.

  • Robustness: Training jobs are robust to interruptions, capable of seamlessly resuming from the last saved checkpoint, ensuring reliability for long-running experiments.

  • Powerful Configuration System: Features a flexible configuration system that makes it easy to customize and extend training parameters and modules (see the sketch after this list).

  • Reproducibility: Utilizes tf.data and TensorFlow Datasets to ensure scalable and reproducible input pipelines, crucial for scientific research.
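
For a rough idea of what such a config looks like, here is an illustrative sketch in the ml_collections ConfigDict style used by big_vision configs. The field names (total_epochs, input.batch_size, model.variant, ...) are examples rather than a copy of any shipped config:

    # Illustrative config sketch; not an actual big_vision config file.
    import ml_collections

    def get_config():
        config = ml_collections.ConfigDict()
        config.total_epochs = 90

        # Input pipeline: which TFDS dataset to read and how to batch it.
        config.input = ml_collections.ConfigDict()
        config.input.data = dict(name="imagenet2012", split="train")
        config.input.batch_size = 1024

        # Model and learning-rate settings.
        config.model_name = "vit"
        config.model = ml_collections.ConfigDict()
        config.model.variant = "S/16"
        config.lr = 1e-3
        return config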

Links