VGGT: Visual Geometry Grounded Transformer for Rapid 3D Scene Reconstruction

Introduction

VGGT, the Visual Geometry Grounded Transformer, is a feed-forward neural network developed by Meta AI and the Visual Geometry Group at Oxford. Winner of the CVPR 2025 Best Paper Award, VGGT directly infers the key 3D attributes of a scene, including extrinsic and intrinsic camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of views, all within seconds. The project recently updated its licensing to permit commercial use for a specific checkpoint, VGGT-1B-Commercial, which opens it up to a wider range of applications.

Installation

To get started with VGGT, clone the repository and install the necessary dependencies. Ensure you have torch, torchvision, numpy, Pillow, and huggingface_hub installed.

git clone https://github.com/facebookresearch/vggt.git
cd vggt
pip install -r requirements.txt

Alternatively, VGGT can be installed as a Python package. Refer to the official documentation for detailed instructions on package installation.
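
Before running the examples, it can help to confirm that the dependencies listed above are importable in the active environment. The following is a minimal sketch using only the Python standard library; the helper name missing_packages is hypothetical and not part of VGGT. Note that Pillow is imported under the name PIL.

```python
import importlib.util

# Import names for the packages listed above; Pillow is imported as "PIL".
REQUIRED = ["torch", "torchvision", "numpy", "PIL", "huggingface_hub"]


def missing_packages(names):
    """Return the top-level import names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If the script reports missing packages, installing them with pip (or re-running the requirements install above) should resolve it.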

Examples

Using VGGT is straightforward. The model automatically downloads pretrained weights upon its first run. Here’s a quick example to predict 3D attributes from a set of images:

import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

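device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 requires compute capability >= 8.0; only query the GPU when one is present.
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.get_device_capability()[0] >= 8 else torch.float16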

# Initialize the model and load the pretrained weights.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Load and preprocess example images (replace with your own image paths)
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]  
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    with torch.autocast(device_type=device, dtype=dtype):
        # Predict attributes including cameras, depth maps, and point maps.
        predictions = model(images)
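
The example above lists image paths by hand. For a folder of views, the list can be built programmatically. This is a minimal standard-library sketch; collect_image_paths and the set of accepted suffixes are assumptions made for illustration and are not part of the VGGT API.

```python
from pathlib import Path

# File extensions treated as images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_image_paths(folder):
    """Return sorted image file paths (as strings) found directly in folder."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

The result can be used in place of the hand-written list, for example image_names = collect_image_paths("path/to/scene"). Sorting keeps the view order deterministic between runs.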

VGGT also offers detailed usage for specific branches (camera, depth, point, track heads), interactive 3D visualization demos (Gradio and Viser), and the ability to export predictions to COLMAP format, which can then be used for Gaussian Splatting training. It also demonstrates surprisingly strong zero-shot single-view reconstruction capabilities.
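
To make the relationship between the camera, depth, and point outputs concrete, the sketch below unprojects a depth map into camera-frame 3D points with the standard pinhole model. It is a generic illustration, not VGGT's own code: the function name unproject_depth and the parameters fx, fy, cx, cy (focal lengths and principal point in pixels) are assumptions, and the depth map is represented as plain nested lists to keep the example dependency-free.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Convert a depth map (rows of z-values) to camera-frame 3D points.

    Under the pinhole model, pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = z.
    """
    points = []
    for v, row in enumerate(depth):
        points.append([
            ((u - cx) * z / fx, (v - cy) * z / fy, z)
            for u, z in enumerate(row)
        ])
    return points


# Tiny 2x2 depth map with hypothetical intrinsics.
pts = unproject_depth([[2.0, 2.0], [4.0, 4.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

Applying the camera extrinsics to these points would then place them in a shared world frame, which is the kind of quantity a point map represents.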

Why Use VGGT

VGGT stands out for several compelling reasons:

Rapid 3D Reconstruction: It infers 3D scene attributes within seconds in a feed-forward pass, significantly faster than many traditional methods.
Comprehensive Output: The model provides a full suite of 3D data, including camera parameters, depth maps, point maps, and 3D point tracks.
Versatile Input: It effectively processes scenes from a single image, a few images, or hundreds of views.
Award-Winning Research: VGGT received the CVPR 2025 Best Paper Award.
Seamless Integration: Supports direct export to COLMAP format, enabling easy integration with other 3D reconstruction and rendering pipelines like Gaussian Splatting.
Strong Zero-Shot Performance: Achieves competitive results in single-view reconstruction without explicit training for this task.
Commercial Use License: A dedicated checkpoint, VGGT-1B-Commercial, is available under a commercial-friendly license, expanding its applicability for businesses and developers.
