audio2photoreal: Synthesizing Photorealistic Codec Avatars from Audio

Summary

audio2photoreal is a GitHub repository from Facebook Research that provides code and a dataset for generating photorealistic Codec Avatars driven solely by audio input. The project synthesizes human embodiment in conversations, both face and body motion, and offers tools for training, testing, and running pretrained models to create lifelike digital representations. It marks a significant step for AI-driven computer graphics and virtual reality.

Introduction

The audio2photoreal repository from Facebook Research provides the PyTorch implementation of "From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations", together with the code and dataset needed to generate lifelike human embodiment in conversational settings. Given only speech audio, the models produce the face and body motion that drives photorealistic Codec Avatars. With training and test code, pretrained models, and a dataset, researchers and developers can build on this work for applications in computer graphics, AI, and virtual reality.

Installation

To get started with audio2photoreal, follow these steps for a quick setup and demo run. You will need CUDA 11.7 and gcc/g++ 9.0 for PyTorch3D compatibility.

First, create a Conda environment, then run the install script, which handles environment configuration and downloads rendering assets, prerequisite models, and pretrained models:

conda create --name a2p_env python=3.9
conda activate a2p_env
sh demo/install.sh
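
The install script handles everything in one pass. As a quick sanity check afterwards (a minimal sketch, not part of the repo's scripts), you can confirm that PyTorch sees the GPU and that PyTorch3D imported cleanly:

# check_env.py -- quick environment sanity check (illustrative; not part of the repo)
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # expect True on a CUDA 11.7 setup

try:
    import pytorch3d
    print("PyTorch3D:", pytorch3d.__version__)
except ImportError:
    print("PyTorch3D not found -- rerun demo/install.sh")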

Once the installation is complete, you can run the interactive demo:

python -m demo.demo

This demo allows you to record audio and then render corresponding photorealistic videos.

Examples

The audio2photoreal project generates photorealistic avatars from audio. The quickest way to try it is the interactive demo: record an audio clip and render a video of a photorealistic avatar speaking and moving in sync with your voice.

For more advanced usage, you can generate face codes and body poses independently using the pretrained models. For instance, to generate face codes for a participant such as PXB184:

python -m sample.generate \
    --model_path checkpoints/diffusion/c1_face/model000155000.pt \
    --num_samples 10 --num_repetitions 5 \
    --timestep_respacing ddim500 --guidance_param 10.0

After generating face codes, you can generate the corresponding body poses; passing the face codes along with the --plot flag renders the combined result as a full photorealistic avatar:

python -m sample.generate \
    --model_path checkpoints/diffusion/c1_pose/model000340000.pt \
    --resume_trans checkpoints/guide/c1_pose/checkpoints/iter-0100000.pt \
    --num_samples 10 --num_repetitions 5 \
    --timestep_respacing ddim500 --guidance_param 2.0 \
    --face_codes ./checkpoints/diffusion/c1_face/samples_c1_face_000155000_seed10_/results.npy \
    --pose_codes ./checkpoints/diffusion/c1_pose/samples_c1_pose_000340000_seed10_guide_iter-0100000.pt/results.npy \
    --plot
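
Each sampling run writes its outputs to a results.npy file, as the paths above show. To inspect the generated codes programmatically, a minimal sketch (assuming the file holds a pickled dictionary of arrays, as np.save with allow_pickle produces) might look like:

# inspect_results.py -- peek at a sampling run's output (illustrative sketch;
# assumes results.npy stores a pickled dict of numpy arrays)
import numpy as np

path = "checkpoints/diffusion/c1_face/samples_c1_face_000155000_seed10_/results.npy"
results = np.load(path, allow_pickle=True).item()  # unwrap the 0-d object array

for key, value in results.items():
    shape = getattr(value, "shape", None)
    print(key, shape if shape is not None else type(value))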

For an immediate hands-on experience without local setup, try the official Colab demo.

Why Use audio2photoreal?

audio2photoreal sits at the forefront of research on photorealistic avatar generation. By providing a complete framework for synthesizing human embodiment from audio, it opens up a range of possibilities:

  • Cutting-edge Research: It offers a solid foundation for researchers in computer vision, graphics, and AI to build upon and advance the state-of-the-art in digital human creation.
  • Realistic Digital Humans: The project's ability to create highly convincing avatars driven by speech has implications for virtual assistants, realistic video conferencing, and immersive virtual reality experiences.
  • Comprehensive Toolkit: With train and test code, pretrained models, and access to a dataset, it provides a complete ecosystem for both experimentation and development.
  • Open-Source Contribution: As a Facebook Research project, it contributes valuable open-source resources to the community, fostering innovation in the field.
