Repository History
Explore all analyzed open source repositories

big_vision: Google Research's Codebase for Large-Scale Vision Models
big_vision is Google Research's official codebase for training large-scale vision models with JAX/Flax. It is the codebase behind prominent architectures such as Vision Transformer (ViT), SigLIP, and MLP-Mixer, and offers researchers a solid starting point for scalable vision experiments on GPUs and Cloud TPUs, scaling from a single accelerator core to distributed multi-host setups.
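
Training runs in big_vision are driven by config files: each experiment is a Python module exposing a get_config() function that returns an ml_collections.ConfigDict, which is passed to the trainer via python -m big_vision.train --config <config.py> --workdir <dir>. Below is a rough sketch of that pattern; the field names are illustrative rather than the exact schema of any particular trainer.

```python
# Illustrative big_vision-style experiment config; real field names vary by trainer.
import ml_collections


def get_config():
    config = ml_collections.ConfigDict()
    config.seed = 0
    config.total_epochs = 90

    # Input pipeline: dataset and global batch size.
    config.input = dict(
        data=dict(name="imagenet2012", split="train"),
        batch_size=1024,
    )

    # Model selection plus its hyperparameters.
    config.model_name = "vit"
    config.model = dict(variant="S/16")

    # Optimizer settings consumed by the training loop.
    config.optax_name = "scale_by_adam"
    config.lr = 1e-3
    config.wd = 1e-4
    return config
```
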
HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation
HunyuanVideo-Avatar is a project from Tencent-Hunyuan for high-fidelity, audio-driven human animation. Built on a multimodal diffusion transformer, it generates dynamic, emotion-controllable, multi-character dialogue videos, tackling the core challenges of character consistency, emotion alignment, and multi-character animation for applications such as e-commerce and social media.

OmniParser: A Vision-Based Tool for GUI Agent Screen Parsing
OmniParser is a tool from Microsoft for parsing user interface screenshots into structured, machine-readable elements. By detecting interactable regions and captioning their function, it helps vision-based models such as GPT-4V generate actions accurately grounded in specific regions of a GUI, advancing purely vision-based GUI agents.
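
As a conceptual illustration of what "structured elements" means for a downstream agent, the sketch below defines a hypothetical UIElement record of the kind a screen parser produces (bounding box, element type, functional description) and a toy grounding step; none of these names are part of OmniParser's actual API.

```python
# Hypothetical sketch of screen-parsing output; not OmniParser's real API.
from dataclasses import dataclass


@dataclass
class UIElement:
    bbox: tuple        # (x1, y1, x2, y2) pixel coordinates on the screenshot
    kind: str          # e.g. "icon" or "text"
    description: str   # OCR text or a generated functional caption


def click_point(elements: list[UIElement], target: str) -> tuple:
    """Ground an instruction like 'open settings' to a concrete screen
    coordinate by matching it against parsed element descriptions."""
    for el in elements:
        if target.lower() in el.description.lower():
            x1, y1, x2, y2 = el.bbox
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    raise LookupError(f"no parsed element matching {target!r}")
```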

CineScale: Unlocking 4K High-Resolution Cinematic Video Generation
CineScale, from Eyeline-Labs, extends FreeScale to enable high-resolution cinematic video generation. It provides models and inference tools that push diffusion-based video generation up to 4K output, giving researchers and developers a practical framework for producing high-definition video.

StreamDiffusion: Real-Time Interactive Generation with Diffusion Pipelines
StreamDiffusion is a diffusion pipeline designed for real-time interactive generation, substantially raising the throughput of diffusion-based image generation. It is a pipeline-level (rather than model-level) solution for high-speed image-to-image and text-to-image generation, combining techniques such as batched denoising (Stream Batch), residual classifier-free guidance (RCFG), and a stochastic similarity filter to improve computational efficiency and GPU utilization.
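
A rough sketch of driving the pipeline for text-to-image, adapted from the pattern in the project's README; the model checkpoint and t_index_list values here are example settings, not requirements:

```python
import torch
from diffusers import StableDiffusionPipeline
from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

# Wrap an ordinary diffusers pipeline with StreamDiffusion's streaming scheduler.
pipe = StableDiffusionPipeline.from_pretrained("KBlueLeaf/kohaku-v2.1").to(
    device=torch.device("cuda"), dtype=torch.float16
)
stream = StreamDiffusion(pipe, t_index_list=[0, 16, 32, 45], torch_dtype=torch.float16)

# LCM-LoRA keeps the denoising step count low enough for real-time use.
stream.load_lcm_lora()
stream.fuse_lora()

stream.prepare(prompt="a photorealistic portrait of a cat, studio lighting")

# Warm up the stream batch, then pull a finished frame from the pipeline.
for _ in range(4):
    stream()
image = postprocess_image(stream.txt2img(), output_type="pil")[0]
image.save("output.png")
```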

SyncTalk: High-Quality Talking Head Synthesis from CVPR 2024
SyncTalk is the official repository for a CVPR 2024 paper on talking-head synthesis. The method generates tightly synchronized lip movements and facial expressions with stable head poses, and restores hair detail for high-resolution video output, using a tri-plane hash representation to preserve subject identity.

VGGT: Visual Geometry Grounded Transformer for Rapid 3D Scene Reconstruction
VGGT, recipient of the CVPR 2025 Best Paper Award, is a Visual Geometry Grounded Transformer developed by Facebook AI and the Visual Geometry Group at Oxford. The feed-forward network infers the key 3D attributes of a scene, including camera parameters, depth maps, and 3D point tracks, from one or many input images within seconds, providing a fast route to 3D reconstruction and scene understanding.
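
The quick-start pattern from the project README looks roughly like the following (the image paths are placeholders):

```python
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# A few views of the same scene; a single image also works.
images = load_and_preprocess_images(["scene/frame1.png", "scene/frame2.png"]).to(device)

with torch.no_grad():
    # One forward pass predicts camera parameters, depth maps,
    # point maps, and 3D point tracks for the input views.
    predictions = model(images)
```
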
deepface: Lightweight Face Recognition and Facial Attribute Analysis Library
deepface is a lightweight Python library for face recognition and facial attribute analysis, covering age, gender, emotion, and race prediction. It wraps state-of-the-art recognition models behind a single interface, so developers can add robust facial analysis to an application in a few lines of code.
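
The API reflects that: verification and attribute analysis are each a single call (the image paths below are placeholders):

```python
from deepface import DeepFace

# Face verification: do the two photos show the same person?
result = DeepFace.verify(img1_path="alice_1.jpg", img2_path="alice_2.jpg")
print(result["verified"])

# Facial attribute analysis: age, gender, emotion, and race in one call.
analyses = DeepFace.analyze(
    img_path="alice_1.jpg",
    actions=["age", "gender", "emotion", "race"],
)
print(analyses[0]["age"], analyses[0]["dominant_emotion"])
```
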
audio2photoreal: Synthesizing Photorealistic Codec Avatars from Audio
audio2photoreal, from Facebook Research, provides code and a dataset for generating photorealistic Codec Avatars driven solely by audio. The project synthesizes embodied humans in conversation and ships tools for training, testing, and running pretrained models to create lifelike digital representations.

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control
LivePortrait is the official PyTorch implementation of an efficient portrait-animation method that brings still images and videos to life with stitching and retargeting control. It supports both human and animal subjects and offers features such as an image-driven mode, regional control, and precise editing. Adopted by major video platforms, LivePortrait is a practical tool for generating dynamic animated portraits.

Leffa: Controllable Person Image Generation with Flow Fields in Attention
Leffa is a unified framework for controllable person image generation: appearance is controlled via virtual try-on and pose via pose transfer. To counter the fine-grained texture distortion common in such models, it learns flow fields in attention that guide each target query toward the correct reference key, achieving state-of-the-art quality with markedly less detail distortion.
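
To make "flow fields in attention" concrete, here is a toy PyTorch sketch of the intuition: attention weights over reference positions give each target query an expected reference coordinate, which can be read as a flow field and used to warp reference features. This is a conceptual illustration rather than Leffa's implementation; in the paper, the flow-field behavior is encouraged with a regularization loss on the attention maps.

```python
import torch
import torch.nn.functional as F

# Conceptual sketch only, not Leffa's actual code.
def attention_flow(q, k, h, w):
    """q: (B, HW_tgt, C) target queries; k: (B, HW_ref, C) reference keys.
    Returns (B, HW_tgt, 2): each query's attention-weighted average of
    reference pixel coordinates, i.e. a flow field in [-1, 1] coords."""
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(1, h * w, 2).to(q)
    return attn @ coords

B, C, h, w = 1, 64, 16, 16
q = torch.randn(B, h * w, C)            # queries from the target image
k = torch.randn(B, h * w, C)            # keys from the reference image
flow = attention_flow(q, k, h, w)       # (B, HW_tgt, 2)

# Warp reference features to target positions along the flow field.
ref_feats = torch.randn(B, C, h, w)
warped = F.grid_sample(ref_feats, flow.reshape(B, h, w, 2), align_corners=True)
```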