Repository History
16 repositories tagged with Computer Vision

PyTorch Image Models (timm): The Ultimate Collection of Image Encoders
PyTorch Image Models (timm) is a library offering the largest collection of PyTorch image encoders and backbones. It provides a wide range of state-of-the-art models with pretrained weights, along with training, evaluation, and inference scripts, making it a practical resource for researchers and developers working on computer vision tasks in PyTorch.

CoTracker: A Powerful Model for Tracking Any Point in a Video
CoTracker is a model developed by Facebook AI Research and the University of Oxford for tracking any point (pixel) across video sequences. This transformer-based approach offers fast, accurate, quasi-dense point tracking, enabling precise analysis of motion in videos.

WiFi-3D-Fusion: Real-Time 3D Human Pose Estimation from WiFi Signals
WiFi-3D-Fusion is an open-source research project that uses WiFi CSI (channel state information) signals and deep learning to estimate 3D human pose. It fuses wireless sensing with computer vision techniques and offers real-time motion detection and visualization, demonstrating an approach to understanding human movement in 3D space.

GigaSLAM: Large-Scale Monocular SLAM with Hierarchical Gaussian Splats
GigaSLAM is a monocular SLAM framework designed for kilometer-scale outdoor environments. It uses hierarchical Gaussian splats and neural networks to achieve efficient, scalable mapping and high-fidelity rendering. The system addresses the challenges of large-scale tracking and mapping using only RGB input, extending Gaussian Splatting SLAM to unbounded outdoor scenes.

MonoPCC: Photometric-invariant Cycle Constraint for Monocular Depth Estimation
MonoPCC is a PyTorch implementation of monocular depth estimation designed for endoscopic images, based on a photometric-invariant cycle constraint. This self-supervised approach aims to improve depth prediction accuracy in challenging medical imaging scenarios. It reports state-of-the-art performance on datasets such as SCARED and KITTI, and its plug-and-play design allows integration into various backbone networks.

big_vision: Google Research's Codebase for Large-Scale Vision Models
big_vision is Google Research's official codebase for training large-scale vision models using JAX/Flax. It has been used to develop prominent architectures such as Vision Transformer, SigLIP, and MLP-Mixer. The repository offers a starting point for scalable vision experiments on GPUs and Cloud TPUs, scaling from single cores to distributed setups.

HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation
HunyuanVideo-Avatar is a project by Tencent-Hunyuan for high-fidelity, audio-driven human animation. Using a multimodal diffusion transformer, it generates dynamic, emotion-controllable, multi-character dialogue videos. The system addresses challenges in character consistency, emotion alignment, and multi-character animation, making it suitable for applications such as e-commerce and social media.

OmniParser: A Vision-Based Tool for GUI Agent Screen Parsing
OmniParser is a comprehensive tool developed by Microsoft for parsing user interface screenshots into structured, understandable elements. It significantly enhances the ability of vision-based models, such as GPT-4V, to generate accurate actions grounded in specific regions of a GUI. This project aims to advance pure vision-based GUI agents by providing robust screen parsing capabilities.

CineScale: Unlocking 4K High-Resolution Cinematic Video Generation
CineScale is a project by Eyeline-Labs that extends FreeScale to enable high-resolution cinematic video generation. It provides models and tools for producing video at up to 4K resolution using diffusion models, giving researchers and developers a framework for high-definition video generation.

StreamDiffusion: Real-Time Interactive Generation with Diffusion Pipelines
StreamDiffusion is a diffusion pipeline designed for real-time interactive generation, improving the performance of current diffusion-based image generation techniques. It offers a pipeline-level solution for high-speed text-to-image and image-to-image generation, making interactive AI experiences more accessible. The project introduces several features to optimize computational efficiency and GPU utilization.

SyncTalk: High-Quality Talking Head Synthesis from CVPR 2024
SyncTalk is the official repository for a CVPR 2024 paper on talking head synthesis. The project focuses on generating highly synchronized lip movements, facial expressions, and stable head poses, while also restoring hair details for high-resolution video output. It uses tri-plane hash representations to maintain subject identity.

VGGT: Visual Geometry Grounded Transformer for Rapid 3D Scene Reconstruction
VGGT, the recipient of the CVPR 2025 Best Paper Award, is a Visual Geometry Grounded Transformer developed by Facebook AI and the Visual Geometry Group at Oxford. This feed-forward neural network infers key 3D scene attributes, including camera parameters, depth maps, and 3D point tracks, from single or multiple images within seconds, enabling rapid 3D reconstruction and scene understanding.