Repository History

16 repositories tagged with Computer Vision

Topic: Computer Vision
PyTorch Image Models (timm): The Ultimate Collection of Image Encoders

PyTorch Image Models (timm) is a library offering the largest collection of PyTorch image encoders and backbones. It provides a wide range of state-of-the-art models with pretrained weights, along with scripts for training, evaluation, and inference. This makes it a practical resource for researchers and developers working on computer vision tasks in PyTorch.

Analyzed May 5, 2026

CoTracker: A Powerful Model for Tracking Any Point on a Video

CoTracker is a transformer-based model from Facebook AI Research and the University of Oxford for tracking any point (pixel) across a video sequence. It offers fast, accurate, quasi-dense point tracking, which makes it useful to researchers and developers who need to analyze motion in videos.

Analyzed Apr 11, 2026

WiFi-3D-Fusion: Real-Time 3D Human Pose Estimation from WiFi Signals

WiFi-3D-Fusion is an open-source research project that uses WiFi channel state information (CSI) and deep learning to estimate 3D human pose. It fuses wireless sensing with computer vision techniques and offers real-time motion detection and visualization of human movement in 3D space.

Analyzed Mar 15, 2026

GigaSLAM: Large-Scale Monocular SLAM with Hierarchical Gaussian Splats

GigaSLAM is a monocular SLAM framework designed for kilometer-scale outdoor environments. It uses hierarchical Gaussian splats and neural networks for efficient, scalable mapping and high-fidelity rendering. Working from RGB input alone, it addresses large-scale tracking and mapping and extends Gaussian Splatting SLAM to unbounded outdoor scenes.

Analyzed Feb 28, 2026

MonoPCC: Photometric-invariant Cycle Constraint for Monocular Depth Estimation

MonoPCC is a PyTorch implementation of self-supervised monocular depth estimation for endoscopic images, built around a photometric-invariant cycle constraint. The approach aims to improve depth prediction accuracy in challenging medical imaging scenarios. It reports state-of-the-art performance on datasets such as SCARED and KITTI, and its plug-and-play design allows integration into various backbone networks.

Analyzed Feb 27, 2026

big_vision: Google Research's Codebase for Large-Scale Vision Models

big_vision is Google Research's official codebase for training large-scale vision models with JAX/Flax. It has been used to develop architectures such as Vision Transformer, SigLIP, and MLP-Mixer. The repository gives researchers a starting point for scalable vision experiments on GPUs and Cloud TPUs, from a single core up to distributed setups.

Analyzed Dec 31, 2025

HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation

HunyuanVideo-Avatar is a project by Tencent-Hunyuan for high-fidelity, audio-driven human animation. It uses a multimodal diffusion transformer to generate dynamic, emotion-controllable, multi-character dialogue videos. The system targets character consistency, emotion alignment, and multi-character animation, with applications such as e-commerce and social media.

Analyzed Dec 30, 2025

OmniParser: A Vision-Based Tool for GUI Agent Screen Parsing

OmniParser is a tool developed by Microsoft for parsing user interface screenshots into structured elements. It improves the ability of vision-based models, such as GPT-4V, to generate actions that are accurately grounded in specific regions of a GUI. The project aims to advance pure vision-based GUI agents through reliable screen parsing.

Analyzed Dec 28, 2025

CineScale: Unlocking 4K High-Resolution Cinematic Video Generation

CineScale is a GitHub repository by Eyeline-Labs that extends FreeScale to high-resolution cinematic video generation. It provides diffusion-based models and tools for video output at up to 4K resolution, giving researchers and developers a framework for generating high-definition video.

Analyzed Dec 18, 2025

StreamDiffusion: Real-Time Interactive Generation with Diffusion Pipelines

StreamDiffusion is a diffusion pipeline designed for real-time interactive generation that improves the performance of current diffusion-based image generation techniques. It offers a pipeline-level solution for high-speed image-to-image and text-to-image generation, and it introduces several features that optimize computational efficiency and GPU utilization.

Analyzed Dec 13, 2025

SyncTalk: High-Quality Talking Head Synthesis from CVPR 2024

SyncTalk is the official repository for a CVPR 2024 paper on talking head synthesis. The project generates synchronized lip movements, facial expressions, and stable head poses, and restores hair details for high-resolution video output. It uses tri-plane hash representations to maintain subject identity.

Analyzed Dec 11, 2025

VGGT: Visual Geometry Grounded Transformer for Rapid 3D Scene Reconstruction

VGGT (Visual Geometry Grounded Transformer), recipient of the CVPR 2025 Best Paper Award, was developed by Facebook AI and the Visual Geometry Group at Oxford. This feed-forward neural network infers key 3D scene attributes, including camera parameters, depth maps, and 3D point tracks, from one or more images within seconds, which supports rapid 3D reconstruction and scene understanding.

Analyzed Dec 10, 2025
OSRepos

Analysis and discovery of open source repositories. Find interesting projects and follow their updates.

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of third-party repository code is at your own risk. Always review source code, dependencies, licenses, and security implications before running anything.

© 2025 OSRepos. Built with Nuxt 3 and lots of ❤️