Repository History

5 repositories tagged with Multimodal AI

Topic: Multimodal AI

GLM-OCR: Accurate, Fast, and Comprehensive Multimodal OCR Model

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture. It reports state-of-the-art performance across various benchmarks, with efficient inference and easy integration. This open-source model is optimized for real-world business scenarios that need robust, high-quality OCR.

Analyzed May 28, 2026

UI-TARS-desktop: The Open-Source Multimodal AI Agent Stack

UI-TARS-desktop is an open-source multimodal AI agent stack from ByteDance that connects AI models with agent infrastructure. It provides two components: Agent TARS, a general multimodal AI agent with a CLI and Web UI, and UI-TARS Desktop, a native GUI agent for controlling local and remote computers and browsers. The stack aims to complete tasks the way a person would, combining multimodal capabilities with integration into real-world tools.

Analyzed May 6, 2026

Kimi-k1.5: Scaling Reinforcement Learning with LLMs and Multimodality

Kimi-k1.5 is an o1-level multimodal model trained by scaling reinforcement learning with Large Language Models. It reports state-of-the-art performance on short chain-of-thought (short-CoT) tasks, outperforming models such as GPT-4o and Claude Sonnet 3.5, and matches o1 performance on long-CoT tasks across multiple modalities. The project highlights long-context scaling and improved policy optimization as its key innovations.

Analyzed Apr 17, 2026

fast-agent: Build and Orchestrate Multimodal AI Agents and Workflows

fast-agent is a Python framework for creating and interacting with multimodal AI agents and workflows. It offers a simple, declarative syntax for defining agents, broad model support, and end-to-end tested MCP (Model Context Protocol) integration. Developers can build, test, and deploy agent applications that use structured outputs, vision, and various orchestration patterns.

Analyzed Jan 9, 2026

Attachments: The Python Funnel for LLM Context and Multimodal Data

Attachments simplifies providing context to Large Language Models by transforming various file types into model-ready text and images. This Python library acts as a universal funnel, enabling developers to integrate diverse data sources like PDFs, images, web content, and even entire code repositories with just a few lines of code. It supports popular LLM APIs and frameworks, making multimodal AI development more accessible.

Analyzed Nov 24, 2025

