Repository History
5 repositories tagged with Multimodal AI

GLM-OCR: Accurate, Fast, and Comprehensive Multimodal OCR Model
GLM-OCR is a multimodal OCR model designed for complex document understanding, built on the GLM-V encoder-decoder architecture. It reports state-of-the-art performance across various benchmarks while offering efficient inference and easy integration. This open-source solution is optimized for real-world business scenarios that need robust, high-quality OCR.

UI-TARS-desktop: The Open-Source Multimodal AI Agent Stack
UI-TARS-desktop is an open-source multimodal AI Agent stack from ByteDance, designed to connect cutting-edge AI models with agent infrastructure. It provides both Agent TARS, a general multimodal AI agent with CLI and Web UI, and UI-TARS Desktop, a native GUI agent for local and remote computer/browser control. This powerful tool aims to enable human-like task completion through rich multimodal capabilities and seamless integration with real-world tools.

Kimi-k1.5: Scaling Reinforcement Learning with LLMs and Multimodality
Kimi-k1.5 introduces an o1-level multimodal model trained by scaling reinforcement learning with Large Language Models. It reports state-of-the-art performance on short-CoT tasks, outperforming leading models like GPT-4o and Claude Sonnet 3.5, and matches o1 performance on long-CoT tasks across multiple modalities. The project highlights long-context scaling and improved policy optimization as its key innovations.

fast-agent: Build and Orchestrate Multimodal AI Agents and Workflows
fast-agent is a Python framework for creating and interacting with multimodal AI agents and workflows. It offers a simple, declarative syntax for defining agents, comprehensive model support, and end-to-end tested MCP (Model Context Protocol) integration. Developers can rapidly build, test, and deploy complex agent applications with capabilities such as structured outputs, vision, and various orchestration patterns.

Attachments: The Python Funnel for LLM Context and Multimodal Data
Attachments simplifies providing context to Large Language Models by transforming various file types into model-ready text and images. This Python library acts as a universal funnel, enabling developers to integrate diverse data sources like PDFs, images, web content, and even entire code repositories with just a few lines of code. It supports popular LLM APIs and frameworks, making multimodal AI development more accessible.