Kimi-k1.5: Scaling Reinforcement Learning with LLMs and Multimodality

Summary

Kimi-k1.5 introduces an o1-level multi-modal model that significantly advances reinforcement learning with Large Language Models. It demonstrates state-of-the-art performance in short-CoT tasks, outperforming leading models like GPT-4o and Claude Sonnet 3.5, and matches o1 performance in long-CoT scenarios across various modalities. This project highlights key innovations in long context scaling and improved policy optimization.

Repository Info

Updated on April 17, 2026

Introduction

Kimi-k1.5 is a groundbreaking o1-level multi-modal model developed by MoonshotAI, focused on scaling reinforcement learning with Large Language Models (LLMs). It sets new benchmarks in both short-CoT (Chain-of-Thought) and long-CoT performance, showing marked improvements over existing models on complex reasoning tasks across multiple modalities. The project emphasizes a simple yet powerful RL framework, leveraging innovations in context scaling and policy optimization to achieve advanced capabilities like planning, reflection, and correction without relying on more complex techniques.

Installation

Kimi-k1.5 is presented as a research model with an accompanying paper detailing its architecture and performance. The GitHub repository primarily serves as a hub for the research paper and related assets; as of now, no installation instructions or runnable codebase are provided. For details on the model's design and training, refer to the Full Report on arXiv and watch the GitHub repository for future code releases or updates.

Examples

Kimi-k1.5 showcases impressive capabilities across various benchmarks:

  • Short-CoT Performance: Achieves state-of-the-art results, outperforming GPT-4o and Claude Sonnet 3.5 by a significant margin (up to +550%) on tasks like AIME, MATH-500, and LiveCodeBench.
  • Long-CoT Performance: Matches o1 performance across multiple modalities, including MathVista, AIME, and Codeforces.
  • Multimodality: The model is jointly trained on text and vision data, enabling it to reason effectively across both modalities.
  • Advanced Reasoning: Due to its scaled context length, the learned CoTs exhibit properties of planning, reflection, and correction, effectively increasing the number of search steps without requiring complex techniques like Monte Carlo tree search or value functions.
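The last point can be illustrated with a toy sketch (not the model's actual decoding): a single trajectory that proposes, checks, and corrects its own guesses. A larger step budget, the analogue of a longer context, buys more propose-reflect-correct iterations and therefore more implicit search, with no external tree search or value function. All names here are hypothetical.

```python
def solve_with_budget(target, budget):
    """Toy stand-in for long-CoT reasoning: one trajectory proposes a
    guess, reflects on feedback, and corrects itself. A larger budget
    (analogous to a longer context) permits more of these steps."""
    lo, hi = 0, 100                      # search space for the toy task
    steps = 0
    while steps < budget:
        guess = (lo + hi) // 2           # propose
        steps += 1
        if guess == target:              # reflect: compare with feedback
            return steps                 # solved within the budget
        if guess < target:               # correct: narrow the interval
            lo = guess + 1
        else:
            hi = guess - 1
    return None                          # budget too small to finish
```

With a short budget the search fails (`solve_with_budget(73, 4)` returns `None`), while a longer one succeeds (`solve_with_budget(73, 10)` returns `6`) — more budget, more effective search steps.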

Why Use Kimi-k1.5

Kimi-k1.5 introduces several key innovations that make it a significant development in the field of AI:

  • Long Context Scaling: It successfully scales the context window of RL to 128k, demonstrating continued performance improvement with increased context length. This is achieved through partial rollouts, which enhance training efficiency by reusing large chunks of previous trajectories.
  • Improved Policy Optimization: The model employs a robust policy optimization method, a variant of online mirror descent, specifically formulated for RL with long-CoT. This is further refined by effective sampling strategies, length penalties, and optimized data recipes.
  • Simple Framework: By combining long context scaling with improved policy optimization, Kimi-k1.5 establishes a powerful yet simple RL framework. It achieves strong performance and complex reasoning capabilities without additional intricate components.
  • Joint Multimodal Reasoning: Its joint training on text and vision data provides inherent capabilities for reasoning over both modalities, making it versatile for a wide range of applications.
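The partial-rollout idea above can be sketched as a scheduler that extends long trajectories in fixed-size chunks and resumes unfinished ones in the next iteration, so earlier chunks are reused rather than regenerated. This is a hypothetical illustration, not the paper's implementation; `generate(prefix, n)` stands in for the policy's decoder.

```python
from collections import deque

def partial_rollout_scheduler(prompts, rollout_len, chunk_len, generate):
    """Hypothetical sketch of partial rollouts: each iteration extends
    every active trajectory by at most `chunk_len` tokens; unfinished
    trajectories are saved and resumed next iteration, reusing the
    previously generated chunks."""
    active = deque((p, []) for p in prompts)    # (prompt, tokens so far)
    finished = []
    while active:
        for _ in range(len(active)):            # one training iteration
            prompt, toks = active.popleft()
            toks = toks + generate(prompt + "".join(toks), chunk_len)
            if len(toks) >= rollout_len:        # trajectory complete
                finished.append((prompt, toks[:rollout_len]))
            else:                               # save for next iteration
                active.append((prompt, toks))
        # in real training, this iteration's chunks would go to the learner
    return finished
```

Because each iteration only decodes one chunk per trajectory, long rollouts amortize their cost across iterations instead of stalling the batch.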
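The length penalty mentioned above can be sketched as reward shaping over a group of sampled responses to the same problem: shorter correct answers are favored, and incorrect answers are never rewarded merely for being short. The coefficients below are illustrative assumptions, not the paper's exact recipe.

```python
def length_rewards(lengths, correct):
    """Hedged sketch of a length penalty for long-CoT RL: among sampled
    responses to one problem, shorter correct answers get a bonus and
    overlong ones a penalty, discouraging reward-driven rambling."""
    lo, hi = min(lengths), max(lengths)
    span = max(hi - lo, 1)                       # avoid division by zero
    out = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - lo) / span              # +0.5 shortest .. -0.5 longest
        out.append(lam if ok else min(0.0, lam)) # wrong answers never rewarded
    return out
```

For example, `length_rewards([10, 30], [True, True])` yields `[0.5, -0.5]`: the shorter correct answer is rewarded, the longer one penalized.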

Links