Kimi-k1.5: Scaling Reinforcement Learning with LLMs and Multimodality

Summary

Kimi-k1.5 introduces an o1-level multi-modal model that significantly advances reinforcement learning with Large Language Models. It demonstrates state-of-the-art performance in short-CoT tasks, outperforming leading models like GPT-4o and Claude Sonnet 3.5, and matches o1 performance in long-CoT scenarios across various modalities. This project highlights key innovations in long context scaling and improved policy optimization.

Repository Info

Updated on April 17, 2026

Introduction

Kimi-k1.5 is a groundbreaking o1-level multi-modal model developed by MoonshotAI, focused on scaling reinforcement learning with Large Language Models (LLMs). It sets new benchmarks in both short-CoT (Chain-of-Thought) and long-CoT performance, showing marked improvements over existing models on complex reasoning tasks across multiple modalities. The project emphasizes a simple yet powerful RL framework, leveraging innovations in context scaling and policy optimization to achieve advanced capabilities like planning, reflection, and correction without relying on more complex techniques.

Installation

Kimi-k1.5 is presented as a research model with an accompanying paper detailing its architecture and performance. The GitHub repository primarily serves as a hub for the research paper and related assets; as of now, no installation instructions or runnable codebase are provided. For details on the model's design and training, refer to the Full Report on arXiv and watch the GitHub repository for future code releases or updates.

Examples

Kimi-k1.5 showcases impressive capabilities across various benchmarks:

  • Short-CoT Performance: Achieves state-of-the-art results, outperforming GPT-4o and Claude Sonnet 3.5 by a significant margin (up to +550%) on tasks like AIME, MATH-500, and LiveCodeBench.
  • Long-CoT Performance: Matches o1 performance across multiple modalities, including MathVista, AIME, and Codeforces.
  • Multimodality: The model is jointly trained on text and vision data, enabling it to reason effectively across both modalities.
  • Advanced Reasoning: Due to its scaled context length, the learned CoTs exhibit properties of planning, reflection, and correction, effectively increasing the number of search steps without requiring complex techniques like Monte Carlo tree search or value functions.
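The last point can be illustrated with a toy sketch (not the model's actual decoding): a single trajectory that proposes, checks, and corrects its own guesses. A larger step budget, the analogue of a longer context, buys more propose-reflect-correct iterations and therefore more implicit search, with no external tree search or value function. All names here are hypothetical.

```python
def solve_with_budget(target, budget):
    """Toy stand-in for long-CoT reasoning: one trajectory proposes a
    guess, reflects on feedback, and corrects itself. A larger budget
    (analogous to a longer context) permits more of these steps."""
    lo, hi = 0, 100                      # search space for the toy task
    steps = 0
    while steps < budget:
        guess = (lo + hi) // 2           # propose
        steps += 1
        if guess == target:              # reflect: compare with feedback
            return steps                 # solved within the budget
        if guess < target:               # correct: narrow the interval
            lo = guess + 1
        else:
            hi = guess - 1
    return None                          # budget too small to finish
```

With a short budget the search fails (`solve_with_budget(73, 4)` returns `None`), while a longer one succeeds (`solve_with_budget(73, 10)` returns `6`) — more budget, more effective search steps.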

Why Use Kimi-k1.5

Kimi-k1.5 introduces several key innovations that make it a significant development in the field of AI:

  • Long Context Scaling: It successfully scales the context window of RL to 128k, demonstrating continued performance improvement with increased context length. This is achieved through partial rollouts, which enhance training efficiency by reusing large chunks of previous trajectories.
  • Improved Policy Optimization: The model employs a robust policy optimization method, a variant of online mirror descent, specifically formulated for RL with long-CoT. This is further refined by effective sampling strategies, length penalties, and optimized data recipes.
  • Simple Framework: By combining long context scaling with improved policy optimization, Kimi-k1.5 establishes a powerful yet simple RL framework. It achieves strong performance and complex reasoning capabilities without additional intricate components.
  • Joint Multimodal Reasoning: Its joint training on text and vision data provides inherent capabilities for reasoning over both modalities, making it versatile for a wide range of applications.
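The partial-rollout idea above can be sketched as a scheduler that extends long trajectories in fixed-size chunks and resumes unfinished ones in the next iteration, so earlier chunks are reused rather than regenerated. This is a hypothetical illustration, not the paper's implementation; `generate(prefix, n)` stands in for the policy's decoder.

```python
from collections import deque

def partial_rollout_scheduler(prompts, rollout_len, chunk_len, generate):
    """Hypothetical sketch of partial rollouts: each iteration extends
    every active trajectory by at most `chunk_len` tokens; unfinished
    trajectories are saved and resumed next iteration, reusing the
    previously generated chunks."""
    active = deque((p, []) for p in prompts)    # (prompt, tokens so far)
    finished = []
    while active:
        for _ in range(len(active)):            # one training iteration
            prompt, toks = active.popleft()
            toks = toks + generate(prompt + "".join(toks), chunk_len)
            if len(toks) >= rollout_len:        # trajectory complete
                finished.append((prompt, toks[:rollout_len]))
            else:                               # save for next iteration
                active.append((prompt, toks))
        # in real training, this iteration's chunks would go to the learner
    return finished
```

Because each iteration only decodes one chunk per trajectory, long rollouts amortize their cost across iterations instead of stalling the batch.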
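The length penalty mentioned above can be sketched as reward shaping over a group of sampled responses to the same problem: shorter correct answers are favored, and incorrect answers are never rewarded merely for being short. The coefficients below are illustrative assumptions, not the paper's exact recipe.

```python
def length_rewards(lengths, correct):
    """Hedged sketch of a length penalty for long-CoT RL: among sampled
    responses to one problem, shorter correct answers get a bonus and
    overlong ones a penalty, discouraging reward-driven rambling."""
    lo, hi = min(lengths), max(lengths)
    span = max(hi - lo, 1)                       # avoid division by zero
    out = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - lo) / span              # +0.5 shortest .. -0.5 longest
        out.append(lam if ok else min(0.0, lam)) # wrong answers never rewarded
    return out
```

For example, `length_rewards([10, 30], [True, True])` yields `[0.5, -0.5]`: the shorter correct answer is rewarded, the longer one penalized.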

Links