GLM-OCR: Accurate, Fast, and Comprehensive Multimodal OCR Model

GLM-OCR: Accurate, Fast, and Comprehensive Multimodal OCR Model

Summary

GLM-OCR is a powerful multimodal OCR model designed for complex document understanding, built on the GLM-V encoder-decoder architecture. It achieves state-of-the-art performance across various benchmarks, offering efficient inference and easy integration. This open-source solution is optimized for real-world business scenarios, providing robust and high-quality OCR capabilities.

Repository Info

Updated on May 28, 2026
View on GitHub

Tags

Click on any tag to explore related repositories

Introduction

GLM-OCR is a powerful multimodal OCR model specifically engineered for complex document understanding. Built upon the GLM-V encoder-decoder architecture, it incorporates Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to enhance training efficiency, recognition accuracy, and generalization. The model integrates a CogViT visual encoder, a lightweight cross-modal connector, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.

Installation

The GLM-OCR SDK offers flexible installation options to suit various deployment scenarios.

For cloud or MaaS usage with local images/PDFs (fastest install):

pip install glmocr

For self-hosted pipelines requiring layout detection:

pip install "glmocr[selfhosted]"

To include Flask service support:

pip install "glmocr[server]"

For development, you can install from source:

git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .

Examples

GLM-OCR provides both a Command Line Interface (CLI) and a Python API for easy interaction.

CLI Usage:

# Parse a single image
glmocr parse examples/source/code.png

# Parse a directory
glmocr parse examples/source/

# Set output directory
glmocr parse examples/source/code.png --output ./results/

# Enable debug logging with profiling
glmocr parse examples/source/code.png --log-level DEBUG

Python API Usage:

from glmocr import GlmOcr, parse

# Simple function call
result = parse("image.png")
result = parse(["img1.png", "img2.jpg"]) # List treated as pages of a single document
result.save(output_dir="./results")

# Class-based API
with GlmOcr() as parser:
    result = parser.parse("image.png")
    print(result.json_result)
    result.save()

# Place layout model on CPU
with GlmOcr(layout_device="cpu") as parser:
    result = parser.parse("image.png")

Why Use GLM-OCR?

GLM-OCR stands out for its state-of-the-art performance, ranking #1 on OmniDocBench V1.5 and achieving top results across major document understanding benchmarks, including formula and table recognition. It is specifically optimized for real-world business scenarios, maintaining robust performance on complex tables, code-heavy documents, and challenging layouts. With only 0.9B parameters, GLM-OCR supports efficient inference via vLLM, SGLang, and Ollama, significantly reducing latency and compute costs, making it ideal for high-concurrency services and edge deployments. Furthermore, it is fully open-sourced and easy to use, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.

Links