GLM-OCR: Accurate, Fast, and Comprehensive Multimodal OCR Model

Summary

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture. It reports state-of-the-art results on document understanding benchmarks, runs efficiently at inference time, and is straightforward to integrate. The project is open source and optimized for real-world business scenarios.

Introduction

GLM-OCR is a multimodal OCR model built for complex document understanding. It is based on the GLM-V encoder-decoder architecture and is trained with a Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning, which are intended to improve training efficiency, recognition accuracy, and generalization. The model combines three components: a CogViT visual encoder, a lightweight cross-modal connector, and a GLM-0.5B language decoder. At inference time it runs a two-stage pipeline, layout analysis followed by parallel recognition, to deliver robust OCR across diverse document layouts.
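The two-stage flow described above can be illustrated with a minimal sketch. This is not GLM-OCR's implementation: `Region`, `run_two_stage`, and the stand-in detector and recognizer are hypothetical names, used only to show layout analysis followed by parallel recognition.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Region:
    """A hypothetical layout region: a label plus a pixel bounding box."""
    kind: str
    bbox: Tuple[int, int, int, int]


def run_two_stage(
    page: str,
    detect_layout: Callable[[str], List[Region]],
    recognize: Callable[[str, Region], str],
    max_workers: int = 4,
) -> List[Tuple[Region, str]]:
    # Stage 1: layout analysis splits the page into regions.
    regions = detect_layout(page)
    # Stage 2: regions are recognized concurrently; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(lambda region: recognize(page, region), regions))
    return list(zip(regions, texts))


# Stand-ins for real models (assumptions, for illustration only).
def fake_detect_layout(page: str) -> List[Region]:
    return [Region("title", (0, 0, 100, 20)), Region("table", (0, 30, 100, 90))]


def fake_recognize(page: str, region: Region) -> str:
    return f"<{region.kind} from {page}>"


results = run_two_stage("page-1", fake_detect_layout, fake_recognize)
for region, text in results:
    print(region.kind, text)
```

Because the detector and recognizer are passed in as callables, the orchestration can be exercised without any model weights. A thread pool is one plausible way to parallelize the second stage when recognition is served by a remote or GPU-backed endpoint.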

Installation

The GLM-OCR SDK can be installed in several configurations, depending on how it will be deployed.

For cloud or MaaS usage with local images/PDFs (fastest install):

pip install glmocr

For self-hosted pipelines requiring layout detection:

pip install "glmocr[selfhosted]"

To include Flask service support:

pip install "glmocr[server]"

For development, you can install from source:

git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .

Examples

GLM-OCR provides both a Command Line Interface (CLI) and a Python API for easy interaction.

CLI Usage:

# Parse a single image
glmocr parse examples/source/code.png

# Parse a directory
glmocr parse examples/source/

# Set output directory
glmocr parse examples/source/code.png --output ./results/

# Enable debug logging with profiling
glmocr parse examples/source/code.png --log-level DEBUG
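When the CLI is driven from a script, it can help to build the argument list in one place. The sketch below uses only the subcommand and flags shown above (`parse`, `--output`, `--log-level`). The helper name `build_parse_command` is hypothetical, and it is an assumption that the two flags can be combined in a single call.

```python
import shlex
from typing import List, Optional


def build_parse_command(
    source: str,
    output: Optional[str] = None,
    log_level: Optional[str] = None,
) -> List[str]:
    """Build an argument list for `glmocr parse` (hypothetical helper)."""
    command = ["glmocr", "parse", source]
    if output is not None:
        command += ["--output", output]
    if log_level is not None:
        command += ["--log-level", log_level]
    return command


command = build_parse_command(
    "examples/source/code.png", output="./results/", log_level="DEBUG"
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))

# To actually run it (requires glmocr to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems when paths contain spaces.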

Python API Usage:

from glmocr import GlmOcr, parse

# Simple function call: parse one image
result = parse("image.png")
# A list is treated as the pages of a single document; this replaces the result above
result = parse(["img1.png", "img2.jpg"])
result.save(output_dir="./results")

# Class-based API
with GlmOcr() as parser:
    result = parser.parse("image.png")
    print(result.json_result)
    result.save()

# Place layout model on CPU
with GlmOcr(layout_device="cpu") as parser:
    result = parser.parse("image.png")
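Since a list passed to `parse` is treated as the pages of one document, separate documents need one call each. The following is a minimal sketch of that loop. The helper names `collect_inputs` and `parse_each` are hypothetical, the suffix set is an assumption, and the parse function is injected so the helpers can be tried without the SDK installed. Only `parse(path)` and `result.save(output_dir=...)` are taken from the examples above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Assumed set of image suffixes; adjust to the inputs you actually have.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_inputs(root, suffixes: Iterable[str] = IMAGE_SUFFIXES) -> List[str]:
    """Return sorted paths of files directly under root with a matching suffix."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(p) for p in Path(root).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


def parse_each(paths: Iterable[str], parse_fn: Callable, output_root) -> Dict[str, str]:
    """Parse each path as its own document and save it under output_root/<stem>."""
    outputs: Dict[str, str] = {}
    for path in paths:
        # Note: files sharing a stem (a.png, a.jpg) would share an output directory.
        out_dir = Path(output_root) / Path(path).stem
        result = parse_fn(path)
        result.save(output_dir=str(out_dir))
        outputs[path] = str(out_dir)
    return outputs


# Usage with the SDK (requires glmocr to be installed):
# from glmocr import parse
# parse_each(collect_inputs("examples/source"), parse, "./results")
```

Giving each document its own output directory keeps results from different inputs apart. Whether `save` creates missing directories is not stated in the source, so check that before relying on it.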

Why Use GLM-OCR?

GLM-OCR ranks #1 on OmniDocBench V1.5 and reports top results on major document understanding benchmarks, including formula and table recognition. It is optimized for real-world business scenarios and remains robust on complex tables, code-heavy documents, and challenging layouts. At 0.9B parameters, it supports efficient inference via vLLM, SGLang, and Ollama, which lowers latency and compute cost and suits high-concurrency services and edge deployments. It is also fully open source, with simple installation, one-line invocation, and straightforward integration into existing production pipelines.