GLM-OCR: Accurate, Fast, and Comprehensive Multimodal OCR Model

Summary
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture. It reports state-of-the-art results on several document understanding benchmarks, supports efficient inference, and integrates easily into existing pipelines. The model is open source and optimized for real-world business scenarios.
Introduction
GLM-OCR is a multimodal OCR model built for complex document understanding. It is based on the GLM-V encoder-decoder architecture and is trained with a Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning, which improve training efficiency, recognition accuracy, and generalization. The model combines a CogViT visual encoder, a lightweight cross-modal connector, and a GLM-0.5B language decoder. At inference time it runs a two-stage pipeline: layout analysis first, then parallel recognition of the detected regions. This lets it handle diverse document layouts robustly.
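The two-stage idea (layout analysis, then parallel recognition) can be sketched in a few lines. This is only an illustration of the control flow, not the SDK's internals: `Region`, `analyze_layout`, `recognize`, and `run_pipeline` are hypothetical stand-ins, strings stand in for image crops, and the thread pool is an assumption about how the recognition stage could be parallelized.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    index: int    # reading-order position assigned by layout analysis
    kind: str     # e.g. "text", "table", "formula"
    content: str  # stand-in for the cropped image region


def analyze_layout(page):
    """Stage 1 (hypothetical stand-in): split a page into ordered regions."""
    return [Region(i, kind, content) for i, (kind, content) in enumerate(page)]


def recognize(region):
    """Stage 2 (hypothetical stand-in): recognize one region on its own."""
    return f"[{region.kind}] {region.content.strip()}"


def run_pipeline(page, max_workers=4):
    """Run layout analysis, then recognize all regions concurrently."""
    regions = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the output
        # stays in reading order even if regions finish out of order.
        return list(pool.map(recognize, regions))


if __name__ == "__main__":
    demo_page = [("text", " Hello "), ("table", "a|b")]
    for line in run_pipeline(demo_page):
        print(line)
```

Because each region is recognized independently, the second stage scales with the number of workers, while the layout stage fixes the order in which results are reassembled.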
Installation
The GLM-OCR SDK offers flexible installation options to suit various deployment scenarios.
For cloud or MaaS usage with local images/PDFs (fastest install):
pip install glmocr
For self-hosted pipelines requiring layout detection:
pip install "glmocr[selfhosted]"
To include Flask service support:
pip install "glmocr[server]"
For development, you can install from source:
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .
Examples
GLM-OCR provides both a Command Line Interface (CLI) and a Python API for easy interaction.
CLI Usage:
# Parse a single image
glmocr parse examples/source/code.png
# Parse a directory
glmocr parse examples/source/
# Set output directory
glmocr parse examples/source/code.png --output ./results/
# Enable debug logging with profiling
glmocr parse examples/source/code.png --log-level DEBUG
Python API Usage:
from glmocr import GlmOcr, parse
# Simple function call
result = parse("image.png")
result = parse(["img1.png", "img2.jpg"]) # List treated as pages of a single document
result.save(output_dir="./results")
# Class-based API
with GlmOcr() as parser:
    result = parser.parse("image.png")
    print(result.json_result)
    result.save()
# Place layout model on CPU
with GlmOcr(layout_device="cpu") as parser:
    result = parser.parse("image.png")
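Since a list passed to `parse` is treated as the pages of one document, parsing a folder of unrelated files means calling the parser once per file. The helper below is a minimal sketch of that loop. `collect_inputs`, `parse_each`, and the `INPUT_SUFFIXES` set are hypothetical names, and the suffix list is an assumption rather than the SDK's documented list of supported formats. The parse function is passed in as an argument, so the helper itself does not import `glmocr`.

```python
from pathlib import Path

# Assumption: file types worth sending to the parser. Adjust as needed.
INPUT_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf"}


def collect_inputs(directory, suffixes=INPUT_SUFFIXES):
    """Return sorted paths of files in `directory` with a matching suffix."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )


def parse_each(paths, parse_fn):
    """Call `parse_fn` once per path so each file is its own document."""
    return {path: parse_fn(path) for path in paths}
```

With the SDK installed, `parser.parse` from the class-based API above could be passed as `parse_fn`, for example `parse_each(collect_inputs("examples/source"), parser.parse)` inside the `with GlmOcr() as parser:` block.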
Why Use GLM-OCR?
GLM-OCR ranks #1 on OmniDocBench V1.5 and achieves top results on major document understanding benchmarks, including formula and table recognition. It is optimized for real-world business scenarios and stays robust on complex tables, code-heavy documents, and challenging layouts. With only 0.9B parameters, it supports efficient inference via vLLM, SGLang, and Ollama, which reduces latency and compute cost and makes it suitable for high-concurrency services and edge deployments. It is also fully open source and easy to use, with simple installation, one-line invocation, and straightforward integration into existing production pipelines.
Links
- GitHub Repository: https://github.com/zai-org/GLM-OCR
- Technical Report: https://arxiv.org/abs/2603.10910
- GLM-OCR API Documentation: https://docs.z.ai/guides/vlm/glm-ocr
- Hugging Face Model: https://huggingface.co/zai-org/GLM-OCR
- ModelScope Model: https://modelscope.cn/models/ZhipuAI/GLM-OCR