Marker: High-Accuracy Document Conversion to Markdown and JSON

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

Marker is an open-source Python tool that converts documents such as PDFs, images, and office files into Markdown, JSON, and HTML. It preserves complex formatting, extracts images, and can optionally use LLMs to improve accuracy, which makes it useful for building structured data from documents.

Repository Information

Analyzed by OSRepos on November 9, 2025

Introduction

Marker is a powerful, open-source tool developed by datalab-to, designed to convert various document types into structured formats such as Markdown, JSON, chunks, and HTML with high speed and accuracy. It supports a wide range of input files including PDFs, images, PPTX, DOCX, XLSX, HTML, and EPUB, across all languages.

Key features of Marker include:

  • Conversion of diverse file types.
  • Accurate formatting of tables, forms, equations, inline math, links, references, and code blocks.
  • Extraction and saving of images.
  • Intelligent removal of headers, footers, and other artifacts.
  • Extensibility for custom formatting and logic.
  • Structured extraction based on a JSON schema (beta).
  • Optional integration with Large Language Models (LLMs) for boosted accuracy.
  • Compatibility with GPU, CPU, and MPS for flexible deployment.

In the project's published benchmarks, Marker compares favorably in both speed and accuracy with cloud services such as Llamaparse and Mathpix, as well as with other open-source tools.

Installation

To get started with Marker, you'll need Python 3.10+ and PyTorch.

Install the core marker-pdf package:

pip install marker-pdf

For converting documents other than PDFs, install additional dependencies:

pip install "marker-pdf[full]"
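
Before installing, it can help to confirm the two prerequisites named above. The following is a minimal sketch using only the standard library; the function names python_is_supported and torch_is_installed are illustrative and are not part of Marker.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this profile


def python_is_supported(version_info=None):
    """Return True if the interpreter meets the stated Python 3.10+ requirement."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def torch_is_installed():
    """Return True if a module named 'torch' can be found, without importing it."""
    return importlib.util.find_spec("torch") is not None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("PyTorch found:", torch_is_installed())
```

The check only looks for the presence of PyTorch; it does not verify that the installed build matches your GPU, CPU, or MPS setup.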

Examples

Marker offers various ways to convert documents, from command-line interfaces to a Python API.

Interactive App

Try Marker interactively with a Streamlit app:

pip install streamlit streamlit-ace
marker_gui

Convert a Single File

Convert a PDF or image file from the command line:

marker_single /path/to/file.pdf

You can specify options like --output_format [markdown|json|html|chunks] and --use_llm for enhanced accuracy.
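
If you call marker_single from a script, the options above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; the helper name build_marker_single_command is hypothetical, and it uses only the two flags named in this section.

```python
import shlex

# The four output formats listed for --output_format in this section.
OUTPUT_FORMATS = ("markdown", "json", "html", "chunks")


def build_marker_single_command(path, output_format="markdown", use_llm=False):
    """Build the argument list for one marker_single invocation (does not run it)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {OUTPUT_FORMATS}")
    command = ["marker_single", str(path), "--output_format", output_format]
    if use_llm:
        command.append("--use_llm")
    return command


if __name__ == "__main__":
    cmd = build_marker_single_command("/path/to/file.pdf", "json", use_llm=True)
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
    # To execute it for real, pass the list to subprocess.run(cmd, check=True).
```

Building a list rather than a single string avoids shell quoting problems when file paths contain spaces.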

Convert Multiple Files

Process an entire folder of documents:

marker /path/to/input/folder

This command supports all the options available for marker_single.
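
Before a batch run, you may want to see which files in a folder are likely candidates for conversion. This is a rough pre-flight sketch, not part of Marker: the extension set is an assumption inferred from the input types listed in the introduction, and Marker's own list of accepted extensions may differ.

```python
from pathlib import Path

# Assumption: extensions inferred from the input types named in the introduction
# (PDF, images, PPTX, DOCX, XLSX, HTML, EPUB). Marker's actual list may differ.
CANDIDATE_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".pptx", ".docx", ".xlsx", ".html", ".epub",
}


def list_candidate_files(folder):
    """Return files directly inside `folder` with a candidate extension, sorted by path."""
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in CANDIDATE_EXTENSIONS
    )


if __name__ == "__main__":
    # List candidate files in the current directory.
    for path in list_candidate_files("."):
        print(path.name)
```

The scan is not recursive and ignores directories, so it reports only files at the top level of the folder.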

Use from Python

Integrate Marker directly into your Python applications:

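from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load the models once and reuse the converter for every file.
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
# "FILEPATH" is a placeholder for the path to your document.
rendered = converter("FILEPATH")
# Unpack the converted text and the extracted images; the middle value is unused here.
text, _, images = text_from_rendered(rendered)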

Marker also provides specialized converters for specific tasks, such as TableConverter for extracting tables, OCRConverter for OCR-only processing, and ExtractionConverter for structured data extraction using a JSON schema.
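
Once the Python example above has produced text and images, a common next step is writing them to disk. The helper below is a minimal sketch and is not part of Marker; it assumes that images maps file names to objects with a PIL-style save(path) method, which you should confirm against your installed version. The name save_conversion and the default file name output.md are hypothetical.

```python
from pathlib import Path


def save_conversion(text, images, out_dir, text_name="output.md"):
    """Write converted text and any extracted images into `out_dir`.

    Assumption: `images` maps file names to objects with a PIL-style
    `save(path)` method; check this against your installed Marker version.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Write the converted document text as UTF-8.
    text_path = out_dir / text_name
    text_path.write_text(text, encoding="utf-8")

    # Save each extracted image next to the text file.
    image_paths = []
    for name, image in images.items():
        image_path = out_dir / name
        image.save(image_path)
        image_paths.append(image_path)
    return text_path, image_paths
```

With the variables from the example above, the call would be save_conversion(text, images, "out").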

Why Use Marker

Several properties make Marker a practical choice for document conversion:

  • Accuracy: In the project's benchmarks, Marker scores ahead of the compared tools on overall PDF conversion and table extraction, particularly when augmented with LLMs.
  • High Performance: It processes documents quickly on a range of hardware configurations, including GPUs.
  • Versatile Input/Output: Supports a broad spectrum of document formats and provides flexible output options including Markdown, JSON, HTML, and optimized chunks for RAG applications.
  • LLM Integration: The --use_llm flag lets Marker call LLM services such as Gemini, Google Vertex, Ollama, Claude, or OpenAI to improve accuracy in complex scenarios like table merging and form extraction.
  • Extensible Architecture: Its modular design, based on Providers, Builders, Processors, and Renderers, makes it easy for developers to customize and extend its functionality.
  • Commercial Options: For enterprise needs, Datalab offers a hosted API and an on-premise solution with high uptime and competitive pricing.
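
To make the Providers, Builders, Processors, and Renderers design more concrete, here is a toy pipeline that follows the same four-stage shape. It is a sketch under stated assumptions: none of these classes exist in Marker, and the responsibility given to each stage is an illustrative reading of the role names rather than a description of Marker's internals.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    kind: str  # e.g. "heading" or "text"
    text: str


@dataclass
class Document:
    blocks: list = field(default_factory=list)


class LineProvider:
    """Provider: supplies raw content from a source (here, a plain string)."""
    def __init__(self, source):
        self.source = source

    def lines(self):
        return [line for line in self.source.splitlines() if line.strip()]


class SimpleBuilder:
    """Builder: turns provider content into initial document blocks."""
    def build(self, provider):
        return Document([Block("text", line.strip()) for line in provider.lines()])


class HeadingProcessor:
    """Processor: refines blocks in place (short all-caps lines become headings)."""
    def process(self, document):
        for block in document.blocks:
            if block.text.isupper() and len(block.text) < 40:
                block.kind = "heading"


class MarkdownRenderer:
    """Renderer: serializes the processed blocks to an output format."""
    def render(self, document):
        parts = []
        for block in document.blocks:
            if block.kind == "heading":
                parts.append(f"# {block.text.title()}")
            else:
                parts.append(block.text)
        return "\n\n".join(parts)


def convert(source, processors=(HeadingProcessor(),), renderer=None):
    """Converter: runs provider -> builder -> processors -> renderer end to end."""
    document = SimpleBuilder().build(LineProvider(source))
    for processor in processors:
        processor.process(document)
    return (renderer or MarkdownRenderer()).render(document)


if __name__ == "__main__":
    print(convert("INTRODUCTION\nMarker converts documents."))
```

Because each stage has a single small interface, swapping in a different processor list or renderer changes the output without touching the other stages, which is the kind of extension point a modular design like this is meant to offer.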
