OpenDataLoader PDF: AI-Ready Data Extraction and Accessibility Automation

Summary
OpenDataLoader PDF is an open-source tool for extracting AI-ready data from PDFs and automating PDF accessibility. It outputs structured Markdown, JSON with bounding boxes, and HTML, and ranks #1 in overall extraction accuracy (0.907 in hybrid mode). The library also offers end-to-end auto-tagging that turns untagged PDFs into screen-reader-ready Tagged PDFs, which addresses accessibility compliance needs.
Introduction
OpenDataLoader PDF is an open-source, Apache 2.0 licensed library designed to extract AI-ready structured data from PDF documents and automate PDF accessibility. It addresses common challenges in PDF processing, such as preserving document structure, ensuring correct reading order, and generating accessible content for screen readers. Developed in collaboration with the PDF Association and Dual Lab (veraPDF developers), OpenDataLoader PDF offers high accuracy and unique features like bounding box extraction for every element and end-to-end auto-tagging to Tagged PDF.
Installation
To get started with OpenDataLoader PDF, ensure you have Java 11+ and Python 3.10+ installed.
First, install the Python package:
pip install -U opendataloader-pdf
For advanced features like hybrid mode (which offers enhanced accuracy for complex documents and OCR for scanned PDFs), install with the hybrid extra:
pip install -U "opendataloader-pdf[hybrid]"
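To confirm the prerequisites listed above before running a conversion, a small check along these lines may help. This is a minimal sketch: `check_prerequisites` is a hypothetical helper written for this example, not part of the library, and it only reports the raw `java -version` banner rather than parsing it, so comparing that banner against Java 11 is left to the reader.

```python
import shutil
import subprocess
import sys


def check_prerequisites(min_python=(3, 10)):
    """Report whether the interpreter and a Java runtime look usable.

    Hypothetical helper for illustration; not part of opendataloader-pdf.
    """
    report = {
        "python_ok": sys.version_info >= min_python,
        "java_found": shutil.which("java") is not None,
        "java_version_output": None,
    }
    if report["java_found"]:
        # `java -version` normally writes its banner to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
        report["java_version_output"] = (result.stderr or result.stdout).strip()
    return report


if __name__ == "__main__":
    for key, value in check_prerequisites().items():
        print(f"{key}: {value}")
```

If `java_found` is false, the Java runtime is either missing or not on the `PATH`.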
Examples
Here are a few examples to demonstrate how to use OpenDataLoader PDF for data extraction and accessibility automation.
Extracting Structured Data
Convert PDF files to Markdown and JSON formats, which are ideal for RAG pipelines and LLMs. The JSON output includes bounding boxes for precise source citations.
import opendataloader_pdf

# Convert multiple files and/or folders in a single call
opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"
)
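Once the JSON has been written, the bounding boxes can be collected for source citations. The sketch below walks a nested JSON structure and gathers every element that carries a box. The key names (`"bounding box"`, `"page number"`, `"type"`, `"kids"`) and the sample data are assumptions made for illustration; the JSON Schema Reference listed under Links is the authority on the actual field names.

```python
import json


def collect_boxes(node, bbox_key="bounding box", page_key="page number"):
    """Recursively gather every dict that carries a bounding box.

    The default key names are assumptions; adjust them to the real schema.
    """
    found = []
    if isinstance(node, dict):
        if bbox_key in node:
            found.append({
                "type": node.get("type"),
                "page": node.get(page_key),
                "bbox": node[bbox_key],
            })
        for value in node.values():
            found.extend(collect_boxes(value, bbox_key, page_key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_boxes(item, bbox_key, page_key))
    return found


# Hypothetical sample shaped like nested document elements; real output may differ.
sample = json.loads("""
{
  "type": "document",
  "kids": [
    {"type": "heading", "page number": 1, "bounding box": [72.0, 700.0, 540.0, 730.0]},
    {"type": "paragraph", "page number": 1, "bounding box": [72.0, 640.0, 540.0, 690.0]}
  ]
}
""")

for element in collect_boxes(sample):
    print(element["page"], element["type"], element["bbox"])
```

For real output, replace `sample` with the result of `json.load` on a file from the output directory.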
Automating PDF Accessibility (Auto-Tagging)
Generate screen-reader-ready Tagged PDFs from untagged documents, a crucial step for accessibility compliance.
import opendataloader_pdf

# Untagged PDF in -> Tagged PDF out
opendataloader_pdf.convert(
    input_path=["document.pdf"],
    output_dir="output/",
    format="tagged-pdf"
)
Using Hybrid Mode for Complex PDFs
For documents with complex tables, scanned content, or formulas, enable hybrid mode for superior accuracy. This requires starting a local backend server.
Terminal 1 (Start Backend Server):
opendataloader-pdf-hybrid --port 5002
Terminal 2 (Process PDFs with Hybrid Mode):
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["complex_document.pdf"],
    output_dir="output/",
    hybrid="docling-fast"
)
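Because hybrid mode depends on the backend from Terminal 1, it can be useful to check that something is listening on the port before starting a conversion. The following is a minimal sketch using only the standard library. `backend_is_listening` is a hypothetical helper, and it assumes the server is reachable on `127.0.0.1`. It confirms only that a TCP connection succeeds, not that the listener is actually the hybrid backend.

```python
import socket


def backend_is_listening(host="127.0.0.1", port=5002, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper; the host is an assumption, the port matches the
    --port 5002 value used when starting the server above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if backend_is_listening():
    print("Hybrid backend reachable on port 5002")
else:
    print("Nothing is listening on port 5002; start opendataloader-pdf-hybrid first")
```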
Why Use It
OpenDataLoader PDF covers both AI data extraction and PDF accessibility. Its main advantages are:
Unmatched Data Extraction for AI and RAG
- Benchmark-Leading Accuracy: Achieves #1 overall extraction accuracy (0.907) in hybrid mode, with 0.928 table extraction accuracy. It correctly handles reading order, tables, and headings.
- Structured Output with Bounding Boxes: Provides JSON output with precise bounding box coordinates for every element, enabling "click to source" functionality in RAG applications.
- Comprehensive Document Understanding: Extracts complex tables, formulas (as LaTeX), and offers AI-generated descriptions for charts and images.
- OCR for Scanned Documents: Built-in OCR supports over 80 languages for processing image-based or poor-quality scanned PDFs.
- AI Safety Filters: Automatically filters hidden text, off-page content, and suspicious layers to prevent prompt injection attacks.
- Local and Fast: Runs 100% locally on CPU, with local mode processing 60+ pages per second and hybrid mode offering high accuracy without cloud dependency.
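To illustrate the "click to source" idea from the list above, a bounding box can be mapped onto a rendered page image so the cited region can be highlighted. This sketch assumes the box is given as `[left, bottom, right, top]` in PDF points with the origin at the bottom-left of the page, and that the page height in points is known. Both are assumptions to verify against the JSON schema, and `bbox_to_pixels` is a hypothetical helper rather than a library function.

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Convert a PDF-space box to a top-left-origin pixel rectangle.

    Assumes bbox = [left, bottom, right, top] in points (1 pt = 1/72 inch),
    with the y axis pointing up from the bottom of the page.
    Returns (x, y, width, height) in pixels for an image rendered at `dpi`.
    """
    left, bottom, right, top = bbox
    scale = dpi / 72.0
    x = left * scale
    # Flip the y axis: image coordinates grow downward from the top edge.
    y = (page_height_pt - top) * scale
    width = (right - left) * scale
    height = (top - bottom) * scale
    return (round(x), round(y), round(width), round(height))


# Hypothetical heading box on a US Letter page (792 pt tall), rendered at 144 DPI.
print(bbox_to_pixels([72.0, 700.0, 540.0, 730.0], page_height_pt=792.0))
# prints (144, 124, 936, 60)
```

The returned rectangle can be passed to any drawing routine that expects a top-left origin, such as an overlay in a web viewer.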
Pioneering PDF Accessibility Automation
- First Open-Source Auto-Tagging: It is the first open-source tool to provide end-to-end auto-tagging, converting untagged PDFs into screen-reader-ready Tagged PDFs under an Apache 2.0 license.
- Standards-Compliant: Developed in collaboration with the PDF Association and Dual Lab (veraPDF developers) to comply with the Well-Tagged PDF specification, with output validated using veraPDF.
- Addresses Global Regulations: Helps organizations meet requirements for the European Accessibility Act (EAA), ADA, Section 508, and other digital inclusion acts, significantly reducing manual remediation costs.
- Foundation for PDF/UA: The auto-tagging feature provides the essential Tagged PDF foundation for achieving full PDF/UA-1 or PDF/UA-2 compliance, with enterprise add-ons available for export and visual editing.
Developer-Friendly and Flexible
- Multi-Language SDKs: Python, Node.js, and Java SDKs are available for integration into existing workflows.
- LangChain Integration: Official integration available for easy use within LangChain pipelines.
- Flexible Output Formats: Supports JSON, Markdown, HTML, Annotated PDF, and Text outputs, allowing developers to choose the best format for their specific use case.
Links
- GitHub Repository: https://github.com/opendataloader-project/opendataloader-pdf
- Python Quick Start: https://opendataloader.org/docs/quick-start-python
- Node.js Quick Start: https://opendataloader.org/docs/quick-start-nodejs
- Java Quick Start: https://opendataloader.org/docs/quick-start-java
- JSON Schema Reference: https://opendataloader.org/docs/reference/json-schema
- Hybrid Mode Guide: https://opendataloader.org/docs/hybrid-mode
- PDF Accessibility Guide: https://opendataloader.org/docs/accessibility-compliance
- LangChain Integration: https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf