Repository History

Explore all analyzed open source repositories

Topic: document-parsing

OpenDataLoader PDF: AI-Ready Data Extraction and Accessibility Automation

OpenDataLoader PDF is an open-source tool for extracting AI-ready data from PDFs and automating PDF accessibility. It outputs structured Markdown, JSON with bounding boxes, and HTML, and is reported to rank #1 in extraction accuracy benchmarks. The library also offers end-to-end auto-tagging that produces screen-reader-ready Tagged PDFs for accessibility compliance.

May 30, 2026

PaddleOCR: A Powerful OCR Toolkit for Structured Document Data

PaddleOCR is a production-ready OCR and document AI engine that converts PDF and image documents into structured, AI-friendly data. It covers the pipeline from text extraction to document understanding and supports over 100 languages.

Mar 14, 2026

Unstructured: Open-Source Pre-Processing for Complex Document Data

The `unstructured` library is an open-source ETL solution that converts complex, unstructured documents into clean, structured data. It provides tools for ingesting and pre-processing document types such as PDFs, HTML, and Word documents, so their content can be fed to language models and other AI applications.

Feb 10, 2026
