Repository History
Explore all analyzed open-source repositories

OpenDataLoader PDF: AI-Ready Data Extraction and Accessibility Automation
OpenDataLoader PDF is an open-source tool for extracting AI-ready data from PDFs and automating PDF accessibility. It outputs structured Markdown, JSON with bounding boxes, and HTML, and the project reports ranking #1 in extraction accuracy benchmarks. The library also offers end-to-end auto-tagging that produces screen-reader-ready Tagged PDFs, which helps with accessibility compliance.

PaddleOCR: A Powerful OCR Toolkit for Structured Document Data
PaddleOCR is a production-ready OCR and document AI engine that converts PDFs and image documents into structured, AI-friendly data. It covers the pipeline end to end, from text extraction to document understanding, and supports over 100 languages with an emphasis on accuracy and efficiency.

Unstructured: Open-Source Pre-Processing for Complex Document Data
The `unstructured` library is an open-source ETL solution that converts complex, unstructured documents into clean, structured data. It streamlines data preparation for language models by providing tools for ingesting and pre-processing document types such as PDFs, HTML, and Word documents, turning raw content into formats suitable for AI applications.
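
As a rough illustration of the document-to-structured-data step described above, the sketch below flattens partitioned elements into plain records. It assumes the `unstructured` package is installed and relies on its `partition` function from `unstructured.partition.auto`. The helper names `elements_to_records` and `partition_to_records` are hypothetical and are not part of the library.

```python
from typing import Any, Dict, Iterable, List


def elements_to_records(elements: Iterable[Any]) -> List[Dict[str, str]]:
    """Flatten document elements into simple {"type", "text"} records.

    Hypothetical helper: it only assumes each element has a meaningful
    class name (for example Title or NarrativeText) and that str(element)
    yields the element's text.
    """
    return [{"type": type(el).__name__, "text": str(el)} for el in elements]


def partition_to_records(path: str) -> List[Dict[str, str]]:
    """Partition a file with `unstructured` and return flat records.

    The import is kept local so the helper above stays usable even when
    the `unstructured` package is not installed.
    """
    from unstructured.partition.auto import partition

    return elements_to_records(partition(filename=path))
```

Calling `partition_to_records("report.pdf")` (a placeholder path) would return one record per detected element, which can then be serialized to JSON or passed to a downstream chunking step.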