3 repositories tagged with document-processing

Topic: document-processing

pdfplumber: Extracting Data from PDFs with Ease and Precision

pdfplumber is a Python library for extracting detailed information from PDFs, including individual characters, rectangles, and lines. It also extracts text and tables, which makes it useful for data analysis and automation. It is built on pdfminer.six, which handles the underlying PDF parsing.

Analyzed Jan 24, 2026

E2M: Convert Various File Types to Markdown for RAG and LLM Training

E2M is a Python library that converts diverse file types, including documents, web pages, and audio, into Markdown. Its parser-converter architecture is intended to keep it flexible and easy to integrate. The tool is aimed at generating high-quality data for Retrieval-Augmented Generation (RAG) and large language model training.

Analyzed Dec 24, 2025

sumy: Automatic Text Summarization for Documents and HTML Pages

sumy is a Python module for automatic summarization of text documents and HTML pages. It provides several summarization methods, supports multiple natural languages, and offers both a command-line utility and a Python API. It is meant for producing concise summaries of lengthy content.

Analyzed Dec 14, 2025

