PDF Craft: Convert Scanned PDF Books to Markdown and EPUB

Summary
PDF Craft is a Python library designed to convert PDF files, especially scanned books, into various formats like Markdown and EPUB. Leveraging DeepSeek OCR, it accurately extracts text, tables, and formulas while preserving document structure. The project offers a fast, local conversion process, making it ideal for digitizing complex documents.
Introduction
PDF Craft is a Python library that converts PDF files into other formats, with a particular focus on scanned books. It uses DeepSeek OCR for document recognition and can handle complex content such as tables and formulas. The converted Markdown or EPUB files keep the structure and readability of the original document: footnotes and images are carried over, and a table of contents is generated automatically.
Installation
To get started with PDF Craft, install it with pip. The commands below pull the CPU build of PyTorch, which is enough to install the package. To actually run a conversion, you also need Poppler for PDF parsing and a configured CUDA environment for OCR.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pdf-craft
For detailed instructions on installing Poppler and configuring CUDA, please refer to the official Installation Guide.
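Before running a first conversion, it can help to confirm that the prerequisites above are visible from Python. The following is a minimal sketch and not part of PDF Craft: `check_environment` is a hypothetical helper, and it assumes Poppler is discoverable through its `pdftoppm` command on `PATH`, which may differ from how PDF Craft itself locates Poppler.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report whether the prerequisites appear to be present (hypothetical helper)."""
    report = {
        # pdftoppm is one of the command-line tools shipped with Poppler.
        # Assumption: finding it on PATH means Poppler is usable.
        "poppler": shutil.which("pdftoppm") is not None,
        # Check that PyTorch is installed without importing it yet.
        "torch": importlib.util.find_spec("torch") is not None,
        "cuda": False,
    }
    if report["torch"]:
        import torch

        # True only when this PyTorch build can see a CUDA device.
        report["cuda"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

With the CPU build of PyTorch installed by the commands above, the `cuda` entry is expected to be reported as missing, which matches the note that a CUDA setup is needed for actual conversion.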
Examples
PDF Craft provides straightforward APIs for converting PDFs to Markdown or EPUB.
Convert to Markdown
from pdf_craft import transform_markdown

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    markdown_assets_path="images",
)
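For a folder of scanned books, the same call can be repeated per file. The sketch below is an illustration under assumptions, not a PDF Craft feature: `convert_folder` is a hypothetical helper, and the converter is passed in as a parameter so that the loop can be tried without the OCR stack installed. It reuses only the three keyword arguments shown in the example above. How `markdown_assets_path` is resolved (relative to the Markdown file or to the working directory) is not stated here, so the value `"images"` is simply copied from that example.

```python
from pathlib import Path
from typing import Callable


def convert_folder(
    src_dir: str, out_dir: str, convert: Callable[..., None]
) -> list[Path]:
    """Convert every PDF in src_dir, giving each book its own output folder.

    `convert` is expected to accept the keyword arguments pdf_path,
    markdown_path and markdown_assets_path, as in the example above.
    """
    written = []
    # Sort so the processing order is stable across runs.
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        book_dir = Path(out_dir) / pdf.stem
        book_dir.mkdir(parents=True, exist_ok=True)
        markdown_path = book_dir / f"{pdf.stem}.md"
        convert(
            pdf_path=str(pdf),
            markdown_path=str(markdown_path),
            # Assumption: same value as the single-file example.
            markdown_assets_path="images",
        )
        written.append(markdown_path)
    return written
```

With PDF Craft installed, this would be called as `convert_folder("scans", "out", transform_markdown)` after `from pdf_craft import transform_markdown`; the folder names `scans` and `out` are placeholders.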
Convert to EPUB
from pdf_craft import transform_epub, BookMeta

transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)
Why Use PDF Craft
PDF Craft stands out for being lightweight and fast. Because it relies entirely on DeepSeek OCR and runs locally, conversion involves no network requests and none of the waiting they bring. It identifies document structure, extracts body text, and filters out interfering elements such as headers and footers, which makes it well suited to academic and technical documents. An online demo platform is also available for trying it out without any installation.