PDF Craft: Convert Scanned PDF Books to Markdown and EPUB

Introduction

PDF Craft is a Python library for converting PDF files into other formats, with a particular focus on scanned books. It uses DeepSeek OCR for document recognition and can handle complex content such as tables and formulas. The converted Markdown or EPUB files preserve the structure and readability of the original document, including footnotes and images, and come with an automatically generated table of contents.

Installation

Install PDF Craft with pip. Actual conversion also requires Poppler for PDF parsing and a CUDA environment for OCR. The first command below installs the CPU-only PyTorch wheels; running OCR on a GPU requires a CUDA-enabled PyTorch build instead.

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pdf-craft

For detailed instructions on installing Poppler and configuring CUDA, please refer to the official Installation Guide.
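
Before starting a long conversion, it can help to confirm that the prerequisites named above are visible from Python. The following is a minimal sketch, not part of PDF Craft: check_environment is a hypothetical helper, and it assumes that Poppler is reachable through PATH (it looks for pdftoppm, one of the command-line tools Poppler ships) and that CUDA support is reported by PyTorch. How PDF Craft itself locates these dependencies is not stated here.

```python
import shutil


def check_environment() -> dict:
    """Report whether Poppler and a CUDA-capable PyTorch appear to be available.

    Hypothetical helper for illustration; it is not part of the pdf-craft API.
    """
    report = {
        # pdftoppm is one of the command-line tools shipped with Poppler.
        "poppler": shutil.which("pdftoppm") is not None,
        "torch": False,
        "cuda": False,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so CUDA cannot be checked either.
        return report
    report["torch"] = True
    report["cuda"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

With the CPU-only wheels from the commands above, the cuda entry is expected to be reported as missing, since those builds carry no CUDA support.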

Examples

PDF Craft provides straightforward APIs for converting PDFs to Markdown or EPUB.

Convert to Markdown

from pdf_craft import transform_markdown

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    markdown_assets_path="images",
)
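
The same call can be wrapped to process a whole folder of scans. The sketch below is an illustration under stated assumptions rather than a PDF Craft feature: convert_folder and its convert parameter are hypothetical names, the one-subfolder-per-book layout is an arbitrary choice, and markdown_assets_path is passed exactly as in the example above without assuming how the library resolves it. Only the three keyword arguments shown above are used.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_folder(
    src_dir,
    out_dir,
    convert: Optional[Callable[..., None]] = None,
) -> List[Path]:
    """Convert every *.pdf in src_dir to Markdown under out_dir/<book name>/.

    Hypothetical helper; `convert` defaults to pdf_craft.transform_markdown
    and can be swapped for a stub when testing.
    """
    if convert is None:
        # Imported lazily so the helper can be exercised without pdf-craft.
        from pdf_craft import transform_markdown as convert

    outputs = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        # One output folder per book keeps each book's files together.
        book_dir = Path(out_dir) / pdf.stem
        book_dir.mkdir(parents=True, exist_ok=True)
        md_path = book_dir / f"{pdf.stem}.md"
        convert(
            pdf_path=str(pdf),
            markdown_path=str(md_path),
            markdown_assets_path="images",
        )
        outputs.append(md_path)
    return outputs


if __name__ == "__main__":
    for path in convert_folder("scans", "converted"):
        print(path)
```

Passing the converter as a parameter keeps the loop independent of the OCR stack, so the folder handling can be checked with a stub before committing to a long conversion run.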

Convert to EPUB

from pdf_craft import transform_epub, BookMeta

transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)

Why Use PDF Craft

PDF Craft is lightweight and fast. Because it relies entirely on DeepSeek OCR running locally, conversion involves no network requests and no waiting on a remote service. It identifies document structure, extracts body text, and filters out interfering elements such as headers and footers, which makes it well suited to academic and technical documents. An online demo is also available for trying it without installing anything.
