pypdf: A Powerful Pure-Python Library for PDF Manipulation

Summary
pypdf is a free and open-source pure-Python library designed for comprehensive PDF manipulation. It allows users to split, merge, crop, and transform PDF pages, as well as add custom data, viewing options, and passwords. The library also supports extracting text and metadata from PDF files, making it a versatile tool for various PDF-related tasks.
Introduction
pypdf lets developers work with PDF files programmatically from pure Python. It can split, merge, crop, and transform pages; add custom data, viewing options, and password protection; and extract text and metadata. Together these features make it well suited to automating PDF workflows.
Installation
Getting started with pypdf is straightforward using pip.
pip install pypdf
For advanced features like AES encryption or decryption, you can install additional dependencies:
pip install "pypdf[crypto]"
Note that pypdf 3.1.0 and later include significant improvements over previous versions. Refer to the official migration guide for details.
Examples
Here's a quick example demonstrating how to read a PDF and extract text from its first page:
from pypdf import PdfReader

# Open the PDF; "example.pdf" is a placeholder path
reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)

# Pages are zero-indexed, so index 0 is the first page
page = reader.pages[0]
text = page.extract_text()

print(f"Number of pages: {number_of_pages}")
print(f"Text from first page: {text[:200]}...")  # Print first 200 characters
pypdf supports many other operations, such as splitting, merging, reading and creating annotations, and encryption/decryption. Check out the documentation for additional usage examples!
Why Use pypdf
pypdf stands out as a comprehensive solution for PDF handling in Python due to several key advantages. Its pure-Python implementation ensures broad compatibility and ease of integration into Python projects without external binaries. The library's extensive feature set covers everything from basic page manipulation to advanced tasks like metadata extraction and security. With an active development team and a supportive community, pypdf is continuously improved and well-maintained, offering reliable performance for your PDF processing needs.
Links
- GitHub Repository: https://github.com/py-pdf/pypdf
- Official Documentation: https://pypdf.readthedocs.io/en/stable/
- StackOverflow (Q&A): https://stackoverflow.com/questions/tagged/pypdf