python-readability: Extract Clean Main Content from HTML Documents

Summary
python-readability is a fast Python port of arc90's Readability tool. Given an HTML document, it extracts and cleans the main body text and the title, stripping away navigation, ads, and other page chrome. The library is updated regularly to track features in readability.js.
Introduction
python-readability lets developers pull the main body text and title out of an HTML document. It is a port of arc90's original Readability project and is updated to follow readability.js. It is useful wherever clean, focused content is needed from web pages, such as building RSS feeds, content aggregators, or text analysis tools.
Installation
Install with pip:
$ pip install readability-lxml
Alternatively, with conda:
$ conda install -c conda-forge readability-lxml
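Note that the distribution is named readability-lxml while the module you import is readability. A minimal sketch for checking which version is installed, using only the standard library; the helper name installed_version is hypothetical and not part of the package:

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of python-readability.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Look up the distribution name (readability-lxml), not the import name.
    print(installed_version("readability-lxml"))
```

If this prints None, the package is not installed in the current environment.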
Examples
Using python-readability is simple. Here's a quick example demonstrating how to fetch a web page and extract its title and summary:
import requests
from readability import Document

# Fetch the page; response.content holds the raw HTML bytes
response = requests.get('http://example.com', timeout=10)
doc = Document(response.content)
print(doc.title())
# Output: 'Example Domain'
print(doc.summary())
# Output: "<html><body><div><body id=\"readabilityBody\">\n<div>\n <h1>Example Domain</h1>\n\n<p>This domain is established to be used for illustrative examples in documents. You may\nuse this\n domain in examples without prior coordination or asking for permission.</p>\n\n <p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>\n</div>\n\n</body>\n</div></body></html>"
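As the output shows, summary() returns cleaned HTML rather than plain text. For text analysis you may want to strip the remaining tags. The sketch below is one possible approach using only the standard library; html_to_text and _TextCollector are hypothetical names, not part of python-readability, and the sample string is a shortened version of the output above.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.chunks).split())


# Shortened stand-in for the string returned by doc.summary()
sample = (
    '<html><body><div><body id="readabilityBody">\n<div>\n'
    " <h1>Example Domain</h1>\n\n<p>This domain is established to be used for "
    "illustrative examples in documents.</p>\n</div>\n\n</body>\n</div></body></html>"
)

print(html_to_text(sample))
```

This naive collector keeps every text node, so it assumes the input has already been cleaned; in your own code you would pass doc.summary() instead of the sample string.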
Why use it
python-readability stands out for several reasons:
- Speed and Efficiency: It's a fast implementation, crucial for processing large volumes of data.
- Modern Compatibility: Regularly updated to match readability.js, ensuring it works well with contemporary web content.
- Comprehensive Extraction: Beyond just text, it can extract titles, handle images (including saving all images with keep_all_images=True), and support CJK characters.
- Python 3.x Support: Compatible with Python 3.8 through 3.13.
- Clean Output: Replaces XHTML output with HTML5 in summary() calls, providing modern and cleaner HTML.
- Active Development: The change log indicates continuous improvements and bug fixes, reflecting an actively maintained project.
Links
- GitHub Repository: https://github.com/buriy/python-readability
- PyPI Package: https://pypi.python.org/pypi/readability-lxml