python-readability: Extract Clean Main Content from HTML Documents

Introduction

python-readability is a fast Python library that extracts the main body text and title from an HTML document. It is a port of arc90's original Readability project and is updated to track the features of readability.js. It suits tasks that need clean, focused content from web pages, such as building RSS feeds, content aggregators, or text analysis tools.

Installation

The package is published as readability-lxml, although it is imported as readability. Install it with pip:

$ pip install readability-lxml

Alternatively, with conda:

$ conda install -c conda-forge readability-lxml

Examples

The following example fetches a web page with requests (installed separately) and extracts its title and summary. Note that summary() returns the cleaned HTML of the main content, not plain text:

import requests
from readability import Document

response = requests.get('http://example.com')
doc = Document(response.content)
print(doc.title())
# Output: 'Example Domain'

print(doc.summary())
# Output: "<html><body><div><body id=\"readabilityBody\">\n<div>\n    <h1>Example Domain</h1>\n\n<p>This domain is established to be used for illustrative examples in documents. You may\nuse this\n    domain in examples without prior coordination or asking for permission.</p>\n\n    <p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>\n</div>\n\n</body>\n</div></body></html>"
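Because summary() returns HTML, a plain-text version needs one more step. The following is a minimal sketch of that step using only the standard library; the TextExtractor class and html_to_text function are hypothetical helpers for illustration and are not part of python-readability. The input string is a shortened stand-in for the summary() output shown above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags (hypothetical helper, not part of readability)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Shortened stand-in for the string returned by doc.summary()
summary_html = (
    '<html><body><div><body id="readabilityBody">\n<div>\n'
    "    <h1>Example Domain</h1>\n\n"
    "<p>This domain is established to be used for illustrative examples.</p>\n"
    "</div>\n\n</body>\n</div></body></html>"
)

text = html_to_text(summary_html)
print(text)
# Output: Example Domain This domain is established to be used for illustrative examples.
```

This approach discards all structure, including paragraph breaks and link targets. If those matter, a dedicated HTML-to-text tool is likely a better fit than this sketch.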

Why use it

python-readability stands out for several reasons:

Speed and Efficiency: The implementation is fast, which matters when processing large volumes of pages.
Modern Compatibility: It is updated to match readability.js, so it copes well with contemporary web content.
Comprehensive Extraction: Beyond body text, it extracts titles, handles images (including keeping all images with keep_all_images=True), and supports CJK characters.
Python 3.x Support: Compatible with Python 3.8 through 3.13.
Clean Output: summary() returns HTML5 rather than XHTML.
Active Development: The change log shows ongoing improvements and bug fixes, which suggests the project is actively maintained.
