python-readability: Extract Clean Main Content from HTML Documents

Summary

python-readability is a fast Python port of arc90's Readability tool. Given an HTML document, it extracts and cleans the main body text and the title, so downstream code can work with the essential content of a page. The library is kept up to date with the latest readability.js.

Repository Info

Updated on November 7, 2025

Introduction

python-readability is a Python library for extracting the main body text and title from an HTML document. It is a port of arc90's original Readability project and is updated to track the latest readability.js features. It suits tasks that need clean, focused content from web pages, such as building RSS feeds, content aggregators, or text analysis tools.

Installation

Install the package with pip. Note that the distribution is named readability-lxml, while the module you import is readability.

$ pip install readability-lxml

Alternatively, with conda:

$ conda install -c conda-forge readability-lxml
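
Because the distribution name (readability-lxml) differs from the import name (readability), it can be handy to confirm what is actually installed. The following is a minimal sketch using only the standard library; installed_version is a hypothetical helper introduced here for illustration, not part of python-readability.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Hypothetical helper: return the installed version string, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Look up the distribution name used with pip, not the import name.
print(installed_version("readability-lxml"))
```

A result of None suggests the package is not installed in the active environment.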

Examples

The example below fetches a web page and prints its title and summary. Note that summary() returns cleaned HTML, not plain text:

import requests
from readability import Document

# Fetch the page and hand the raw response body to Document.
response = requests.get('http://example.com')
doc = Document(response.content)

# title() returns the page title.
print(doc.title())
# Output: 'Example Domain'

print(doc.summary())
# Output: "<html><body><div><body id=\"readabilityBody\">\n<div>\n    <h1>Example Domain</h1>\n\n<p>This domain is established to be used for illustrative examples in documents. You may\nuse this\n    domain in examples without prior coordination or asking for permission.</p>\n\n    <p><a href=\"http://www.iana.org/domains/example\">More information...</a></p>\n</div>\n\n</body>\n</div></body></html>"
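
As the output above shows, summary() returns an HTML string. For text analysis you may want plain text instead. The following is a minimal sketch that strips tags with the standard library's html.parser; TextExtractor and html_to_text are hypothetical names introduced here, not part of python-readability, and the sample string is a shortened stand-in for real summary() output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Collapse runs of whitespace and skip whitespace-only nodes.
        text = " ".join(data.split())
        if text:
            self.parts.append(text)


def html_to_text(html):
    """Hypothetical helper: reduce HTML such as summary() output to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Shortened stand-in for the doc.summary() output shown above.
sample = (
    '<html><body><div><body id="readabilityBody">\n<div>\n'
    "    <h1>Example Domain</h1>\n\n"
    "<p>This domain is established to be used for illustrative examples.</p>\n"
    '    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n'
    "</div>\n\n</body>\n</div></body></html>"
)

print(html_to_text(sample))
```

In practice you would pass doc.summary() instead of sample. This approach drops all markup, including link targets, so it is only suitable when the text alone matters.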

Why use it

python-readability stands out for several reasons:

  • Speed and Efficiency: It is a fast implementation, which matters when processing large volumes of pages.
  • Modern Compatibility: It is regularly updated to match readability.js, so it works well with contemporary web content.
  • Comprehensive Extraction: Beyond body text, it can extract titles, handle images (including saving all images with keep_all_images=True rather than only the main one), and handle CJK characters.
  • Python 3.x Support: It is compatible with Python 3.8 through 3.13.
  • Clean Output: summary() returns HTML5 rather than XHTML.
  • Active Development: The change log shows continuing improvements and bug fixes, which points to an actively maintained project.

Links