sumy: Automatic Text Summarization for Documents and HTML Pages

Summary

sumy is a robust Python module designed for automatic summarization of text documents and HTML pages. It provides various summarization methods, supports multiple natural languages, and offers both a command-line utility and a flexible Python API. This versatile tool enables users to efficiently extract concise summaries from lengthy content.

Introduction

sumy is a powerful and easy-to-use Python library for automatic text summarization. It allows you to extract concise summaries from various sources, including plain text documents and HTML pages. Built with flexibility in mind, sumy supports several popular summarization algorithms, such as LexRank, LSA, Luhn, and Edmundson, making it adaptable to different summarization needs. Furthermore, it boasts multi-language support, with an extensible framework to add new languages easily.

Installation

Getting started with sumy is straightforward. Ensure you have Python 3.6+ and pip installed on your system.

To install the stable version:

$ pip install sumy

For the very latest version directly from the GitHub repository:

$ pip install git+https://github.com/miso-belica/sumy.git
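
After either pip command, a quick check from Python can confirm that the package is importable in the current environment. The helper name `is_installed` is made up for this sketch; it relies only on the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once `pip install sumy` has succeeded in this environment.
print("sumy installed:", is_installed("sumy"))
```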

You can also run sumy as a Docker container, avoiding local installation complexities:

$ docker run --rm misobelica/sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization

Examples

sumy provides both a command-line interface for quick summarization and a Python API for integration into your projects.

Command-Line Usage

Summarize content directly from a URL:

$ sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization
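
The same command can be driven from a script. The sketch below builds the argument list shown above and runs it with `subprocess`; the helper names `build_sumy_command` and `run_sumy` are hypothetical, and it assumes the `sumy` executable is on your PATH. Only the `--length` and `--url` options from the command above are used.

```python
import subprocess


def build_sumy_command(method, length, url):
    """Return the argument list for a `sumy` call like the one above."""
    return ["sumy", method, "--length={}".format(length), "--url={}".format(url)]


def run_sumy(method, length, url):
    """Run the CLI and return its standard output as text.

    Assumes the `sumy` executable is on PATH (for example after `pip install sumy`).
    """
    completed = subprocess.run(
        build_sumy_command(method, length, url),
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; this spelling also works on Python 3.6
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

Calling `run_sumy("lex-rank", 10, "https://en.wikipedia.org/wiki/Automatic_summarization")` should return the same text the shell command prints.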

Get help and explore more options:

$ sumy --help

sumy also includes a utility for evaluating summarization methods:

$ sumy_eval lex-rank reference_summary.txt --url=https://en.wikipedia.org/wiki/Automatic_summarization
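
To give a feel for what comparing a generated summary against a reference summary involves, here is a minimal word-overlap scorer in plain Python. It is an illustration only, not the code behind `sumy_eval`, and the metrics that `sumy_eval` actually reports may differ. The function names are made up for this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) based on shared word counts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = unigram_overlap("the cat sat", "the cat sat on the mat")
print(precision, recall, round(f1, 3))
```

Here every word of the candidate appears in the reference (precision 1.0), but the candidate covers only half of the reference's six words (recall 0.5).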

Python API

Integrate sumy into your Python applications as a library. Here's a basic example to summarize an HTML page:

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
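    # The stemmer reduces words to their stems so inflected forms count as one word
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    # Stop words (very common words) are ignored when sentences are scored
    summarizer.stop_words = get_stop_words(LANGUAGE)

    # Calling the summarizer returns the sentences selected for the summary
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)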
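
The example above hard-codes LSA through the `LsaSummarizer as Summarizer` import. A possible way to switch algorithm and language at run time is sketched below. The `SUMMARIZERS` table and both helper functions are hypothetical, and the module paths for LexRank and Luhn are assumptions about sumy's package layout, so check them against your installed version. Edmundson is left out because it is understood to need extra word lists before it can run.

```python
import importlib

# Module and class names for three of the algorithms named in this document.
# The lex_rank and luhn paths are assumptions; only the lsa path appears above.
SUMMARIZERS = {
    "lsa": ("sumy.summarizers.lsa", "LsaSummarizer"),
    "lex-rank": ("sumy.summarizers.lex_rank", "LexRankSummarizer"),
    "luhn": ("sumy.summarizers.luhn", "LuhnSummarizer"),
}


def load_summarizer_class(name):
    """Import and return the summarizer class registered under `name`."""
    if name not in SUMMARIZERS:
        raise ValueError("unknown summarizer: {!r}".format(name))
    module_name, class_name = SUMMARIZERS[name]
    return getattr(importlib.import_module(module_name), class_name)


def summarize_text(text, method="lsa", language="english", sentences_count=3):
    """Summarize a plain-text string and return the chosen sentences as strings."""
    # Imported lazily so this module can be loaded without sumy installed.
    from sumy.nlp.stemmers import Stemmer
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.utils import get_stop_words

    parser = PlaintextParser.from_string(text, Tokenizer(language))
    summarizer = load_summarizer_class(method)(Stemmer(language))
    summarizer.stop_words = get_stop_words(language)
    return [str(sentence) for sentence in summarizer(parser.document, sentences_count)]
```

With sumy installed, `summarize_text(long_text, method="luhn", sentences_count=2)` would return two sentences from `long_text`. Depending on the setup, the tokenizer may also need NLTK sentence-tokenizer data to be downloaded first.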

Why Use sumy?

sumy stands out as an excellent choice for text summarization due to several key features:

Versatile Input: It can process both plain text and HTML content, making it suitable for a wide range of applications, from local documents to web scraping.
Multiple Algorithms: With implementations of various summarization techniques like LSA, LexRank, Luhn, and Edmundson, you can choose the method best suited for your specific summarization task.
Multi-language Support: sumy is designed to support multiple natural languages, and its architecture makes it easy to extend support for new languages.
Ease of Use: Whether you prefer a quick command-line summary or deep integration into a Python project, sumy offers intuitive interfaces for both.
Active Development: The project is maintained on GitHub at miso-belica/sumy, and its star and fork counts point to a sizeable user community.
Evaluation Framework: It includes tools for evaluating the quality of generated summaries, which is crucial for research and fine-tuning.
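
For readers new to the topic, the toy scorer below shows the general idea behind picking sentences out of a document: score each sentence, keep the best ones, and print them in their original order. It is a deliberately naive word-frequency sketch in plain Python, not an implementation of LSA, LexRank, Luhn, Edmundson, or any other sumy algorithm, and its stop-word list and sentence splitter are stand-ins invented for this example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real tool would use per-language lists.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "on"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Return the lowercase words of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, sentences_count):
    """Return the `sentences_count` highest-scoring sentences in document order."""
    sentences = split_sentences(text)
    frequencies = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:sentences_count])]


print(summarize("Cats sleep a lot. Cats chase mice. Dogs bark.", 2))
```

The two sentences about cats win here because "cats" is the most frequent content word in the sample text.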