Newspaper3k: Advanced News and Article Extraction in Python

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

Newspaper3k is a Python 3 library for news, full-text, and article metadata extraction. Inspired by requests for its simplicity and powered by lxml for its speed, it provides tools for scraping and curating articles from many sources. It suits developers who need to gather and process news content programmatically, with built-in keyword extraction and summarization.

Repository Information

Analyzed by OSRepos on October 13, 2025

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

Newspaper3k is a Python 3 library that streamlines extracting news, full-text content, and article metadata from websites. Its API aims to be as simple as that of the requests library, and it uses lxml for fast parsing. It can pull authors, publication dates, main text, and images, and it offers basic Natural Language Processing (NLP) for keywords and summaries.

Beyond single-page scraping, the library can identify article URLs on a news site, download articles with multiple threads, and handle more than 10 languages, including English, Chinese, German, and Arabic.

Installation

To get started with Newspaper3k, ensure you are using Python 3. The library is installed via pip3.

Important: Install newspaper3k, not newspaper. The newspaper package is for Python 2 and is deprecated.

pip3 install newspaper3k

For Debian / Ubuntu users, you might need to install additional dependencies:

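# pip for Python 3 and the Python 3 headers (Python.h)
sudo apt-get install python3-pip
sudo apt-get install python3-dev
# build requirements for lxml and for image handling
sudo apt-get install libxml2-dev libxslt-dev
sudo apt-get install libjpeg-dev zlib1g-dev libpng-dev
# install the library first: the corpora script below imports nltk, which is installed as a dependency
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3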

For macOS (OS X) users, the commands below use Homebrew; MacPorts can be used instead:

brew install libxml2 libxslt
brew install libtiff libjpeg webp little-cms2
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Examples

Here are some examples demonstrating how to use Newspaper3k to extract information from articles and news sources.

Extracting a Single Article

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)

article.download()
article.parse()

print(f"Authors: {article.authors}")
print(f"Publish Date: {article.publish_date}")
print(f"Text: {article.text[:200]}...")
print(f"Top Image: {article.top_image}")
print(f"Movies: {article.movies}")

# nlp() relies on the NLTK corpora fetched by download_corpora.py (see Installation)
article.nlp()

print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")
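
Network fetches can fail, and in recent releases parse() raises an exception when the preceding download() did not succeed. The sketch below wraps the two steps in a helper that returns a plain dict, or None on failure. The helper name extract_article and its article_cls parameter are assumptions introduced here for illustration (the parameter only exists so a stand-in class can be substituted); Article, download(), parse(), and the attributes read are the ones used in the example above.

```python
def extract_article(url, article_cls=None, language=None):
    """Download and parse one article; return a dict of fields, or None on failure.

    extract_article and article_cls are illustrative names, not part of newspaper3k.
    """
    if article_cls is None:
        # Imported lazily so the helper can be exercised without the library installed.
        from newspaper import Article
        article_cls = Article

    # Only pass language when the caller asked for one; otherwise let the library decide.
    kwargs = {"language": language} if language else {}
    article = article_cls(url, **kwargs)

    try:
        article.download()
        article.parse()
    except Exception as exc:  # broad on purpose: covers the library's own exception type
        print(f"Skipping {url}: {exc}")
        return None

    return {
        "url": url,
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "text": article.text,
        "top_image": article.top_image,
    }
```

Calling extract_article(url) with the URL from the example above would return the same fields as the print statements, collected in one dict that is easy to store or serialize.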

Building a News Source (Paper)

import newspaper

cnn_paper = newspaper.build('http://cnn.com')

print("First 5 article URLs from CNN:")
for article in cnn_paper.articles[:5]:
    print(article.url)

print("\nCategory URLs from CNN:")
for category in cnn_paper.category_urls():
    print(category)

# You can then download, parse, and NLP individual articles from the paper
cnn_article = cnn_paper.articles[0]
cnn_article.download()
cnn_article.parse()
cnn_article.nlp()
print(f"\nFirst CNN article title: {cnn_article.title}")
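
One caveat about the example above: by default newspaper.build remembers articles it has already seen and leaves them out on later runs, so cnn_paper.articles can be short or even empty, and articles[0] would then raise an IndexError. The following is a minimal sketch that turns that cache off with memoize_articles=False and guards against an empty source. The helper name first_titles and its build parameter are assumptions made for this illustration (the parameter allows a stand-in for newspaper.build).

```python
def first_titles(source_url, limit=3, build=None):
    """Return the titles of up to `limit` articles from a news source.

    first_titles and its build parameter are hypothetical, for illustration only.
    """
    if build is None:
        import newspaper  # lazy import: only needed when no stand-in is supplied
        build = newspaper.build

    # memoize_articles=False asks newspaper not to skip articles seen on earlier runs.
    paper = build(source_url, memoize_articles=False)
    if not paper.articles:
        return []

    titles = []
    for article in paper.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # skip articles whose download or parse failed
            print(f"Skipping {getattr(article, 'url', '?')}: {exc}")
            continue
        titles.append(article.title)
    return titles
```

Keeping the default cache is usually preferable for recurring crawls, since it avoids re-fetching old articles; disabling it is mainly useful when experimenting.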

Language Detection and Specific Language Usage

Newspaper3k can automatically detect languages or be instructed to use a specific one.

import newspaper
from newspaper import Article

# Example with Chinese article, specifying language
url_chinese = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a_chinese = Article(url_chinese, language='zh') # Chinese

a_chinese.download()
a_chinese.parse()

print(f"Chinese Article Title: {a_chinese.title}")
print(f"Chinese Article Text (first 150 chars): {a_chinese.text[:150]}...")

# Building a paper for a specific language
sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')
article_sina = sina_paper.articles[0]
article_sina.download()
article_sina.parse()
print(f"\nSina Article Title: {article_sina.title}")

Why Use Newspaper3k?

Newspaper3k stands out for several reasons, making it an excellent choice for news and article extraction tasks:

  • Robust Extraction: It reliably extracts text, authors, publication dates, top images, and all images from HTML content.
  • NLP Capabilities: The built-in nlp() step produces keywords and a short summary for each parsed article.
  • Multi-threaded Downloads: Efficiently download multiple articles concurrently, speeding up data collection.
  • Multi-language Support: Works in over 10 languages, with seamless auto-detection or explicit language specification, making it versatile for global news sources.
  • News URL Identification: Smartly identifies news-related URLs, helping to focus your scraping efforts.
  • Ease of Use: Its API is designed to be intuitive and straightforward, allowing developers to quickly integrate it into their projects.
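
The multi-threaded downloads mentioned in the list above are exposed through the library's news_pool object. A rough sketch of how it might be used follows; the helper name download_sources and its newspaper_module parameter are assumptions introduced here (the parameter only allows a stand-in module to be passed), while build, news_pool.set, and news_pool.join are the library's own calls.

```python
def download_sources(source_urls, threads_per_source=2, newspaper_module=None):
    """Build each news source, then download all of their articles concurrently.

    download_sources and newspaper_module are illustrative names, not library API.
    """
    if newspaper_module is None:
        import newspaper as newspaper_module  # lazy import of the real library

    # One Source object per site; building discovers the article URLs.
    papers = [newspaper_module.build(url) for url in source_urls]

    pool = newspaper_module.news_pool
    # Threads are allotted per source, so each site is hit by only a few workers.
    pool.set(papers, threads_per_source=threads_per_source)
    pool.join()  # blocks until every queued download has finished

    return papers
```

After download_sources returns, the articles are downloaded but not yet parsed, so parse() still has to be called on each article before reading its title or text.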

