{"name":"Newspaper3k: Advanced News and Article Extraction in Python","description":"Newspaper3k is a powerful Python 3 library designed for news, full-text, and article metadata extraction. Inspired by the simplicity of 'requests' and the speed of 'lxml', it provides robust tools for scraping and curating articles from various sources. This library is ideal for developers needing to programmatically gather and process news content with advanced NLP capabilities.","github":"https://github.com/codelucas/newspaper","url":"https://osrepos.com/repo/codelucas-newspaper","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/codelucas-newspaper","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/codelucas-newspaper.md","json":"https://osrepos.com/repo/codelucas-newspaper.json","topics":["crawler","crawling","news","news-aggregator","python","scraper","web scraping","data extraction"],"keywords":["crawler","crawling","news","news-aggregator","python","scraper","web scraping","data extraction"],"stars":null,"summary":"Newspaper3k is a powerful Python 3 library designed for news, full-text, and article metadata extraction. Inspired by the simplicity of 'requests' and the speed of 'lxml', it provides robust tools for scraping and curating articles from various sources. This library is ideal for developers needing to programmatically gather and process news content with advanced NLP capabilities.","content":"## Introduction\n\nNewspaper3k is an exceptional Python 3 library that streamlines the process of extracting news, full-text content, and article metadata from websites. It's built to be simple to use, much like the `requests` library, and leverages `lxml` for high-speed parsing. 
Whether you need to pull authors, publication dates, main text, images, or even perform Natural Language Processing (NLP) for keywords and summaries, Newspaper3k offers a comprehensive solution.\n\nThis library is not just about basic scraping; it's designed for advanced article curation, capable of identifying news URLs, handling multi-threaded downloads, and working seamlessly across more than 10 languages, including English, Chinese, German, and Arabic.\n\n## Installation\n\nTo get started with Newspaper3k, ensure you are using Python 3. The library is installed via `pip3`.\n\n**Important:** Install `newspaper3k`, not `newspaper`. The `newspaper` package is for Python 2 and is deprecated.\n\nbash\npip3 install newspaper3k\n\n\nFor Debian / Ubuntu users, install the system dependencies first, then the library, and finally the NLP corpora (the corpora script needs the library's dependencies to be installed already):\n\nbash\nsudo apt-get install python3-pip\nsudo apt-get install python3-dev\nsudo apt-get install libxml2-dev libxslt-dev\nsudo apt-get install libjpeg-dev zlib1g-dev libpng-dev\npip3 install newspaper3k\ncurl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3\n\n\nFor macOS users, using Homebrew:\n\nbash\nbrew install libxml2 libxslt\nbrew install libtiff libjpeg webp little-cms2\npip3 install newspaper3k\ncurl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3\n\n\n## Examples\n\nHere are some examples demonstrating how to use Newspaper3k to extract information from articles and news sources.\n\n### Extracting a Single Article\n\npython\nfrom newspaper import Article\n\nurl = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'\narticle = Article(url)\n\n# Fetch the HTML, then extract the article fields from it\narticle.download()\narticle.parse()\n\nprint(f\"Authors: {article.authors}\")\nprint(f\"Publish Date: {article.publish_date}\")\nprint(f\"Text: {article.text[:200]}...\")\nprint(f\"Top Image: {article.top_image}\")\nprint(f\"Movies: {article.movies}\")\n\n# nlp() must run after parse(); it populates keywords and summary\narticle.nlp()\n\nprint(f\"Keywords: 
{article.keywords}\")\nprint(f\"Summary: {article.summary}\")\n\n\n### Building a News Source (Paper)\n\npython\nimport newspaper\n\ncnn_paper = newspaper.build('http://cnn.com')\n\nprint(\"First 5 article URLs from CNN:\")\nfor article in cnn_paper.articles[:5]:\n    print(article.url)\n\nprint(\"\\nCategory URLs from CNN:\")\nfor category in cnn_paper.category_urls():\n    print(category)\n\n# You can then download, parse, and NLP individual articles from the paper\ncnn_article = cnn_paper.articles[0]\ncnn_article.download()\ncnn_article.parse()\ncnn_article.nlp()\nprint(f\"\\nFirst CNN article title: {cnn_article.title}\")\n\n\n### Language Detection and Specific Language Usage\n\nNewspaper3k can automatically detect languages or be instructed to use a specific one.\n\npython\nfrom newspaper import Article\n\n# Example with Chinese article, specifying language\nurl_chinese = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'\na_chinese = Article(url_chinese, language='zh') # Chinese\n\na_chinese.download()\na_chinese.parse()\n\nprint(f\"Chinese Article Title: {a_chinese.title}\")\nprint(f\"Chinese Article Text (first 150 chars): {a_chinese.text[:150]}...\")\n\n# Building a paper for a specific language\nsina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')\narticle_sina = sina_paper.articles[0]\narticle_sina.download()\narticle_sina.parse()\nprint(f\"\\nSina Article Title: {article_sina.title}\")\n\n\n## Why Use Newspaper3k?\n\nNewspaper3k stands out for several reasons, making it an excellent choice for news and article extraction tasks:\n\n*   **Robust Extraction**: It reliably extracts text, authors, publication dates, top images, and all images from HTML content.\n*   **NLP Capabilities**: Built-in Natural Language Processing allows for keyword and summary extraction, providing deeper insights into article content.\n*   **Multi-threaded Downloads**: Efficiently download multiple articles concurrently, 
speeding up data collection.\n*   **Multi-language Support**: Works in over 10 languages, with seamless auto-detection or explicit language specification, making it versatile for global news sources.\n*   **News URL Identification**: Smartly identifies news-related URLs, helping to focus your scraping efforts.\n*   **Ease of Use**: Its API is designed to be intuitive and straightforward, allowing developers to quickly integrate it into their projects.\n\n## Links\n\n*   **GitHub Repository**: [https://github.com/codelucas/newspaper](https://github.com/codelucas/newspaper){:target=\"_blank\"}\n*   **Official Documentation**: [https://newspaper.readthedocs.io](https://newspaper.readthedocs.io){:target=\"_blank\"}\n*   **Online Demo**: [http://newspaper-demo.herokuapp.com](http://newspaper-demo.herokuapp.com){:target=\"_blank\"}\n*   **Another Online Demo**: [http://newspaper.chinazt.cc/](http://newspaper.chinazt.cc/){:target=\"_blank\"}","metrics":{"detailViews":5,"githubClicks":7},"dates":{"published":null,"modified":"2025-10-13T15:00:52.000Z"}}