{"name":"Trafilatura: Advanced Web Scraping and Text Extraction in Python","description":"Trafilatura is a robust Python package and command-line tool designed for gathering text and metadata from the web. It simplifies web crawling, scraping, and content extraction, transforming raw HTML into structured data. Widely adopted by major companies and institutions, it offers high efficiency and accuracy for various text processing needs.","github":"https://github.com/adbar/trafilatura","url":"https://osrepos.com/repo/adbar-trafilatura","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/adbar-trafilatura","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/adbar-trafilatura.md","json":"https://osrepos.com/repo/adbar-trafilatura.json","topics":["Python","Web Scraping","Text Extraction","NLP","Crawler","Data Mining","Article Extractor","HTML to Text"],"keywords":["Python","Web Scraping","Text Extraction","NLP","Crawler","Data Mining","Article Extractor","HTML to Text"],"stars":null,"summary":"Trafilatura is a robust Python package and command-line tool designed for gathering text and metadata from the web. It simplifies web crawling, scraping, and content extraction, transforming raw HTML into structured data. Widely adopted by major companies and institutions, it offers high efficiency and accuracy for various text processing needs.","content":"## Introduction\n\nTrafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text and metadata from the Web**. It simplifies the process of turning raw HTML into structured, meaningful data, offering essential components for **web crawling, downloads, scraping, and extraction** of main texts, metadata, and comments. 
Recognized for its robustness and speed, Trafilatura is widely used by companies like HuggingFace, IBM, and Microsoft Research, as well as by renowned academic institutions.\n\n## Installation\n\nInstalling Trafilatura is straightforward and can be done using pip:\n\n```bash\npip install trafilatura\n```\n\n## Examples\n\nHere's how you can use Trafilatura to extract text from a URL or an HTML string:\n\n```python\nimport trafilatura\nfrom trafilatura.downloads import fetch_url\n\n# Example 1: Extract from a URL\nurl = \"https://adrien.barbaresi.eu/blog/trafilatura-web-scraping.html\"\nprint(f\"Extracting from: {url}\")\ndownloaded = fetch_url(url)  # None if the download fails\n# extract() returns None when no main text is found, so guard before slicing\ntext = trafilatura.extract(downloaded) if downloaded else None\nif text:\n    print(\"--- Extracted Content (first 500 characters) ---\")\n    print(text[:500])\n    print(\"----------------------------------------------------\")\nelse:\n    print(\"Download failed or no text could be extracted.\")\n\n# Example 2: Extract from an HTML string\nhtml_content = \"\"\"\n<html>\n    <head><title>My Test Page</title></head>\n    <body>\n        <h1>Main Title</h1>\n        <p>This is an example paragraph with <b>bold text</b>.</p>\n        <ul>\n            <li>Item 1</li>\n            <li>Item 2</li>\n        </ul>\n    </body>\n</html>\n\"\"\"\nprint(\"\\nExtracting from an HTML string:\")\n# May be None for very short documents like this one\ntext_from_html = trafilatura.extract(html_content)\nprint(\"--- Extracted Content ---\")\nprint(text_from_html)\nprint(\"-------------------------\")\n```\n\n## Why Use Trafilatura?\n\nTrafilatura stands out for several reasons, making it an excellent choice for your web content extraction needs:\n\n*   **Efficiency and Accuracy**: It consistently outperforms other open-source libraries in text extraction benchmarks, balancing noise limitation with the inclusion of all valid parts.\n*   **Comprehensive Features**: It offers advanced web crawling (sitemaps and feeds support), parallel processing, and robust extraction of main text, metadata (title, author, date), and formatting (paragraphs, titles, lists).\n*   **Multiple Output Formats**: It supports TXT, Markdown, CSV, JSON, HTML, XML, and TEI, providing flexibility for various applications.\n*   **Modularity and Ease of Use**: No database is required, and it's designed to be handy and modular, facilitating integration into your projects.\n*   **Active Maintenance and Community Support**: It receives regular updates, feature additions, and optimizations, backed by comprehensive documentation and an active community.\n*   **Focus on Content Quality**: It helps focus on the actual content, avoiding noise from recurring elements like headers and footers, and making sense of data and metadata.\n\n## Links\n\n*   **GitHub Repository**: [https://github.com/adbar/trafilatura](https://github.com/adbar/trafilatura){:target=\"_blank\"}\n*   **Official Documentation**: [https://trafilatura.readthedocs.io/](https://trafilatura.readthedocs.io/){:target=\"_blank\"}\n*   **PyPI Package**: [https://pypi.org/project/trafilatura/](https://pypi.org/project/trafilatura/){:target=\"_blank\"}","metrics":{"detailViews":1,"githubClicks":6},"dates":{"published":null,"modified":"2026-05-01T16:33:30.000Z"}}