Trafilatura: Advanced Web Scraping and Text Extraction in Python

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

Trafilatura is a robust Python package and command-line tool designed for gathering text and metadata from the web. It simplifies web crawling, scraping, and content extraction, transforming raw HTML into structured data. Widely adopted by major companies and institutions, it offers high efficiency and accuracy for various text processing needs.

Repository Information

Analyzed by OSRepos on May 1, 2026

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

Trafilatura is a Python package and command-line tool for gathering text and metadata from the web. It turns raw HTML into structured data and provides the components needed for web crawling, downloading, scraping, and extracting main text, metadata, and comments. Known for its robustness and speed, Trafilatura is used by organizations such as HuggingFace, IBM, and Microsoft Research, as well as by academic institutions.

Installation

Installing Trafilatura is straightforward and can be done using pip:

pip install trafilatura

Examples

Here's how you can use Trafilatura to extract text from a URL or an HTML string:

import trafilatura
from trafilatura.downloads import fetch_url

# Example 1: Extract from a URL
url = "https://adrien.barbaresi.eu/blog/trafilatura-web-scraping.html"
print(f"Extracting from: {url}")
downloaded = fetch_url(url)
if downloaded:
    text = trafilatura.extract(downloaded)
    # extract() returns None when no main content is found
    if text:
        print("--- Extracted Content (first 500 characters) ---")
        print(text[:500])
        print("----------------------------------------------------")

# Example 2: Extract from an HTML string
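html_content = """
<html>
<head>
    <title>My Test Page</title>
</head>
<body>
    <h1>Main Title</h1>
    <p>This is an example paragraph with <b>bold text</b>.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</body>
</html>
"""
print("\nExtracting from an HTML string:")
# The result may be None if no main content is detected
text_from_html = trafilatura.extract(html_content)
print("--- Extracted Content ---")
print(text_from_html)
print("-------------------------")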

Why Use Trafilatura?

Trafilatura stands out for several reasons, making it an excellent choice for your web content extraction needs:

  • Efficiency and Accuracy: It consistently outperforms other open-source libraries in text extraction benchmarks, balancing noise limitation with the inclusion of all valid parts.
  • Comprehensive Features: It offers advanced web crawling (sitemaps and feeds support), parallel processing, and robust extraction of main text, metadata (title, author, date), and formatting (paragraphs, titles, lists).
  • Multiple Output Formats: It supports TXT, Markdown, CSV, JSON, HTML, XML, and TEI, providing flexibility for various applications.
  • Modularity and Ease of Use: No database is required, and it's designed to be handy and modular, facilitating integration into your projects.
  • Active Maintenance and Community Support: It receives regular updates, feature additions, and optimizations, backed by comprehensive documentation and an active community.
  • Focus on Content Quality: It helps focus on the actual content, avoiding noise from recurring elements like headers and footers, and making sense of data and metadata.
