Repository History

Explore all analyzed open source repositories

Topic: data-extraction
Awesome-crawler: A Curated List of Web Crawlers and Spiders


Awesome-crawler is an extensive GitHub repository that curates web crawling and scraping tools across many programming languages. It gives developers a broad overview of popular frameworks and libraries for extracting data from the web, making it easier to pick the right tool for a given scraping project.

Mar 1, 2026
Scraperr: A Powerful Self-Hosted Web Scraping Solution


Scraperr is a self-hosted web scraping solution that lets users extract data from websites without writing any code. It features XPath-based extraction, queue management, domain spidering, and multiple data export options, providing a complete platform for efficient, controlled web data collection.
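Scraperr's extraction rules are configured through its UI rather than code, but the XPath idea behind them can be sketched in a few lines. Everything below is illustrative: the HTML snippet and the selector are hypothetical, and Python's standard-library `xml.etree.ElementTree` supports only a subset of XPath, which is enough to show what a rule like `.//p[@class='price']` selects.

```python
import xml.etree.ElementTree as ET

# Hypothetical page snippet standing in for a scraped product page.
HTML = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <p class="price">19.99</p>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <p class="price">24.50</p>
    </div>
  </body>
</html>
"""

def extract_prices(document: str) -> list[str]:
    """Apply an XPath-style selector to pull every product's price."""
    root = ET.fromstring(document)
    # ElementTree supports a limited XPath subset: descendant search
    # plus attribute predicates, which covers this simple rule.
    return [node.text for node in root.findall(".//p[@class='price']")]

print(extract_prices(HTML))  # ['19.99', '24.50']
```

In Scraperr the equivalent selector would be entered in the job configuration; a full XPath engine (as used by real scrapers) also supports axes, functions, and text predicates that `ElementTree` does not.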

Feb 16, 2026
pdfplumber: Extracting Data from PDFs with Ease and Precision


pdfplumber is a Python library for extracting detailed information from PDFs, including characters, rectangles, and lines. It makes pulling out text and tables straightforward, which makes it a valuable tool for data analysis and automation. Built on pdfminer.six, it provides robust PDF parsing capabilities.
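The objects mentioned above map directly onto pdfplumber's page API. The sketch below shows how they fit together; `report.pdf` is a hypothetical file name, and pdfplumber itself is a third-party dependency (`pip install pdfplumber`) that must be installed for this to run.

```python
def summarize_pdf(path: str) -> dict:
    """Collect the per-page objects pdfplumber exposes for one PDF."""
    import pdfplumber  # third-party library, built on pdfminer.six

    with pdfplumber.open(path) as pdf:
        first = pdf.pages[0]
        return {
            "pages": len(pdf.pages),
            "text": first.extract_text(),      # plain text of page 1
            "tables": first.extract_tables(),  # each table as rows of cell strings
            "chars": len(first.chars),         # low-level character objects
            "rects": len(first.rects),         # rectangle objects (e.g. cell borders)
            "lines": len(first.lines),         # line objects
        }
```

Calling `summarize_pdf("report.pdf")` on a real file returns a dictionary of these extracted pieces; `extract_tables()` is the usual starting point when the goal is tabular data.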

Jan 24, 2026
brightdata-mcp: Empowering AI with Real-time Web Access and Data Scraping


brightdata-mcp is a Model Context Protocol (MCP) server developed by Bright Data, designed to give AI agents real-time web access. It provides an all-in-one solution for public web interaction, letting Large Language Models (LLMs) reach live information without being stopped by blocks or CAPTCHAs. The open-source project offers web scraping, browser automation, and data extraction capabilities.

Jan 22, 2026
AnyCrawl: A High-Performance Node.js/TypeScript Web Crawler for LLM Data


AnyCrawl is a powerful Node.js/TypeScript web crawler designed to transform websites into LLM-ready data. It excels at extracting structured SERP results from various search engines and features native multi-threading for efficient bulk processing, making it ideal for large-scale data collection.
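AnyCrawl itself is a Node.js/TypeScript project, so the sketch below only illustrates the bulk-processing pattern the summary describes: a pool of workers draining a list of URLs concurrently. The `fetch_page` stub stands in for a real HTTP fetch, and the URLs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url: str) -> dict:
    """Stub fetcher; a real crawler would issue an HTTP request here."""
    return {"url": url, "status": "ok"}

def crawl_bulk(urls: list[str], workers: int = 4) -> list[dict]:
    """Process many URLs concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_page, urls))

results = crawl_bulk([f"https://example.com/page/{i}" for i in range(8)])
print(len(results))  # 8
```

Node.js achieves the same throughput with worker threads or simply with concurrent async I/O; the key design point either way is bounding the number of in-flight requests so bulk crawls stay controlled.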

Oct 12, 2025