Scraperr: A Powerful Self-Hosted Web Scraping Solution

Summary

Scraperr is a self-hosted web scraper that lets users extract data from websites without writing code. It offers XPath-based extraction, queue management for scraping jobs, domain spidering, and export of results to markdown or CSV.

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

Scraperr is an open-source, self-hosted web scraper designed to simplify data extraction from websites. It eliminates the need for coding, offering an intuitive interface to define and manage scraping jobs. Built with modern technologies like TypeScript, FastAPI, Next.js, and MongoDB, Scraperr provides a robust and scalable solution for various web scraping needs. Key features include precise XPath-based element targeting, queue management for multiple jobs, domain spidering, custom headers, media downloads, and structured results visualization.

Installation

Getting Scraperr up and running is straightforward, with primary deployment options via Docker and Helm.

Docker

For a quick setup using Docker, navigate to the project directory and run the following command:

make up

This command starts the services Scraperr needs to run.

Helm

For Kubernetes deployments, Scraperr provides Helm charts. Refer to the official documentation for detailed Helm deployment instructions.

Examples

Scraperr empowers users to scrape websites without writing any code. Once deployed, you can access its web interface to configure scraping tasks. Users can define scraping jobs by specifying URLs and using XPath expressions to precisely target and extract desired data elements. The tool supports advanced features like scraping all pages within the same domain (domain spidering) and automatically downloading images, videos, and other media linked on the pages. After a job completes, the scraped data is presented in a structured table format within the interface, ready for review and export in markdown or CSV formats.
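
To make the idea of XPath targeting and tabular export concrete, here is a minimal Python sketch. It is not Scraperr's implementation and does not use any Scraperr API: the page fragment, the field names, and the helper functions are all hypothetical, and it relies only on the limited XPath subset in Python's standard library, so the input must be well-formed markup.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a scraped page.
PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <span class="price">24.50</span>
    </div>
  </body>
</html>
"""

FIELDS = ["name", "price"]  # illustrative column names, not from Scraperr


def extract_rows(page: str) -> list[dict[str, str]]:
    """Select each product block with an XPath expression and read its fields."""
    root = ET.fromstring(page)
    rows = []
    # ".//div[@class='product']" matches every <div class="product"> in the tree.
    for product in root.findall(".//div[@class='product']"):
        rows.append({
            "name": product.findtext("h2", default=""),
            "price": product.findtext("span[@class='price']", default=""),
        })
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_markdown(rows: list[dict[str, str]]) -> str:
    """Render the rows as a simple markdown table."""
    lines = ["| name | price |", "| --- | --- |"]
    lines += [f"| {row['name']} | {row['price']} |" for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    rows = extract_rows(PAGE)
    print(to_csv(rows))
    print(to_markdown(rows))
```

In the Scraperr interface the equivalent step is entering the XPath expression for each element you want. The sketch only shows what such an expression selects and how the selected values map onto rows and columns.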

Why Use Scraperr?

Scraperr is worth considering for web scraping for the following reasons:

No-Code Scraping: Extract data efficiently without writing a single line of code, making it accessible to a broader audience.
Self-Hosted Control: Keep your scraping infrastructure and data under your own control, which helps with privacy and compliance requirements.
Powerful Features: Benefit from XPath-based extraction, queue management, domain spidering, custom headers, and media downloads.
Data Visualization & Export: Easily view scraped data in a structured table and export it in convenient formats like markdown and CSV.
Ethical Guidelines: The project emphasizes responsible scraping practices, encouraging users to respect robots.txt, terms of service, and rate limiting.
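
The responsible scraping practices named in the last item (honoring robots.txt and rate limiting) can be illustrated with a short Python sketch. This is an assumption-laden example, not how Scraperr enforces these rules: the robots.txt content, the "example-bot" user agent, the URLs, and the helper names are hypothetical, and only Python's standard urllib.robotparser module is used.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="example-bot"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent="example-bot", default=1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(parser, urls, fetch, user_agent="example-bot"):
    """Call fetch(url) for each permitted URL, pausing between requests."""
    delay = polite_delay(parser, user_agent)
    results = []
    for index, url in enumerate(allowed_urls(parser, urls, user_agent)):
        if index > 0:
            time.sleep(delay)  # rate limiting between consecutive requests
        results.append(fetch(url))
    return results
```

Here fetch is any caller-supplied function that retrieves a page, so the filtering and pacing logic can be tried out without making network requests.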