DeepScrape: Intelligent Web Scraping & LLM-Powered Data Extraction

Summary

DeepScrape is an AI-powered web scraping tool designed for intelligent data extraction using LLMs. It leverages Playwright for browser automation and supports both cloud (OpenAI) and local LLMs (Ollama, vLLM) for transforming web content into structured JSON. This versatile tool is ideal for modern web applications, RAG pipelines, and various data workflows, offering privacy-first data processing.

Repository Info

Updated on December 19, 2025

Introduction

DeepScrape is an innovative, AI-powered web scraping solution that intelligently extracts structured data from any website. Built with Playwright for robust browser automation, it seamlessly integrates with Large Language Models (LLMs) to transform raw web content into actionable JSON. Whether you need to power modern web applications, enhance RAG pipelines, or streamline data workflows, DeepScrape offers a flexible and privacy-conscious approach. It supports both cloud-based LLMs like OpenAI and local models such as Ollama and vLLM, ensuring your data processing can remain entirely on-premises.

Key features include:

  • LLM Extraction: Convert web content to structured JSON using AI.
  • Batch Processing: Efficiently handle multiple URLs with controlled concurrency.
  • API-first: Easy integration via REST endpoints with API key security and Swagger documentation.
  • Browser Automation: Full Playwright support, including stealth mode.
  • Local LLM Support: Run entirely offline for complete data privacy.
  • Web Crawling: Multi-page crawling with configurable strategies.

Installation

Getting started with DeepScrape is straightforward. Follow these steps to set up the project locally:

  1. Clone the repository:
    git clone https://github.com/stretchcloud/deepscrape.git
    cd deepscrape
    
  2. Install dependencies:
    npm install
    
  3. Configure environment variables:

    Copy the example environment file and edit it with your settings. You can choose between OpenAI or a local LLM provider like Ollama.

    cp .env.example .env
    

    Edit .env (example for OpenAI and Ollama):

    API_KEY=your-secret-key
    
    # Option 1: Use OpenAI (cloud)
    LLM_PROVIDER=openai
    OPENAI_API_KEY=your-openai-key
    
    # Option 2: Use local model (e.g., Ollama)
    # LLM_PROVIDER=ollama
    # LLM_MODEL=llama3:latest
    
    REDIS_HOST=localhost
    CACHE_ENABLED=true
    
  4. Start the server:
    npm run dev
    

    You can verify the server is running by visiting http://localhost:3000/health in a browser or with curl http://localhost:3000/health.
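
The health check can also be scripted as a simple probe, handy in CI or startup scripts. A minimal sketch, assuming the default port 3000 from the steps above:

```shell
# Probe the health endpoint; -f makes curl treat HTTP error codes as failures.
if curl -sf http://localhost:3000/health > /dev/null; then
  echo "server is up"
else
  echo "server is not responding"
fi
```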

Examples

DeepScrape provides a powerful API for various scraping and extraction tasks. Here are some common use cases:

Basic Scraping

Scrape content from a single URL and output it as Markdown:

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://example.com",
    "options": { "extractorFormat": "markdown" }
  }' | jq -r '.content' > content.md

Schema-Based Extraction

Extract structured data from a URL using a predefined JSON Schema:

curl -X POST http://localhost:3000/api/extract-schema \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://news.example.com/article",
    "schema": {
      "type": "object",
      "properties": {
        "title": {
          "type": "string",
          "description": "Article headline"
        },
        "author": {
          "type": "string",
          "description": "Author name"
        },
        "publishDate": {
          "type": "string",
          "description": "Publication date"
        }
      },
      "required": ["title"]
    }
  }' | jq -r '.extractedData' > schemadata.md

Summarize URL Content

Generate a concise summary of a URL's content using an LLM:

curl -X POST http://localhost:3000/api/summarize \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Large_language_model",
    "maxLength": 300,
    "options": {
      "temperature": 0.3,
      "waitForSelector": "body",
      "extractorFormat": "markdown"
    }
  }' | jq -r '.summary' > summary-output.md

Batch Processing

Process multiple URLs efficiently with controlled concurrency and download results as a ZIP archive:

  1. Start Batch Processing:
    curl -X POST http://localhost:3000/api/batch/scrape \
      -H "Content-Type: application/json" \
      -H "X-API-Key: your-secret-key" \
      -d '{
            "urls": [
              "https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart",
              "https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/deploy-vais-prompt"
            ],
            "concurrency": 2,
            "options": {
              "extractorFormat": "markdown"
            }
          }'
    

    (Note: The response will include a batchId and statusUrl.)

  2. Monitor Batch Progress:
    curl -X GET http://localhost:3000/api/batch/scrape/{batchId}/status \
      -H "X-API-Key: your-secret-key"
    
  3. Download Results as ZIP Archive:
    curl -X GET "http://localhost:3000/api/batch/scrape/{batchId}/download/zip?format=markdown" \
      -H "X-API-Key: your-secret-key" \
      --output "batch_results.zip"
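
Between starting a batch and downloading the archive, the status endpoint can be polled in a loop. A minimal sketch, assuming the status payload exposes a top-level status field with a terminal value of "completed" (verify both names against the actual response in your deployment):

```shell
# Hypothetical polling loop; BATCH_ID comes from the start-batch response.
BATCH_ID="your-batch-id"
for attempt in 1 2 3 4 5; do
  STATUS=$(curl -s "http://localhost:3000/api/batch/scrape/${BATCH_ID}/status" \
    -H "X-API-Key: your-secret-key" | jq -r '.status' 2> /dev/null)
  echo "attempt ${attempt}: status=${STATUS:-unknown}"
  # "completed" is an assumed terminal value; adjust to the real one.
  [ "$STATUS" = "completed" ] && break
  sleep 2
done
```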
    

Web Crawling

Initiate a multi-page crawl with configurable limits and strategies:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 10,
    "maxDepth": 2,
    "strategy": "bfs",
    "includePaths": ["^/docs/.*"],
    "scrapeOptions": {
      "extractorFormat": "markdown"
    }
  }'

(The response will include an id and outputDirectory.)
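
Since the response returns an id, a small follow-on sketch can capture it in a shell variable for later scripting (the ".id" field follows the note above; the payload here is trimmed for brevity):

```shell
# Start a crawl and keep the returned id for subsequent commands.
CRAWL_ID=$(curl -s -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{"url": "https://docs.example.com", "limit": 10, "maxDepth": 2}' \
  | jq -r '.id // empty')
echo "crawl id: ${CRAWL_ID:-none}"
```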

Why Use DeepScrape?

DeepScrape stands out as a comprehensive and flexible solution for web data extraction, offering several compelling advantages:

  • AI-Powered Precision: Leverage the power of LLMs for intelligent content understanding and structured data extraction, going beyond traditional CSS selectors.
  • Unmatched Flexibility: Choose between powerful cloud LLMs like OpenAI or maintain complete data privacy with local models such as Ollama, vLLM, and LocalAI.
  • Scalability and Efficiency: With built-in batch processing, job queues (BullMQ/Redis), and web crawling capabilities, DeepScrape can handle large-scale data collection efficiently.
  • Full Browser Automation: Utilize Playwright's capabilities for interacting with dynamic, JavaScript-heavy websites, including actions like clicks, scrolls, and waits.
  • Privacy-First Design: Process sensitive data entirely on your own infrastructure, ensuring no data leaves your network, which is crucial for compliance with regulations like GDPR or HIPAA.
  • Developer-Friendly API: An API-first design with clear REST endpoints and Swagger documentation makes integration into existing workflows seamless.
  • Docker Ready: Deploy DeepScrape quickly and easily using Docker or docker-compose.
