# DeepScrape: Intelligent Web Scraping & LLM-Powered Data Extraction

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Repository profile: https://osrepos.com/repo/stretchcloud-deepscrape

DeepScrape is an AI-powered web scraping tool designed for intelligent data extraction using LLMs. It leverages Playwright for browser automation and supports both cloud (OpenAI) and local LLMs (Ollama, vLLM) for transforming web content into structured JSON. This versatile tool is ideal for modern web applications, RAG pipelines, and various data workflows, offering privacy-first data processing.

GitHub: https://github.com/stretchcloud/deepscrape

## Topics

- web scraping
- LLM
- AI
- data extraction
- TypeScript
- Playwright
- automation
- API

## Repository Information

Last analyzed by OSRepos: Fri Dec 19 2025 20:01:00 GMT+0000 (Western European Standard Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Introduction

DeepScrape is an AI-powered web scraping tool that extracts structured data from web pages. It uses Playwright for browser automation and passes the rendered content to a Large Language Model (LLM), which converts it into structured JSON. Typical uses include feeding modern web applications, RAG pipelines, and other data workflows. It supports cloud LLMs such as OpenAI as well as local models served through Ollama or vLLM, so the LLM step can run entirely on-premises.

Key features include:
*   **LLM Extraction**: Convert web content to structured JSON using AI.
*   **Batch Processing**: Efficiently handle multiple URLs with controlled concurrency.
*   **API-first**: Easy integration via REST endpoints with API key security and Swagger documentation.
*   **Browser Automation**: Full Playwright support, including stealth mode.
*   **Local LLM Support**: Run LLM inference on your own hardware so that scraped content is not sent to a third-party API.
*   **Web Crawling**: Multi-page crawling with configurable strategies.

## Installation

Getting started with DeepScrape is straightforward. Follow these steps to set up the project locally:

1.  **Clone the repository:**
    ```bash
    git clone https://github.com/stretchcloud/deepscrape.git
    cd deepscrape
    ```

2.  **Install dependencies:**
    ```bash
    npm install
    ```

3.  **Configure environment variables:**
    Copy the example environment file and edit it with your settings. You can choose between OpenAI or a local LLM provider like Ollama.
    ```bash
    cp .env.example .env
    ```

    Edit `.env` (example for OpenAI and Ollama):

    ```env
    API_KEY=your-secret-key

    # Option 1: Use OpenAI (cloud)
    LLM_PROVIDER=openai
    OPENAI_API_KEY=your-openai-key

    # Option 2: Use local model (e.g., Ollama)
    # LLM_PROVIDER=ollama
    # LLM_MODEL=llama3:latest

    REDIS_HOST=localhost
    CACHE_ENABLED=true
    ```

4.  **Start the server:**
    ```bash
    npm run dev
    ```
    You can test if the server is running by visiting `http://localhost:3000/health` or using `curl http://localhost:3000/health`.

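The `.env` choices in step 3 can be sanity-checked before the server starts. The following TypeScript sketch is not part of DeepScrape: the function name `checkEnv` and the rules it applies (for example, that `OPENAI_API_KEY` must be present when `LLM_PROVIDER=openai`) are assumptions inferred from the example file above.

```typescript
// Hypothetical pre-flight check for the .env values shown in step 3.
// The rules are assumptions inferred from the example file, not DeepScrape's own validation.
type Env = Record<string, string | undefined>;

function checkEnv(env: Env): string[] {
  const problems: string[] = [];

  // The API examples authenticate with X-API-Key, so API_KEY is assumed to be mandatory.
  if (!env["API_KEY"]) problems.push("API_KEY is not set");

  const provider = env["LLM_PROVIDER"];
  if (provider === "openai") {
    if (!env["OPENAI_API_KEY"]) {
      problems.push("OPENAI_API_KEY is required when LLM_PROVIDER=openai");
    }
  } else if (provider === "ollama") {
    if (!env["LLM_MODEL"]) {
      problems.push("LLM_MODEL is required when LLM_PROVIDER=ollama");
    }
  } else {
    // Other providers (e.g. vLLM) may be valid; this sketch only knows the two shown above.
    problems.push(`LLM_PROVIDER is missing or not covered by this sketch: ${String(provider)}`);
  }
  return problems;
}
```

In a Node.js script, calling `checkEnv(process.env)` and printing the returned list would be one way to use it; an empty list means the sketch found nothing to flag.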
## Examples

DeepScrape provides a powerful API for various scraping and extraction tasks. Here are some common use cases:

### Basic Scraping

Scrape content from a single URL and output it as Markdown:

```bash
curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://example.com",
    "options": { "extractorFormat": "markdown" }
  }' | jq -r '.content' > content.md
```

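The same request can be issued from TypeScript. The sketch below mirrors the curl call above: the endpoint, headers, body, and the `content` response field come from that example, while `buildScrapeCall`, `scrapeMarkdown`, and the `Transport` type are hypothetical helper names introduced here. The HTTP function is passed in as a parameter so the code does not assume a particular client library.

```typescript
// Minimal shape of an HTTP response and sender, so no specific HTTP library is assumed.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type Transport = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpResponse>;

interface ScrapeRequest {
  url: string;
  options?: { extractorFormat?: string };
}

// Builds the same request that the curl example sends.
function buildScrapeCall(baseUrl: string, apiKey: string, req: ScrapeRequest) {
  return {
    url: `${baseUrl}/api/scrape`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-API-Key": apiKey },
      body: JSON.stringify(req),
    },
  };
}

// Sends the request and returns the `content` field (the one the curl example reads with jq).
async function scrapeMarkdown(
  send: Transport,
  baseUrl: string,
  apiKey: string,
  pageUrl: string,
): Promise<string> {
  const call = buildScrapeCall(baseUrl, apiKey, {
    url: pageUrl,
    options: { extractorFormat: "markdown" },
  });
  const res = await send(call.url, call.init);
  if (!res.ok) throw new Error(`scrape failed with HTTP ${res.status}`);
  const data = (await res.json()) as { content?: unknown };
  if (typeof data.content !== "string") {
    throw new Error("response has no string `content` field");
  }
  return data.content;
}
```

In Node 18+ or a browser, the built-in `fetch` should be usable as the `send` argument, for example `scrapeMarkdown(fetch, "http://localhost:3000", "your-secret-key", "https://example.com")`.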

### Schema-Based Extraction

Extract structured data from a URL using a predefined JSON Schema:

```bash
curl -X POST http://localhost:3000/api/extract-schema \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://news.example.com/article",
    "schema": {
      "type": "object",
      "properties": {
        "title": {
          "type": "string",
          "description": "Article headline"
        },
        "author": {
          "type": "string",
          "description": "Author name"
        },
        "publishDate": {
          "type": "string",
          "description": "Publication date"
        }
      },
      "required": ["title"]
    }
  }' | jq -r '.extractedData' > schemadata.md
```

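Because the extraction is produced by an LLM, a caller may want to confirm that the returned `extractedData` actually satisfies the schema it sent. The TypeScript sketch below performs a deliberately small check (required fields present, string fields are strings) against the same schema as the curl example. `ObjectSchema` and `findSchemaProblems` are hypothetical names, and this is not a full JSON Schema validator; a dedicated validation library would be the safer choice for anything beyond flat objects.

```typescript
// A reduced view of JSON Schema covering only what the example above uses.
interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: string; description?: string }>;
  required?: string[];
}

// Same schema as in the curl example.
const articleSchema: ObjectSchema = {
  type: "object",
  properties: {
    title: { type: "string", description: "Article headline" },
    author: { type: "string", description: "Author name" },
    publishDate: { type: "string", description: "Publication date" },
  },
  required: ["title"],
};

// Returns human-readable problems; an empty array means the partial check passed.
function findSchemaProblems(schema: ObjectSchema, data: unknown): string[] {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["extracted data is not an object"];
  }
  const record = data as Record<string, unknown>;
  const problems: string[] = [];

  // Every name listed under "required" must be present.
  for (const name of schema.required ?? []) {
    if (!(name in record)) problems.push(`missing required field: ${name}`);
  }
  // Fields declared as strings must be strings when they are present.
  for (const [name, spec] of Object.entries(schema.properties)) {
    const value = record[name];
    if (value !== undefined && spec.type === "string" && typeof value !== "string") {
      problems.push(`field ${name} should be a string`);
    }
  }
  return problems;
}
```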

### Summarize URL Content

Generate a concise summary of a URL's content using an LLM:

```bash
curl -X POST http://localhost:3000/api/summarize \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Large_language_model",
    "maxLength": 300,
    "options": {
      "temperature": 0.3,
      "waitForSelector": "body",
      "extractorFormat": "markdown"
    }
  }' | jq -r '.summary' > summary-output.md
```


### Batch Processing

Process multiple URLs efficiently with controlled concurrency and download results as a ZIP archive:

1.  **Start Batch Processing:**
    ```bash
    curl -X POST http://localhost:3000/api/batch/scrape \
      -H "Content-Type: application/json" \
      -H "X-API-Key: your-secret-key" \
      -d '{
        "urls": [
          "https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart",
          "https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/deploy-vais-prompt"
        ],
        "concurrency": 2,
        "options": {
          "extractorFormat": "markdown"
        }
      }'
    ```
    (Note: The response will include a `batchId` and `statusUrl`.)

2.  **Monitor Batch Progress:**
    ```bash
    # Replace {batchId} with the batchId returned in step 1.
    curl -X GET http://localhost:3000/api/batch/scrape/{batchId}/status \
      -H "X-API-Key: your-secret-key"
    ```

3.  **Download Results as ZIP Archive:**
    ```bash
    # Replace {batchId} with the batchId returned in step 1.
    curl -X GET "http://localhost:3000/api/batch/scrape/{batchId}/download/zip?format=markdown" \
      -H "X-API-Key: your-secret-key" \
      --output "batch_results.zip"
    ```
    
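The `concurrency` value in the batch request caps how many URLs are processed at the same time. The TypeScript sketch below illustrates that general idea with a small worker-lane pattern. It is not DeepScrape's implementation (the project is described as using BullMQ/Redis job queues), and `mapWithConcurrency` is a hypothetical helper name.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order, regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item, shared by all lanes

  // Each lane claims one item at a time until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  // Start up to `limit` lanes; each lane has at most one worker call pending.
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}
```

With `limit` set to 2 and a worker that scrapes one URL, no more than two scrapes would be pending at any moment. If a worker rejects, the returned promise rejects as well, so a real batch runner would likely catch errors per item and record them instead.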

### Web Crawling

Initiate a multi-page crawl with configurable limits and strategies:

```bash
curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 10,
    "maxDepth": 2,
    "strategy": "bfs",
    "includePaths": ["^/docs/.*"],
    "scrapeOptions": {
      "extractorFormat": "markdown"
    }
  }'
```

(The response will include an `id` and `outputDirectory`.)

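To make the crawl parameters concrete, the TypeScript sketch below shows how a breadth-first (`bfs`) traversal could combine `limit`, `maxDepth`, and `includePaths`. It is an illustration under stated assumptions, not DeepScrape's crawler: it assumes the start URL counts as depth 0 and is always visited, that `includePaths` entries are regular expressions tested against the URL path, and that links are supplied by a caller-provided `linksOf` function. `CrawlPlan` and `planCrawl` are hypothetical names.

```typescript
interface CrawlPlan {
  limit: number;          // maximum number of pages to visit
  maxDepth: number;       // maximum link distance from the start URL
  includePaths: string[]; // regular expressions matched against the URL path
}

// Returns the URLs in the order a breadth-first crawl would visit them.
function planCrawl(
  start: string,
  linksOf: (url: string) => string[],
  plan: CrawlPlan,
): string[] {
  const patterns = plan.includePaths.map((p) => new RegExp(p));
  const allowed = (url: string): boolean => {
    const path = new URL(url).pathname;
    return patterns.length === 0 || patterns.some((re) => re.test(path));
  };

  const visited: string[] = [];
  const seen = new Set<string>([start]);
  let frontier: string[] = [start];

  // Process one depth level at a time: that is what makes the traversal breadth-first.
  for (let depth = 0; depth <= plan.maxDepth && frontier.length > 0; depth++) {
    const nextFrontier: string[] = [];
    for (const url of frontier) {
      if (visited.length >= plan.limit) return visited;
      visited.push(url);
      for (const link of linksOf(url)) {
        if (!seen.has(link) && allowed(link)) {
          seen.add(link);
          nextFrontier.push(link);
        }
      }
    }
    frontier = nextFrontier;
  }
  return visited;
}
```

Under these assumptions, a page linked three hops away from the start URL is never visited when `maxDepth` is 2, and links outside `/docs/` are dropped before they enter the queue.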
## Why Use DeepScrape?

DeepScrape stands out as a comprehensive and flexible solution for web data extraction, offering several compelling advantages:

*   **AI-Powered Precision**: Leverage the power of LLMs for intelligent content understanding and structured data extraction, going beyond traditional CSS selectors.
*   **Flexible LLM Backends**: Choose between cloud LLMs like OpenAI and local models served through Ollama, vLLM, or LocalAI when data must stay on your own hardware.
*   **Scalability and Efficiency**: With built-in batch processing, job queues (BullMQ/Redis), and web crawling capabilities, DeepScrape can handle large-scale data collection efficiently.
*   **Full Browser Automation**: Utilize Playwright's capabilities for interacting with dynamic, JavaScript-heavy websites, including actions like clicks, scrolls, and waits.
*   **Privacy-First Design**: With a local LLM, page content is processed on your own infrastructure rather than sent to a third-party API, which can simplify compliance with regulations such as GDPR or HIPAA.
*   **Developer-Friendly API**: An API-first design with clear REST endpoints and Swagger documentation makes integration into existing workflows seamless.
*   **Docker Ready**: Deploy DeepScrape quickly and easily using Docker or `docker-compose`.

## Links

*   **GitHub Repository**: <https://github.com/stretchcloud/deepscrape>
*   **License**: <https://github.com/stretchcloud/deepscrape/blob/main/LICENSE>