DeepScrape: Intelligent Web Scraping & LLM-Powered Data Extraction
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
DeepScrape is a web scraping tool that uses LLMs to extract structured data. It uses Playwright for browser automation and supports both cloud (OpenAI) and local LLMs (Ollama, vLLM) for transforming web content into structured JSON. It targets web applications, RAG pipelines, and other data workflows, and with a local LLM the extraction step can stay on your own infrastructure.
Introduction
DeepScrape is an AI-powered web scraping tool that extracts structured data from websites. Built on Playwright for browser automation, it integrates with Large Language Models (LLMs) to transform raw web content into structured JSON. It is aimed at web applications, RAG pipelines, and data workflows. It supports cloud-based LLMs such as OpenAI as well as local models such as Ollama and vLLM, so LLM processing can remain on-premises.
Key features include:
- LLM Extraction: Convert web content to structured JSON using AI.
- Batch Processing: Efficiently handle multiple URLs with controlled concurrency.
- API-first: Easy integration via REST endpoints with API key security and Swagger documentation.
- Browser Automation: Full Playwright support, including stealth mode.
- Local LLM Support: Run extraction on local models so scraped content is not sent to a third-party API.
- Web Crawling: Multi-page crawling with configurable strategies.
Installation
Getting started with DeepScrape is straightforward. Follow these steps to set up the project locally:
- Clone the repository:
git clone https://github.com/stretchcloud/deepscrape.git
cd deepscrape
- Install dependencies:
npm install
- Configure environment variables:
Copy the example environment file and edit it with your settings. You can choose between OpenAI or a local LLM provider like Ollama.
cp .env.example .env
Edit .env (example for OpenAI and Ollama):
API_KEY=your-secret-key
# Option 1: Use OpenAI (cloud)
LLM_PROVIDER=openai
OPENAI_API_KEY=your-openai-key
# Option 2: Use local model (e.g., Ollama)
# LLM_PROVIDER=ollama
# LLM_MODEL=llama3:latest
REDIS_HOST=localhost
CACHE_ENABLED=true
- Start the server:
npm run dev
You can test if the server is running by visiting http://localhost:3000/health or using curl http://localhost:3000/health.
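Before starting the server, it can help to sanity-check the .env file. The sketch below is a hypothetical helper and not part of DeepScrape: it parses KEY=value lines and flags obvious gaps. The rule that OPENAI_API_KEY must be set when LLM_PROVIDER=openai is an assumption inferred from the example settings, not a documented requirement.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and lines starting with #."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_settings(settings: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the basics look fine.

    Assumption: these checks mirror the example .env only. DeepScrape's own
    validation may differ.
    """
    problems = []
    if not settings.get("API_KEY"):
        problems.append("API_KEY is not set")
    provider = settings.get("LLM_PROVIDER")
    if not provider:
        problems.append("LLM_PROVIDER is not set")
    elif provider == "openai" and not settings.get("OPENAI_API_KEY"):
        problems.append("LLM_PROVIDER=openai but OPENAI_API_KEY is not set")
    return problems


# The example settings from the installation steps, one entry per line.
EXAMPLE_ENV = """\
API_KEY=your-secret-key
# Option 1: Use OpenAI (cloud)
LLM_PROVIDER=openai
OPENAI_API_KEY=your-openai-key
# Option 2: Use local model (e.g., Ollama)
# LLM_PROVIDER=ollama
# LLM_MODEL=llama3:latest
REDIS_HOST=localhost
CACHE_ENABLED=true
"""

if __name__ == "__main__":
    print(check_settings(parse_env(EXAMPLE_ENV)) or "ok")
```

Commented-out lines are ignored, so switching providers means uncommenting one option and commenting out the other.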
Examples
DeepScrape provides a powerful API for various scraping and extraction tasks. Here are some common use cases:
Basic Scraping
Scrape content from a single URL and output it as Markdown:
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://example.com",
"options": { "extractorFormat": "markdown" }
}' | jq -r '.content' > content.md
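The same request can be sent from a script. This is a minimal sketch using only the Python standard library; the endpoint, headers, payload, and the `content` response field come from the curl command above, while the function names and the 60-second timeout are illustrative choices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # dev server address used in the curl examples


def build_scrape_request(url: str, api_key: str,
                         extractor_format: str = "markdown") -> urllib.request.Request:
    """Build (but do not send) the POST /api/scrape request."""
    payload = {"url": url, "options": {"extractorFormat": extractor_format}}
    return urllib.request.Request(
        f"{BASE_URL}/api/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def scrape_to_file(url: str, api_key: str, path: str = "content.md") -> None:
    """Send the request and write the response's `content` field to a file."""
    request = build_scrape_request(url, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body["content"])
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.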
Schema-Based Extraction
Extract structured data from a URL using a predefined JSON Schema:
curl -X POST http://localhost:3000/api/extract-schema \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://news.example.com/article",
"schema": {
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Article headline"
},
"author": {
"type": "string",
"description": "Author name"
},
"publishDate": {
"type": "string",
"description": "Publication date"
}
},
"required": ["title"]
}
}' | jq -r '.extractedData' > schemadata.md
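LLM output does not always match the requested schema, so it may be worth checking `extractedData` on the client side. The sketch below is a deliberately small checker covering only `required` and top-level `type`, not a full JSON Schema validator, and it is not how DeepScrape validates results internally. The sample records in the usage lines are made up.

```python
# Maps JSON Schema type names to Python types (top-level properties only).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}

# The schema sent in the request above.
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Article headline"},
        "author": {"type": "string", "description": "Author name"},
        "publishDate": {"type": "string", "description": "Publication date"},
    },
    "required": ["title"],
}


def check_against_schema(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the data passed."""
    problems = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in data
    ]
    for name, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if name not in data or expected is None:
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly
        # unless the schema actually asks for a boolean.
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    print(check_against_schema({"title": "Hello", "author": "A. Writer"}, ARTICLE_SCHEMA))
    print(check_against_schema({"author": 42}, ARTICLE_SCHEMA))
```

For nested schemas or formats such as dates, a dedicated JSON Schema library would be a better fit than this sketch.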
Summarize URL Content
Generate a concise summary of a URL's content using an LLM:
curl -X POST http://localhost:3000/api/summarize \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://en.wikipedia.org/wiki/Large_language_model",
"maxLength": 300,
"options": {
"temperature": 0.3,
"waitForSelector": "body",
"extractorFormat": "markdown"
}
}' | jq -r '.summary' > summary-output.md
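One caveat with the `jq -r` filters used in these examples: when the requested field is absent (for instance, if the server returned an error body), jq prints `null` and the output file silently contains that text. A small sketch of a stricter alternative follows; `pick_field` is a hypothetical helper, and the error body in the usage lines is invented for illustration.

```python
import json


def pick_field(response_text: str, field: str):
    """Return one top-level field of a JSON response, failing loudly if absent."""
    body = json.loads(response_text)
    if not isinstance(body, dict):
        raise KeyError(f"expected a JSON object, got {type(body).__name__}")
    if field not in body:
        raise KeyError(f"response has no '{field}' field; keys present: {sorted(body)}")
    return body[field]


if __name__ == "__main__":
    print(pick_field('{"summary": "Short text."}', "summary"))
    try:
        pick_field('{"error": "hypothetical failure"}', "summary")
    except KeyError as exc:
        print("extraction failed:", exc)
```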
Batch Processing
Process multiple URLs efficiently with controlled concurrency and download results as a ZIP archive:
- Start Batch Processing:
curl -X POST http://localhost:3000/api/batch/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"urls": [
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart",
"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/deploy-vais-prompt"
],
"concurrency": 2,
"options": { "extractorFormat": "markdown" }
}'
(Note: The response will include a batchId and statusUrl.)
- Monitor Batch Progress:
curl -X GET http://localhost:3000/api/batch/scrape/{batchId}/status \
-H "X-API-Key: your-secret-key"
- Download Results as ZIP Archive:
curl -X GET "http://localhost:3000/api/batch/scrape/{batchId}/download/zip?format=markdown" \
-H "X-API-Key: your-secret-key" \
--output "batch_results.zip"
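The monitoring step is usually repeated until the batch finishes, so a polling loop is a natural fit. The sketch below assumes the status response is a JSON object with a `status` field that eventually becomes `"completed"` or `"failed"`; that shape and those values are assumptions, not taken from the DeepScrape documentation, so check a real response and adjust `done_states` accordingly. The function names are illustrative.

```python
import json
import time
import urllib.request
from typing import Callable


def make_status_fetcher(batch_id: str, api_key: str,
                        base_url: str = "http://localhost:3000") -> Callable[[], dict]:
    """Return a function that performs one GET on the batch status endpoint."""
    def fetch() -> dict:
        request = urllib.request.Request(
            f"{base_url}/api/batch/scrape/{batch_id}/status",
            headers={"X-API-Key": api_key},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    return fetch


def wait_for_batch(fetch_status: Callable[[], dict],
                   interval: float = 2.0,
                   max_polls: int = 150,
                   sleep: Callable[[float], None] = time.sleep,
                   done_states: tuple = ("completed", "failed")) -> dict:
    """Poll until the batch reaches a terminal state or max_polls is exhausted.

    Assumption: the status body carries a `status` key whose terminal values
    are listed in done_states.
    """
    for attempt in range(max_polls):
        status = fetch_status()
        if status.get("status") in done_states:
            return status
        if attempt < max_polls - 1:
            sleep(interval)  # pause between polls, but not after the last one
    raise TimeoutError(f"batch not finished after {max_polls} polls")
```

A typical call would be `wait_for_batch(make_status_fetcher(batch_id, "your-secret-key"))`, followed by the ZIP download once the returned status indicates success. Passing `fetch_status` and `sleep` as parameters keeps the loop testable without a running server.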
Web Crawling
Initiate a multi-page crawl with configurable limits and strategies:
curl -X POST http://localhost:3000/api/crawl \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://docs.example.com",
"limit": 10,
"maxDepth": 2,
"strategy": "bfs",
"includePaths": ["^/docs/.*"],
"scrapeOptions": {
"extractorFormat": "markdown"
}
}'
(The response will include an id and outputDirectory.)
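To build intuition for how `limit`, `maxDepth`, `strategy: "bfs"`, and `includePaths` interact, here is a small breadth-first crawl planner over an in-memory link graph. It is a sketch of the general technique, not DeepScrape's implementation. In particular, matching `includePaths` patterns against the URL path, always visiting the start URL, and counting the start page as depth 0 are all assumptions.

```python
import re
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlparse


def plan_crawl(start: str,
               get_links: Callable[[str], Iterable[str]],
               limit: int,
               max_depth: int,
               include_paths: Iterable[str] = ()) -> list[str]:
    """Return the pages a breadth-first crawl would visit, in visit order."""
    patterns = [re.compile(p) for p in include_paths]

    def allowed(url: str) -> bool:
        # Assumption: patterns apply to the path component only.
        path = urlparse(url).path or "/"
        return not patterns or any(p.search(path) for p in patterns)

    seen = {start}
    queue = deque([(start, 0)])  # (url, depth); the start page is depth 0
    visited: list[str] = []
    while queue and len(visited) < limit:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # A made-up site map standing in for real pages.
    site = {
        "https://docs.example.com": [
            "https://docs.example.com/docs/a",
            "https://docs.example.com/blog/x",
            "https://docs.example.com/docs/b",
        ],
        "https://docs.example.com/docs/a": ["https://docs.example.com/docs/a/1"],
        "https://docs.example.com/docs/a/1": ["https://docs.example.com/docs/a/1/deep"],
    }
    for page in plan_crawl("https://docs.example.com", lambda u: site.get(u, []),
                           limit=10, max_depth=2, include_paths=[r"^/docs/.*"]):
        print(page)
```

With breadth-first order, pages closer to the start URL are visited first, so a small `limit` favors broad coverage over deep branches.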
Why Use DeepScrape?
DeepScrape stands out as a comprehensive and flexible solution for web data extraction, offering several compelling advantages:
- AI-Powered Precision: Leverage the power of LLMs for intelligent content understanding and structured data extraction, going beyond traditional CSS selectors.
- Unmatched Flexibility: Choose between powerful cloud LLMs like OpenAI or maintain complete data privacy with local models such as Ollama, vLLM, and LocalAI.
- Scalability and Efficiency: With built-in batch processing, job queues (BullMQ/Redis), and web crawling capabilities, DeepScrape can handle large-scale data collection efficiently.
- Full Browser Automation: Utilize Playwright's capabilities for interacting with dynamic, JavaScript-heavy websites, including actions like clicks, scrolls, and waits.
- Privacy-First Design: With a local LLM, scraped content is processed on your own infrastructure instead of being sent to a third-party API, which can help when working under regulations such as GDPR or HIPAA.
- Developer-Friendly API: An API-first design with clear REST endpoints and Swagger documentation makes integration into existing workflows seamless.
- Docker Ready: Deploy DeepScrape quickly and easily using Docker or docker-compose.
Links
- GitHub Repository: https://github.com/stretchcloud/deepscrape
- License: Apache 2.0