{"name":"DeepScrape: Intelligent Web Scraping & LLM-Powered Data Extraction","description":"DeepScrape is an AI-powered web scraping tool designed for intelligent data extraction using LLMs. It leverages Playwright for browser automation and supports both cloud (OpenAI) and local LLMs (Ollama, vLLM) for transforming web content into structured JSON. This versatile tool is ideal for modern web applications, RAG pipelines, and various data workflows, offering privacy-first data processing.","github":"https://github.com/stretchcloud/deepscrape","url":"https://osrepos.com/repo/stretchcloud-deepscrape","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/stretchcloud-deepscrape","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/stretchcloud-deepscrape.md","json":"https://osrepos.com/repo/stretchcloud-deepscrape.json","topics":["web scraping","LLM","AI","data extraction","TypeScript","Playwright","automation","API"],"keywords":["web scraping","LLM","AI","data extraction","TypeScript","Playwright","automation","API"],"stars":null,"summary":"DeepScrape is an AI-powered web scraping tool designed for intelligent data extraction using LLMs. It leverages Playwright for browser automation and supports both cloud (OpenAI) and local LLMs (Ollama, vLLM) for transforming web content into structured JSON. This versatile tool is ideal for modern web applications, RAG pipelines, and various data workflows, offering privacy-first data processing.","content":"## Introduction\n\nDeepScrape is an innovative, AI-powered web scraping solution that intelligently extracts structured data from any website. Built with Playwright for robust browser automation, it seamlessly integrates with Large Language Models (LLMs) to transform raw web content into actionable JSON. 
Whether you need to power modern web applications, enhance RAG pipelines, or streamline data workflows, DeepScrape offers a flexible and privacy-conscious approach. It supports both cloud-based LLMs like OpenAI and local models such as Ollama and vLLM, so data processing can stay entirely on-premises.\n\nKey features include:\n*   **LLM Extraction**: Convert web content to structured JSON using AI.\n*   **Batch Processing**: Efficiently handle multiple URLs with controlled concurrency.\n*   **API-first**: Easy integration via REST endpoints with API key security and Swagger documentation.\n*   **Browser Automation**: Full Playwright support, including stealth mode.\n*   **Local LLM Support**: Run LLM inference locally so extracted content stays on your own machine.\n*   **Web Crawling**: Multi-page crawling with configurable strategies.\n\n## Installation\n\nFollow these steps to set up the project locally:\n\n1.  **Clone the repository:**\n    ```bash\n    git clone https://github.com/stretchcloud/deepscrape.git\n    cd deepscrape\n    ```\n\n2.  **Install dependencies:**\n    ```bash\n    npm install\n    ```\n\n3.  **Configure environment variables:**\n    Copy the example environment file and edit it with your settings. You can choose between OpenAI and a local LLM provider such as Ollama.\n    ```bash\n    cp .env.example .env\n    ```\n    Edit `.env` (example for OpenAI and Ollama):\n    ```env\n    API_KEY=your-secret-key\n\n    # Option 1: Use OpenAI (cloud)\n    LLM_PROVIDER=openai\n    OPENAI_API_KEY=your-openai-key\n\n    # Option 2: Use local model (e.g., Ollama)\n    # LLM_PROVIDER=ollama\n    # LLM_MODEL=llama3:latest\n\n    REDIS_HOST=localhost\n    CACHE_ENABLED=true\n    ```\n\n4.  
**Start the server:**\n    ```bash\n    npm run dev\n    ```\n    You can test if the server is running by visiting `http://localhost:3000/health` or using `curl http://localhost:3000/health`.\n\n## Examples\n\nDeepScrape exposes a REST API for scraping and extraction tasks. Here are some common use cases:\n\n### Basic Scraping\n\nScrape content from a single URL and output it as Markdown:\n\n```bash\ncurl -X POST http://localhost:3000/api/scrape \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-secret-key\" \\\n  -d '{\n    \"url\": \"https://example.com\",\n    \"options\": { \"extractorFormat\": \"markdown\" }\n  }' | jq -r '.content' > content.md\n```\n\n### Schema-Based Extraction\n\nExtract structured data from a URL using a predefined JSON Schema:\n\n```bash\ncurl -X POST http://localhost:3000/api/extract-schema \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-secret-key\" \\\n  -d '{\n    \"url\": \"https://news.example.com/article\",\n    \"schema\": {\n      \"type\": \"object\",\n      \"properties\": {\n        \"title\": {\n          \"type\": \"string\",\n          \"description\": \"Article headline\"\n        },\n        \"author\": {\n          \"type\": \"string\",\n          \"description\": \"Author name\"\n        },\n        \"publishDate\": {\n          \"type\": \"string\",\n          \"description\": \"Publication date\"\n        }\n      },\n      \"required\": [\"title\"]\n    }\n  }' | jq -r '.extractedData' > schemadata.md\n```\n\n### Summarize URL Content\n\nGenerate a concise summary of a URL's content using an LLM:\n\n```bash\ncurl -X POST http://localhost:3000/api/summarize \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-secret-key\" \\\n  -d '{\n    \"url\": \"https://en.wikipedia.org/wiki/Large_language_model\",\n    \"maxLength\": 300,\n    \"options\": {\n      \"temperature\": 0.3,\n      \"waitForSelector\": \"body\",\n      \"extractorFormat\": \"markdown\"\n    }\n  }' | jq 
-r '.summary' > summary-output.md\n```\n\n### Batch Processing\n\nProcess multiple URLs with controlled concurrency and download the results as a ZIP archive:\n\n1.  **Start Batch Processing:**\n    ```bash\n    curl -X POST http://localhost:3000/api/batch/scrape \\\n      -H \"Content-Type: application/json\" \\\n      -H \"X-API-Key: your-secret-key\" \\\n      -d '{\n        \"urls\": [\n          \"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart\",\n          \"https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/deploy-vais-prompt\"\n        ],\n        \"concurrency\": 2,\n        \"options\": {\n          \"extractorFormat\": \"markdown\"\n        }\n      }'\n    ```\n    (Note: The response will include a `batchId` and `statusUrl`.)\n\n2.  **Monitor Batch Progress:**\n    ```bash\n    curl -X GET http://localhost:3000/api/batch/scrape/{batchId}/status \\\n      -H \"X-API-Key: your-secret-key\"\n    ```\n\n3.  **Download Results as ZIP Archive:**\n    ```bash\n    curl -X GET \"http://localhost:3000/api/batch/scrape/{batchId}/download/zip?format=markdown\" \\\n      -H \"X-API-Key: your-secret-key\" \\\n      --output \"batch_results.zip\"\n    ```\n\n### Web Crawling\n\nInitiate a multi-page crawl with configurable limits and strategies:\n\n```bash\ncurl -X POST http://localhost:3000/api/crawl \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-secret-key\" \\\n  -d '{\n    \"url\": \"https://docs.example.com\",\n    \"limit\": 10,\n    \"maxDepth\": 2,\n    \"strategy\": \"bfs\",\n    \"includePaths\": [\"^/docs/.*\"],\n    \"scrapeOptions\": {\n      \"extractorFormat\": \"markdown\"\n    }\n  }'\n```\n\n(The response will include an `id` and `outputDirectory`.)\n\n## Why Use DeepScrape?\n\nDeepScrape stands out as a comprehensive and flexible solution for web data extraction, offering several compelling advantages:\n\n*   **AI-Powered Precision**: Leverage the power of LLMs for intelligent content 
understanding and structured data extraction, going beyond traditional CSS selectors.\n*   **Unmatched Flexibility**: Use a cloud LLM such as OpenAI, or keep data on your own hardware with local models such as Ollama, vLLM, and LocalAI.\n*   **Scalability and Efficiency**: Built-in batch processing, job queues (BullMQ/Redis), and web crawling support large-scale data collection.\n*   **Full Browser Automation**: Use Playwright to interact with dynamic, JavaScript-heavy websites, including actions like clicks, scrolls, and waits.\n*   **Privacy-First Design**: With a local LLM, sensitive content can be processed entirely on your own infrastructure rather than sent to a third-party API, which can help when working under regulations such as GDPR or HIPAA.\n*   **Developer-Friendly API**: An API-first design with clear REST endpoints and Swagger documentation simplifies integration into existing workflows.\n*   **Docker Ready**: Deploy DeepScrape using Docker or `docker-compose`.\n\n## Links\n\n*   **GitHub Repository**: <https://github.com/stretchcloud/deepscrape>\n*   **License**: <https://github.com/stretchcloud/deepscrape/blob/main/LICENSE>","metrics":{"detailViews":8,"githubClicks":6},"dates":{"published":null,"modified":"2025-12-19T20:01:00.000Z"}}