{"name":"AnyCrawl: A High-Performance Node.js/TypeScript Web Crawler for LLM Data","description":"AnyCrawl is a powerful Node.js/TypeScript web crawler designed to transform websites into LLM-ready data. It excels at extracting structured SERP results from various search engines and features native multi-threading for efficient bulk processing, making it ideal for large-scale data collection.","github":"https://github.com/any4ai/AnyCrawl","url":"https://osrepos.com/repo/any4ai-anycrawl","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/any4ai-anycrawl","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/any4ai-anycrawl.md","json":"https://osrepos.com/repo/any4ai-anycrawl.json","topics":["web-crawling","data-extraction","llm-data","serp-scraping","typescript","nodejs","ai-tools","web-scraping"],"keywords":["web-crawling","data-extraction","llm-data","serp-scraping","typescript","nodejs","ai-tools","web-scraping"],"stars":null,"summary":"AnyCrawl is a powerful Node.js/TypeScript web crawler designed to transform websites into LLM-ready data. It excels at extracting structured SERP results from various search engines and features native multi-threading for efficient bulk processing, making it ideal for large-scale data collection.","content":"## Introduction\n\nAnyCrawl is a high-performance, Node.js/TypeScript web crawler and scraping toolkit designed to efficiently gather data from the web. It specializes in transforming raw website content into structured, LLM-ready data, making it an invaluable tool for AI development and data analysis. AnyCrawl supports various operations, including comprehensive site crawling, single-page web scraping, and structured SERP (Search Engine Results Page) data extraction from major search engines like Google. 
Its native multi-threading capabilities ensure fast and scalable processing for bulk tasks.\n\n## Installation\n\nGetting started with AnyCrawl is straightforward, especially using Docker Compose for self-hosting. This method simplifies deployment and setup.\n\nTo run AnyCrawl via Docker Compose:\n\n```bash\ndocker compose up -d\n```\n\nIf you enable authentication, you'll need to generate an API key. You can do this by executing a command within the running Docker container:\n\n```bash\ndocker compose exec api pnpm --filter api key:generate -- default\n```\n\nFor more detailed installation instructions and configuration options, refer to the [official documentation](https://docs.anycrawl.dev).\n\n## Examples\n\nAnyCrawl offers flexible APIs for different scraping needs. Here are a couple of examples demonstrating its power:\n\n### Web Scraping with LLM Extraction\n\nAnyCrawl can not only scrape web pages but also extract structured data using LLM-powered capabilities, based on a provided JSON schema.\n\n```bash\ncurl -X POST \"https://api.anycrawl.dev/v1/scrape\" \\\n  -H \"Authorization: Bearer YOUR_ANYCRAWL_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"url\": \"https://example.com\",\n    \"json_options\": {\n      \"schema\": {\n        \"type\": \"object\",\n        \"properties\": {\n          \"company_mission\": { \"type\": \"string\" },\n          \"is_open_source\": { \"type\": \"boolean\" },\n          \"employee_count\": { \"type\": \"number\" }\n        },\n        \"required\": [\"company_mission\"]\n      }\n    }\n  }'\n```\n\n### Search Engine Results (SERP)\n\nExtract structured search results from engines like Google with ease.\n\n```bash\ncurl -X POST https://api.anycrawl.dev/v1/search \\\n  -H 'Content-Type: application/json' \\\n  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \\\n  -d '{\n  \"query\": \"AnyCrawl\",\n  \"limit\": 10,\n  \"engine\": \"google\",\n  \"lang\": \"all\"\n}'\n```\n\nYou can test these APIs and generate code in 
your preferred language using the [AnyCrawl Playground](https://anycrawl.dev/playground).\n\n## Why Use AnyCrawl?\n\nAnyCrawl stands out for several reasons:\n\n*   **LLM-Ready Data**: It transforms raw HTML into clean, structured data optimized for Large Language Models, simplifying your AI workflows.\n*   **High Performance**: Leveraging native multi-threading and multi-process capabilities, AnyCrawl handles bulk tasks efficiently and reliably.\n*   **Versatile Scraping**: From full-site traversal to single-page content extraction and structured SERP results, it covers a wide range of web data needs.\n*   **Ease of Integration**: Built with Node.js and TypeScript, it's easy to integrate into existing projects and offers a clear API.\n*   **Scalability**: Designed for batch processing, it can scale to meet demanding data collection requirements.\n\n## Links\n\n*   **GitHub Repository**: [any4ai/AnyCrawl](https://github.com/any4ai/AnyCrawl)\n*   **Official Documentation**: [docs.anycrawl.dev](https://docs.anycrawl.dev)\n*   **API Playground**: [anycrawl.dev/playground](https://anycrawl.dev/playground)","metrics":{"detailViews":5,"githubClicks":8},"dates":{"published":null,"modified":"2025-10-12T06:20:38.000Z"}}