AnyCrawl: A High-Performance Node.js/TypeScript Web Crawler for LLM Data

Introduction

AnyCrawl is a high-performance web crawler and scraping toolkit built with Node.js and TypeScript. It turns raw website content into structured, LLM-ready data for AI development and data analysis. It supports full-site crawling, single-page scraping, and structured SERP (Search Engine Results Page) extraction from search engines such as Google. Native multi-threading keeps bulk tasks fast and scalable.

Installation

The most direct way to self-host AnyCrawl is with Docker Compose, which simplifies deployment and setup.

To run AnyCrawl via Docker Compose:

docker compose up -d

If you enable authentication, you'll need to generate an API key. You can do this by executing a command within the running Docker container:

docker compose exec api pnpm --filter api key:generate -- default

For more detailed installation instructions and configuration options, refer to the official documentation.

Examples

AnyCrawl exposes separate API endpoints for different scraping needs. The two examples here cover page scraping with structured extraction and search-result retrieval.

Web Scraping with LLM Extraction

Beyond scraping a page, AnyCrawl can use an LLM to extract structured data from it according to a JSON schema you provide.

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      }
    }
  }'
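The same request can be assembled from TypeScript. The following is a minimal sketch, not an official client: the endpoint, headers, and body fields are copied from the curl example, while the helper `buildScrapeRequest` and the `JsonSchema` and `ScrapeRequest` types are hypothetical names introduced here for illustration. It only builds the request and does not send it.

```typescript
// Minimal JSON Schema shape, limited to what the curl example uses.
type JsonSchema = {
  type: string;
  properties?: Record<string, JsonSchema>;
  required?: string[];
};

// Plain description of an HTTP request; field names match fetch() options.
interface ScrapeRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper (not part of AnyCrawl): builds the /v1/scrape request
// from the curl example without performing any network call.
function buildScrapeRequest(
  apiKey: string,
  pageUrl: string,
  schema: JsonSchema
): ScrapeRequest {
  return {
    url: "https://api.anycrawl.dev/v1/scrape",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // Body layout mirrors the curl payload: the target URL plus json_options.schema.
    body: JSON.stringify({ url: pageUrl, json_options: { schema } }),
  };
}

const scrapeRequest = buildScrapeRequest(
  "YOUR_ANYCRAWL_API_KEY",
  "https://example.com",
  {
    type: "object",
    properties: {
      company_mission: { type: "string" },
      is_open_source: { type: "boolean" },
      employee_count: { type: "number" },
    },
    required: ["company_mission"],
  }
);

// Print the JSON payload that would be sent.
console.log(scrapeRequest.body);
```

In Node.js 18 or later the result can be sent with `fetch(scrapeRequest.url, scrapeRequest)`. The response format is not described here, so it is best treated as untyped JSON until checked against the official documentation.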

Search Engine Results (SERP)

Retrieve structured search results from engines such as Google.

curl -X POST "https://api.anycrawl.dev/v1/search" \
  -H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "AnyCrawl",
    "limit": 10,
    "engine": "google",
    "lang": "all"
  }'
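A TypeScript equivalent of the search call might look like the sketch below. The endpoint, headers, and the four body fields (`query`, `limit`, `engine`, `lang`) come from the curl example. The `SearchParams` and `SearchRequest` types and the `buildSearchRequest` helper are hypothetical names used only for illustration, and the accepted values for each field are an assumption, so they are typed loosely.

```typescript
// Fields taken from the curl example; only `query` is assumed mandatory here.
interface SearchParams {
  query: string;
  limit?: number;
  engine?: string;
  lang?: string;
}

// Plain description of an HTTP request; field names match fetch() options.
interface SearchRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper (not part of AnyCrawl): builds the /v1/search request
// from the curl example without performing any network call.
function buildSearchRequest(apiKey: string, params: SearchParams): SearchRequest {
  return {
    url: "https://api.anycrawl.dev/v1/search",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // Parameters are passed through unchanged; no defaults are invented.
    body: JSON.stringify(params),
  };
}

const searchRequest = buildSearchRequest("YOUR_ANYCRAWL_API_KEY", {
  query: "AnyCrawl",
  limit: 10,
  engine: "google",
  lang: "all",
});

// Print the JSON payload that would be sent.
console.log(searchRequest.body);
```

As with any request built this way, it can be sent from Node.js 18 or later with `fetch(searchRequest.url, searchRequest)`. Keeping construction separate from sending makes the payload easy to inspect or log before any API credits are spent.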

You can test these APIs and generate code in your preferred language using the AnyCrawl Playground.

Why Use AnyCrawl?

AnyCrawl stands out for several reasons:

LLM-Ready Data: It transforms raw HTML into clean, structured data optimized for Large Language Models, simplifying your AI workflows.
High Performance: Leveraging native multi-threading and multi-process capabilities, AnyCrawl handles bulk tasks efficiently and reliably.
Versatile Scraping: From full-site traversal to single-page content extraction and structured SERP results, it covers a wide range of web data needs.
Ease of Integration: Built with Node.js and TypeScript, it's easy to integrate into existing projects and offers a clear API.
Scalability: Designed for batch processing, it can scale to meet demanding data collection requirements.
