{"name":"sitefetch: Efficiently Scrape Websites for AI Model Training and Analysis","description":"sitefetch is a powerful command-line utility designed to fetch and save entire websites as plain text files. This tool is particularly useful for preparing large datasets for AI model training, allowing easy consumption of web content. It offers flexible options for page matching and content selection, ensuring relevant data extraction.","github":"https://github.com/egoist/sitefetch","url":"https://osrepos.com/repo/egoist-sitefetch","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/egoist-sitefetch","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/egoist-sitefetch.md","json":"https://osrepos.com/repo/egoist-sitefetch.json","topics":["TypeScript","Web Scraping","AI","CLI Tool","Data Extraction","LLM","Web Content","Automation"],"keywords":["TypeScript","Web Scraping","AI","CLI Tool","Data Extraction","LLM","Web Content","Automation"],"stars":null,"summary":"sitefetch is a powerful command-line utility designed to fetch and save entire websites as plain text files. This tool is particularly useful for preparing large datasets for AI model training, allowing easy consumption of web content. It offers flexible options for page matching and content selection, ensuring relevant data extraction.","content":"## Introduction\n\nsitefetch, developed by egoist, is an innovative command-line utility designed to simplify the process of extracting content from entire websites. Its primary function is to fetch web pages and consolidate their readable content into a single, clean text file, making it exceptionally useful for preparing data for AI models and large language models (LLMs). 
This tool streamlines the often complex task of gathering web-based information, transforming it into an easily consumable format for various analytical and machine learning applications.\n\n## Installation\n\nGetting started with sitefetch is straightforward. You can use it for one-off tasks without a global installation or install it globally for frequent use.\n\n**One-off usage (choose one):**\n\n```bash\nbunx sitefetch\nnpx sitefetch\npnpx sitefetch\n```\n\n**Install globally (choose one):**\n\n```bash\nbun i -g sitefetch\nnpm i -g sitefetch\npnpm i -g sitefetch\n```\n\n## Examples\n\nsitefetch provides flexible options to control what and how content is fetched, allowing for precise data extraction.\n\n**Basic Usage:**\n\nTo fetch an entire site and save it to a file:\n\n```bash\nsitefetch https://egoist.dev -o site.txt\n```\n\n**Improved Concurrency:**\n\nFor faster fetching of larger sites, you can specify a concurrency level:\n\n```bash\nsitefetch https://egoist.dev -o site.txt --concurrency 10\n```\n\n**Match Specific Pages:**\n\nUse the `-m, --match` flag to include only specific pages based on their pathnames. This feature is powered by [micromatch](https://github.com/micromatch/micromatch#matching-features){:target=\"_blank\"}, offering powerful pattern matching capabilities.\n\n```bash\nsitefetch https://vite.dev -m \"/blog/**\" -m \"/guide/**\"\n```\n\n**Content Selector:**\n\nWhile sitefetch uses [mozilla/readability](https://github.com/mozilla/readability){:target=\"_blank\"} for content extraction, you can specify a custom CSS selector with `--content-selector` if the default extraction is not optimal for a particular page.\n\n```bash\nsitefetch https://vite.dev --content-selector \".content\"\n```\n\n## Why use it\n\nsitefetch stands out as an essential tool for anyone working with web data, particularly in the realm of artificial intelligence. 
Its ability to transform complex website structures into clean, readable text files significantly streamlines the data preparation phase for training AI models, fine-tuning LLMs, or performing large-scale text analysis. By offering features like page matching and content selectors, it ensures that you only extract the most relevant information, saving time and computational resources. This makes sitefetch an invaluable asset for researchers, developers, and data scientists looking to leverage web content efficiently.\n\n## Links\n\nExplore sitefetch further through these resources:\n\n*   [GitHub Repository](https://github.com/egoist/sitefetch){:target=\"_blank\"}\n*   For programmatic use, sitefetch offers a simple API. Check out the [API documentation within the repository](https://github.com/egoist/sitefetch/blob/main/src/types.ts){:target=\"_blank\"} for details on `fetchSite`.\n*   Also, consider checking out egoist's LLM chat app: [Chatwise.app](https://chatwise.app){:target=\"_blank\"}","metrics":{"detailViews":4,"githubClicks":7},"dates":{"published":null,"modified":"2025-10-12T00:20:33.000Z"}}