# sitefetch: Efficiently Scrape Websites for AI Model Training and Analysis

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Source: osrepos.com
Repository profile: https://osrepos.com/repo/egoist-sitefetch
Generated for open source discovery and AI-assisted research.

sitefetch is a command-line utility that fetches an entire website and saves its readable content as a single plain text file. It is particularly useful for preparing datasets for AI model training, since the saved text is easy for models to consume. Options for page matching and content selection let you restrict extraction to the relevant pages and page regions.

GitHub: https://github.com/egoist/sitefetch

## Topics

- TypeScript
- Web Scraping
- AI
- CLI Tool
- Data Extraction
- LLM
- Web Content
- Automation

## Repository Information

Last analyzed by OSRepos: Sun Oct 12 2025 01:20:33 GMT+0100 (Western European Summer Time)
Detail views: 4
GitHub clicks: 7

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Content

## Introduction

sitefetch, developed by egoist, is a command-line utility for extracting content from entire websites. It fetches a site's pages and consolidates their readable content into a single text file, which makes it well suited to preparing data for AI models and large language models (LLMs). This replaces the manual work of gathering pages one by one with a single command whose output can be used directly in analytical and machine learning applications.

## Installation

Getting started with sitefetch is straightforward. You can use it for one-off tasks without a global installation or install it globally for frequent use.

**One-off usage (choose one):**

```bash
bunx sitefetch
npx sitefetch
pnpx sitefetch
```


**Install globally (choose one):**

```bash
bun i -g sitefetch
npm i -g sitefetch
pnpm i -g sitefetch
```


## Examples

sitefetch provides flexible options to control what and how content is fetched, allowing for precise data extraction.

**Basic Usage:**

To fetch an entire site and save it to a file:

```bash
sitefetch https://egoist.dev -o site.txt
```


**Improved Concurrency:**

For faster fetching of larger sites, you can specify a concurrency level:

```bash
sitefetch https://egoist.dev -o site.txt --concurrency 10
```
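
As a rough illustration of what a concurrency level means in general, the TypeScript sketch below runs a list of tasks with at most `limit` of them in flight at once. It is not sitefetch's actual implementation: `runWithConcurrency` and the stand-in fetcher in the usage lines are hypothetical names introduced only for this example.

```typescript
// Hypothetical helper, not part of sitefetch: runs `worker` over `items`
// with at most `limit` calls in flight at any moment.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  // Start up to `limit` runners (at least one) and wait for all of them.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Usage with a stand-in fetcher, so the sketch runs without network access.
const demoPaths = ["/", "/blog", "/guide"];
runWithConcurrency(demoPaths, 2, async (path) => `fetched ${path}`).then(
  (pages) => console.log(pages),
);
```

Results are stored by index, so the output order follows the input order even when tasks finish out of order. A higher limit shortens total time only while the target server and your network can keep up.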


**Match Specific Pages:**

Use the `-m, --match` flag to include only pages whose pathnames match a pattern. The flag can be repeated to supply several patterns. Matching is handled by [micromatch](https://github.com/micromatch/micromatch#matching-features), so glob patterns such as `/blog/**` are supported.

```bash
sitefetch https://vite.dev -m "/blog/**" -m "/guide/**"
```
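
To make the pattern semantics concrete, here is a minimal TypeScript sketch of glob matching on pathnames. It is a simplified stand-in for micromatch, not the library itself: it handles only `*` (within one path segment) and `**` (across segments), and `globToRegExp` and `matchesAny` are hypothetical names. Edge cases, such as whether `/blog/**` also matches the bare `/blog` path, may differ from micromatch's behavior.

```typescript
// Simplified illustration only; real micromatch supports far more syntax
// (braces, negation, extglobs, and so on).
function globToRegExp(pattern: string): RegExp {
  let source = "";
  let i = 0;
  while (i < pattern.length) {
    const ch = pattern.charAt(i);
    if (ch === "*" && pattern.charAt(i + 1) === "*") {
      source += ".*"; // `**` may cross `/` boundaries
      i += 2;
    } else if (ch === "*") {
      source += "[^/]*"; // `*` stays inside a single path segment
      i += 1;
    } else {
      // Escape characters that have a special meaning in regular expressions.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// A pathname is kept when it matches at least one pattern,
// which is how repeated -m flags are assumed to combine here.
function matchesAny(pathname: string, patterns: string[]): boolean {
  return patterns.some((p) => globToRegExp(p).test(pathname));
}

const patterns = ["/blog/**", "/guide/**"];
console.log(matchesAny("/blog/2024/release", patterns));
console.log(matchesAny("/config/", patterns));
```

With these two patterns, a page under `/blog/` or `/guide/` is kept and everything else is skipped, which is the filtering the command above asks for.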


**Content Selector:**

While sitefetch uses [mozilla/readability](https://github.com/mozilla/readability) for content extraction, you can specify a custom CSS selector with `--content-selector` if the default extraction is not optimal for a particular page.

```bash
sitefetch https://vite.dev --content-selector ".content"
```


## Why Use It

sitefetch is aimed at anyone who works with web data, particularly for artificial intelligence. It turns a website's pages into clean, readable text, which shortens the data preparation phase for training AI models, fine-tuning LLMs, or performing large-scale text analysis. Page matching (`--match`) and content selectors (`--content-selector`) limit extraction to the relevant pages and page regions, saving time and computational resources. This makes it useful to researchers, developers, and data scientists who need web content in a form that models can consume.

## Links

Explore sitefetch further through these resources:

*   [GitHub Repository](https://github.com/egoist/sitefetch)
*   For programmatic use, sitefetch offers a simple API. Check out the [API documentation within the repository](https://github.com/egoist/sitefetch/blob/main/src/types.ts) for details on `fetchSite`.
*   Also, consider checking out egoist's LLM chat app: [Chatwise.app](https://chatwise.app)