sitefetch: Efficiently Scrape Websites for AI Model Training and Analysis
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
sitefetch is a command-line utility that fetches an entire website and saves its content as a plain text file. It is aimed at preparing web content for AI models, for example as training data. Options for matching pages by pathname and for choosing which part of each page to extract keep the output limited to relevant content.
Introduction
sitefetch, developed by egoist, is a command-line utility for extracting content from entire websites. It fetches a site's pages and consolidates their readable content into a single text file, a format that is easy to feed to AI models and large language models (LLMs) or to use in other analysis and machine learning work.
Installation
Getting started with sitefetch is straightforward. You can use it for one-off tasks without a global installation or install it globally for frequent use.
One-off usage (choose one):
bunx sitefetch
npx sitefetch
pnpx sitefetch
Install globally (choose one):
bun i -g sitefetch
npm i -g sitefetch
pnpm i -g sitefetch
Examples
sitefetch provides flexible options to control what and how content is fetched, allowing for precise data extraction.
Basic Usage:
To fetch an entire site and save it to a file:
sitefetch https://egoist.dev -o site.txt
Improved Concurrency:
For faster fetching of larger sites, you can specify a concurrency level:
sitefetch https://egoist.dev -o site.txt --concurrency 10
Match Specific Pages:
Use the -m, --match flag to include only pages whose pathnames match a pattern. The flag can be repeated to supply several patterns. Matching is done with micromatch, so glob patterns such as /blog/** are supported.
sitefetch https://vite.dev -m "/blog/**" -m "/guide/**"
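To get a feel for which pathnames the two patterns above would keep, here is a rough sketch in plain shell. It is only an approximation: sitefetch matches with micromatch, not with shell patterns, and the pathnames in the loop are made-up examples rather than a listing of real pages on vite.dev. The helper name path_selected is hypothetical.

```shell
#!/bin/sh
# Approximation only: sitefetch uses micromatch, not shell patterns.
# In a `case` pattern, `*` also matches "/" characters, which roughly
# mimics the `**` in "/blog/**" for the simple prefixes used here.
path_selected() {
  case "$1" in
    /blog/*|/guide/*) return 0 ;;  # pathname is under /blog/ or /guide/
    *) return 1 ;;                 # anything else is left out
  esac
}

# Hypothetical pathnames, chosen only to show both outcomes.
for path in /blog/announcing-vite /guide/features/css /config/ /team; do
  if path_selected "$path"; then
    echo "keep  $path"
  else
    echo "skip  $path"
  fi
done
```

The loop reports the two pathnames under /blog/ and /guide/ as kept and the other two as skipped. Edge cases such as a bare /blog without a trailing segment may behave differently under micromatch, so check its documentation before relying on them.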
Content Selector:
By default, sitefetch extracts content with mozilla/readability. If the default extraction does not capture the right part of a page, pass a CSS selector with --content-selector to choose the content element yourself.
sitefetch https://vite.dev --content-selector ".content"
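The options above can be wrapped in a small script when several sites need the same treatment. The following is a minimal sketch, not part of sitefetch: the helpers output_name and fetch_site are hypothetical names, the concurrency value is simply the one from the earlier example, and it assumes that sitefetch is installed globally and that -o, --concurrency, -m, and --content-selector can be combined in a single call.

```shell
#!/bin/sh
# Hypothetical helper: derive an output file name from a site URL,
# e.g. https://vite.dev -> vite.dev.txt
output_name() {
  host=${1#*://}     # drop the scheme ("https://")
  host=${host%%/*}   # drop any path after the host name
  printf '%s.txt\n' "$host"
}

# Hypothetical wrapper: fetch one site into a file named after its host.
# Extra arguments (for example -m or --content-selector) are passed through.
# Assumes sitefetch is on PATH and accepts these flags together.
fetch_site() {
  url=$1
  shift
  sitefetch "$url" -o "$(output_name "$url")" --concurrency 10 "$@"
}

# Example calls (commented out so that sourcing this file fetches nothing):
# fetch_site https://egoist.dev
# fetch_site https://vite.dev -m "/blog/**" -m "/guide/**"
```

Naming each output after the host keeps one text file per site, which makes it easy to see which file came from where when the results are later fed to a model.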
Why use it
sitefetch is useful for anyone who needs web content as text, particularly for AI work. It turns a website into a clean, readable text file, which shortens the data preparation step for training AI models, fine-tuning LLMs, or running large-scale text analysis. Page matching and content selectors limit the output to the relevant pages and page regions, which saves time and computational resources for researchers, developers, and data scientists.
Links
Explore sitefetch further through these resources:
- GitHub Repository
- For programmatic use, sitefetch offers a simple API. Check out the API documentation within the repository for details on fetchSite.
- Also, consider checking out egoist's LLM chat app: Chatwise.app