{"name":"DataTrove: Streamlining Large-Scale Data Processing for LLMs","description":"DataTrove is a powerful Python library designed to simplify the complex task of processing, filtering, and deduplicating text data at a massive scale. It offers a collection of customizable, platform-agnostic pipeline blocks, making it ideal for preparing training data for large language models. With support for various execution environments, DataTrove frees developers from scripting madness, enabling efficient and reproducible data workflows.","github":"https://github.com/huggingface/datatrove","url":"https://osrepos.com/repo/huggingface-datatrove","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/huggingface-datatrove","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/huggingface-datatrove.md","json":"https://osrepos.com/repo/huggingface-datatrove.json","topics":["python","data-processing","llm-training","text-deduplication","data-pipelines","distributed-computing","ai","nlp"],"keywords":["python","data-processing","llm-training","text-deduplication","data-pipelines","distributed-computing","ai","nlp"],"stars":null,"summary":"DataTrove is a powerful Python library designed to simplify the complex task of processing, filtering, and deduplicating text data at a massive scale. It offers a collection of customizable, platform-agnostic pipeline blocks, making it ideal for preparing training data for large language models. With support for various execution environments, DataTrove frees developers from scripting madness, enabling efficient and reproducible data workflows.","content":"## Introduction\n\nDataTrove is a powerful Python library from Hugging Face designed to streamline the complex process of handling vast amounts of text data. 
It aims to free data processing from \"scripting madness\" by offering a robust set of platform-agnostic, customizable pipeline processing blocks. Whether you're preparing training data for large language models (LLMs) or performing extensive data cleaning, DataTrove provides the tools to build efficient and scalable data workflows. It supports various file systems through `fsspec`, ensuring flexibility in data input and output.\n\n## Installation\n\nGetting started with DataTrove is straightforward. You can install it using pip, with optional \"flavours\" to include specific dependencies for different functionalities:\n\n```bash\npip install datatrove[FLAVOUR]\n```\n\nAvailable flavours include `all`, `io` (for various file formats), `processing` (for text extraction, filtering, tokenization), `s3` (for S3 support), `cli` (for command-line tools), and `ray` (for distributed compute). You can combine them, for example: `pip install datatrove[processing,s3]`.\n\n## Examples\n\nDataTrove comes with several practical examples demonstrating its capabilities for common large-scale data tasks:\n\n*   **FineWeb Dataset Reproduction:** A complete script to reproduce the [FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb){target=\"_blank\"}. See [`fineweb.py`](https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py){target=\"_blank\"}.\n*   **Common Crawl Processing:** A full pipeline to read Common Crawl WARC files, extract text, filter, and save to S3, runnable on Slurm. See [`process_common_crawl_dump.py`](https://github.com/huggingface/datatrove/blob/main/examples/process_common_crawl_dump.py){target=\"_blank\"}.\n*   **C4 Dataset Tokenization:** Reads data directly from Hugging Face Hub to tokenize the English portion of the C4 dataset using the `gpt2` tokenizer. 
See [`tokenize_c4.py`](https://github.com/huggingface/datatrove/blob/main/examples/tokenize_c4.py){target=\"_blank\"}.\n*   **Text Deduplication:** Examples for various deduplication techniques, including MinHash, sentence-level exact deduplication, and exact substrings. See [`minhash_deduplication.py`](https://github.com/huggingface/datatrove/blob/main/examples/minhash_deduplication.py){target=\"_blank\"} and [`sentence_deduplication.py`](https://github.com/huggingface/datatrove/blob/main/examples/sentence_deduplication.py){target=\"_blank\"}.\n\n## Why Use DataTrove?\n\nDataTrove offers compelling advantages for anyone dealing with large text datasets:\n\n*   **Scalability and Performance:** Designed for very large workloads, it features low memory usage and supports distributed execution across local machines, Slurm clusters, and Ray clusters. Its task-based execution model allows for efficient parallelization.\n*   **Flexible and Modular Pipelines:** Build custom data processing pipelines using a wide array of prebuilt blocks for reading, writing, extracting, filtering, deduplicating, and collecting statistics. You can easily extend it with your own custom functions or blocks.\n*   **Robustness and Reproducibility:** DataTrove tracks completed tasks, enabling automatic resumption of jobs from the last successful checkpoint. 
This ensures resilience against failures and promotes reproducible data processing workflows.\n*   **Advanced Synthetic Data Generation:** The library includes powerful inference capabilities, supporting vLLM, SGLang, and OpenAI-compatible endpoints for generating synthetic data at scale, complete with checkpointing and progress monitoring.\n*   **Comprehensive Data Insights:** Utilize integrated statistics blocks to collect detailed data profiles, offering valuable insights into your dataset's characteristics in a distributed manner.\n\n## Links\n\n*   **GitHub Repository:** Explore the source code, contribute, and stay updated on the project's development: [`huggingface/datatrove`](https://github.com/huggingface/datatrove){target=\"_blank\"}\n*   **Citation:** If you use DataTrove in your research or projects, please consider citing it:\n\n```bibtex\n@misc{penedo2024datatrove,\n  author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Sasko, Mario and Wolf, Thomas},\n  title = {DataTrove: large scale data processing},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  url = {https://github.com/huggingface/datatrove}\n}\n```","metrics":{"detailViews":8,"githubClicks":6},"dates":{"published":null,"modified":"2026-01-27T00:00:29.000Z"}}