DataTrove: Streamlining Large-Scale Data Processing for LLMs
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
DataTrove is a powerful Python library designed to simplify the complex task of processing, filtering, and deduplicating text data at a massive scale. It offers a collection of customizable, platform-agnostic pipeline blocks, making it ideal for preparing training data for large language models. With support for various execution environments, DataTrove frees developers from scripting madness, enabling efficient and reproducible data workflows.
Introduction
DataTrove is a powerful Python library from Hugging Face designed to streamline the complex process of handling vast amounts of text data. It aims to free data processing from "scripting madness" by offering a robust set of platform-agnostic, customizable pipeline processing blocks. Whether you're preparing training data for large language models (LLMs) or performing extensive data cleaning, DataTrove provides the tools to build efficient and scalable data workflows. It supports various file systems through fsspec, ensuring flexibility in data input and output.
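To make the idea of composable pipeline blocks concrete, here is a minimal sketch in plain Python. It does not use DataTrove's API: the `Doc` class, the block functions, and `run_pipeline` are hypothetical names chosen for illustration. It only assumes that a block can be modelled as a function that consumes a stream of documents and yields a stream of documents.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical stand-in for a document record (not DataTrove's class)."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


# Assumption: a block maps a stream of documents to a stream of documents.
Block = Callable[[Iterable[Doc]], Iterator[Doc]]


def normalize_whitespace(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Collapse runs of whitespace in each document's text."""
    for doc in docs:
        yield Doc(doc.id, " ".join(doc.text.split()), doc.metadata)


def min_length_filter(min_chars: int) -> Block:
    """Build a block that drops documents shorter than min_chars."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc.text) >= min_chars:
                yield doc
    return block


def run_pipeline(source: Iterable[Doc], blocks: list) -> list:
    """Chain the blocks lazily, then materialize the final stream."""
    stream = iter(source)
    for block in blocks:
        stream = block(stream)
    return list(stream)


docs = [
    Doc("a", "  hello   world  "),
    Doc("b", "hi"),
    Doc("c", "a  longer   document here"),
]
result = run_pipeline(docs, [normalize_whitespace, min_length_filter(5)])
print([d.id for d in result])
```

Because each block is a generator, documents flow through the chain one at a time, which keeps memory use low regardless of input size.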
Installation
Getting started with DataTrove is straightforward. You can install it using pip, with optional "flavours" to include specific dependencies for different functionalities:
pip install datatrove[FLAVOUR]
Available flavours include all, io (for various file formats), processing (for text extraction, filtering, tokenization), s3 (for S3 support), cli (for command-line tools), and ray (for distributed compute). You can combine them, for example: pip install datatrove[processing,s3].
Examples
DataTrove comes with several practical examples demonstrating its capabilities for common large-scale data tasks:
- FineWeb Dataset Reproduction: A complete script to reproduce the FineWeb dataset. See fineweb.py.
- Common Crawl Processing: A full pipeline to read Common Crawl WARC files, extract text, filter, and save to S3, runnable on Slurm. See process_common_crawl_dump.py.
- C4 Dataset Tokenization: Reads data directly from Hugging Face Hub to tokenize the English portion of the C4 dataset using the gpt2 tokenizer. See tokenize_c4.py.
- Text Deduplication: Examples for various deduplication techniques, including MinHash, sentence-level exact deduplication, and exact substrings. See minhash_deduplication.py and sentence_deduplication.py.
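For readers unfamiliar with MinHash, the following sketch shows the core idea behind near-duplicate detection using only the standard library. It is not the code in minhash_deduplication.py and does not reflect how DataTrove implements or scales the technique. The function names, the word-trigram shingling, and the choice of 64 hash functions are assumptions made for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Split text into a set of overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


original = "the quick brown fox jumps over the lazy dog near the river bank"
near_dup = "the quick brown fox jumps over the lazy dog near the river shore"
unrelated = "large language models need carefully filtered training data at scale"

sig_original = minhash_signature(original)
sim_near = estimated_jaccard(sig_original, minhash_signature(near_dup))
sim_unrelated = estimated_jaccard(sig_original, minhash_signature(unrelated))
print(f"near duplicate: {sim_near:.2f}, unrelated: {sim_unrelated:.2f}")
```

Comparing fixed-length signatures instead of full shingle sets is what makes the approach practical on large corpora; a production setup would additionally group signatures into buckets so that only likely matches are compared.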
Why Use DataTrove?
DataTrove offers compelling advantages for anyone dealing with large text datasets:
- Scalability and Performance: Designed for very large workloads, it features low memory usage and supports distributed execution across local machines, Slurm clusters, and Ray clusters. Its task-based execution model allows for efficient parallelization.
- Flexible and Modular Pipelines: Build custom data processing pipelines using a wide array of prebuilt blocks for reading, writing, extracting, filtering, deduplicating, and collecting statistics. You can easily extend it with your own custom functions or blocks.
- Robustness and Reproducibility: DataTrove tracks completed tasks, enabling automatic resumption of jobs from the last successful checkpoint. This ensures resilience against failures and promotes reproducible data processing workflows.
- Advanced Synthetic Data Generation: The library includes powerful inference capabilities, supporting vLLM, SGLang, and OpenAI-compatible endpoints for generating synthetic data at scale, complete with checkpointing and progress monitoring.
- Comprehensive Data Insights: Utilize integrated statistics blocks to collect detailed data profiles, offering valuable insights into your dataset's characteristics in a distributed manner.
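The resumption behaviour described above can be illustrated with a small sketch. It is a simplified model, not DataTrove's implementation: the marker-file layout, the `run_tasks` function, and the simulated failure are all assumptions for illustration. The idea is that each task records its completion only after it succeeds, so a rerun skips finished tasks and picks up where the previous run stopped.

```python
import tempfile
from pathlib import Path


def run_tasks(num_tasks: int, work, state_dir: Path) -> list:
    """Run tasks 0..num_tasks-1, skipping any that already have a completion marker."""
    state_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for rank in range(num_tasks):
        marker = state_dir / f"{rank:05d}.done"
        if marker.exists():
            continue  # finished in an earlier run
        work(rank)  # if this raises, no marker is written for this rank
        marker.touch()
        executed.append(rank)
    return executed


attempts = {"count": 0}


def flaky_work(rank: int) -> None:
    """Hypothetical workload that fails once on task 2."""
    if rank == 2 and attempts["count"] == 0:
        attempts["count"] += 1
        raise RuntimeError("simulated failure")


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "completions"
    try:
        run_tasks(4, flaky_work, state)
    except RuntimeError:
        pass  # the first run stops at task 2
    first_done = sorted(p.name for p in state.iterdir())
    resumed = run_tasks(4, flaky_work, state)  # the second run skips tasks 0 and 1

print(first_done, resumed)
```

Writing the marker only after the work returns means a crashed task is retried in full on the next run, which is the safe default when task outputs are not idempotent at a finer granularity.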
Links
- GitHub Repository: Explore the source code, contribute, and stay updated on the project's development: huggingface/datatrove
- Citation: If you use DataTrove in your research or projects, please consider citing it:
@misc{penedo2024datatrove,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Sasko, Mario and Wolf, Thomas},
  title = {DataTrove: large scale data processing},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/huggingface/datatrove}
}