DataTrove: Streamlining Large-Scale Data Processing for LLMs

Summary

DataTrove is a Python library from Hugging Face for processing, filtering, and deduplicating text data at very large scale. It provides a collection of customizable, platform-agnostic pipeline blocks, which makes it well suited to preparing training data for large language models. The same pipeline can run locally, on Slurm, or on Ray, so developers can replace one-off scripts ("scripting madness") with efficient, reproducible data workflows.

Introduction

DataTrove is developed at Hugging Face to handle very large volumes of text data. Its stated aim is to free data processing from "scripting madness" by providing a set of platform-agnostic, customizable pipeline processing blocks. Whether you are preparing training data for large language models (LLMs) or running extensive data cleaning, you assemble these blocks into a pipeline and let the library run it at scale. Input and output go through fsspec, so the same pipeline can read from and write to local disks or remote storage.

Installation

Getting started with DataTrove is straightforward. You can install it using pip, with optional "flavours" to include specific dependencies for different functionalities:

pip install datatrove[FLAVOUR]

Available flavours include all, io (for various file formats), processing (for text extraction, filtering, tokenization), s3 (for S3 support), cli (for command-line tools), and ray (for distributed compute). You can combine them, for example: pip install datatrove[processing,s3].
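
Once the package is installed, a first pipeline might look roughly like the sketch below. It assumes a DataTrove version that exposes LocalPipelineExecutor, JsonlReader, LambdaFilter, and JsonlWriter at the import paths shown, plus whatever extras JSONL reading needs in your environment. The helper name build_executor, the MIN_CHARS threshold, and the folder names are illustrative and not part of the library.

```python
# Assumption: threshold chosen only for illustration.
MIN_CHARS = 200


def build_executor(input_dir: str, output_dir: str, logging_dir: str):
    """Assemble a small local pipeline: read JSONL, drop short documents, write JSONL.

    build_executor is a hypothetical helper for this example, not a DataTrove function.
    """
    # Imported inside the function so this file still loads where datatrove is absent.
    from datatrove.executor import LocalPipelineExecutor
    from datatrove.pipeline.filters import LambdaFilter
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers import JsonlWriter

    return LocalPipelineExecutor(
        pipeline=[
            JsonlReader(input_dir),  # one Document per JSON line
            LambdaFilter(lambda doc: len(doc.text) >= MIN_CHARS),  # keep longer texts
            JsonlWriter(output_dir),  # write the surviving documents
        ],
        tasks=4,  # number of independent units of work
        workers=2,  # how many tasks run at the same time
        logging_dir=logging_dir,  # logs and task completion state are kept here
    )
```

Calling build_executor("data/raw", "data/filtered", "logs/filter_short").run() would start the job. The three paths are placeholders.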

Examples

DataTrove comes with several practical examples demonstrating its capabilities for common large-scale data tasks:

FineWeb Dataset Reproduction: A complete script to reproduce the FineWeb dataset. See fineweb.py.
Common Crawl Processing: A full pipeline to read Common Crawl WARC files, extract text, filter, and save to S3, runnable on Slurm. See process_common_crawl_dump.py.
C4 Dataset Tokenization: Reads data directly from the Hugging Face Hub and tokenizes the English portion of the C4 dataset with the gpt2 tokenizer. See tokenize_c4.py.
Text Deduplication: Examples for various deduplication techniques, including MinHash, sentence-level exact deduplication, and exact substrings. See minhash_deduplication.py and sentence_deduplication.py.
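
To give a feel for what MinHash deduplication does, here is a small standard-library sketch of the general technique: split each text into word n-grams, keep the minimum hash per seeded hash function, and estimate Jaccard similarity as the share of matching signature positions. It is not DataTrove's implementation or the code in minhash_deduplication.py. The function names, the 3-word shingles, and the 64 hash functions are assumptions chosen for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(gram: str, seed: int) -> int:
    """64-bit hash of `gram`; the seed is supplied as a BLAKE2b salt."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text: str, num_hashes: int = 64, n: int = 3) -> list[int]:
    """For each seeded hash function, keep the smallest hash over all shingles."""
    grams = shingles(text, n)
    if not grams:
        raise ValueError("cannot build a signature for empty text")
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Illustrative documents: doc_b differs from doc_a in one word, doc_c is unrelated.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "completely unrelated sentence about training large language models on web text"

near = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
far = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
print(f"near-duplicate estimate: {near:.2f}, unrelated estimate: {far:.2f}")
```

At scale, MinHash is usually paired with bucketing of the signatures so that candidate pairs can be found without comparing every pair of documents. That step is left out here to keep the sketch short.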

Why Use DataTrove?

DataTrove offers compelling advantages for anyone dealing with large text datasets:

Scalability and Performance: Designed for very large workloads, it keeps memory usage low and runs on a local machine, on Slurm clusters, or on Ray clusters. A job is split into independent tasks, so the work parallelizes across however many workers are available.
Flexible and Modular Pipelines: Build custom data processing pipelines using a wide array of prebuilt blocks for reading, writing, extracting, filtering, deduplicating, and collecting statistics. You can easily extend it with your own custom functions or blocks.
Robustness and Reproducibility: DataTrove tracks which tasks have completed, so a job that is relaunched after a failure skips the finished tasks and reruns only the remaining ones. This makes long jobs resilient to failures and supports reproducible data processing workflows.
Advanced Synthetic Data Generation: The library's inference support covers vLLM, SGLang, and OpenAI-compatible endpoints for generating synthetic data at scale, with checkpointing and progress monitoring.
Comprehensive Data Insights: Integrated statistics blocks compute detailed data profiles in a distributed manner, giving you a picture of your dataset's characteristics.
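
The task model and resumption behaviour in the list above can be pictured with a short standard-library sketch. It is an illustration, not DataTrove's code: the helper names, the strided file assignment, and the completions folder of empty marker files are assumptions made for this example.

```python
import tempfile
from collections.abc import Callable
from pathlib import Path


def shard_for_task(files: list[str], rank: int, world_size: int) -> list[str]:
    """Give task `rank` every `world_size`-th file, starting at index `rank`."""
    return files[rank::world_size]


def run_tasks(
    files: list[str],
    world_size: int,
    logging_dir: Path,
    process: Callable[[str], None],
) -> list[int]:
    """Run every task that has no completion marker; return the ranks executed."""
    completions = logging_dir / "completions"
    completions.mkdir(parents=True, exist_ok=True)
    executed = []
    for rank in range(world_size):
        marker = completions / f"{rank:05d}"
        if marker.exists():
            continue  # this task finished in an earlier run
        for path in shard_for_task(files, rank, world_size):
            process(path)  # an exception here leaves the task unmarked
        marker.touch()  # written only after the whole shard succeeded
        executed.append(rank)
    return executed


# Illustrative run: the second call finds both markers and does no work.
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [f"shard_{i}.jsonl" for i in range(5)]
    first = run_tasks(demo_files, 2, Path(tmp), process=lambda path: None)
    second = run_tasks(demo_files, 2, Path(tmp), process=lambda path: None)
    print("first run executed ranks:", first)
    print("second run executed ranks:", second)
```

Because a marker is written only after a task's whole shard has been processed, a task that crashes halfway is simply run again on the next launch, while finished tasks are left alone.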

Links

GitHub Repository: Explore the source code, contribute, and stay updated on the project's development: huggingface/datatrove

Citation: If you use DataTrove in your research or projects, please consider citing it:

@misc{penedo2024datatrove,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Sasko, Mario and Wolf, Thomas},
  title = {DataTrove: large scale data processing},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/huggingface/datatrove}
}