Repository History

2 repositories tagged with data-pipelines

Unstructured: Open-Source Pre-Processing for Complex Document Data

The `unstructured` library is an open-source ETL toolkit that converts unstructured documents, such as PDFs, HTML pages, and Word files, into clean, structured data. It provides tools for ingesting and pre-processing these documents so that their content can be passed to language models and other downstream AI applications.

Analyzed Feb 10, 2026
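
To make "unstructured document in, structured data out" concrete, here is a minimal sketch using only the Python standard library. It is not the `unstructured` API: the `Element` class, the category labels, and `partition_html_sketch` are hypothetical names chosen for illustration, and the sketch assumes simple HTML where headings and paragraphs are not nested.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Element:
    """One structured piece of a document (hypothetical type for this sketch)."""
    category: str  # "Title" or "NarrativeText"; labels are assumptions
    text: str


class SimplePartitioner(HTMLParser):
    """Collects headings and paragraphs from HTML as typed elements."""

    TAG_TO_CATEGORY = {"h1": "Title", "h2": "Title", "p": "NarrativeText"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None  # category of the block being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_TO_CATEGORY:
            self._current = self.TAG_TO_CATEGORY[tag]
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside a recognized block.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag in self.TAG_TO_CATEGORY:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.elements.append(Element(self._current, text))
            self._current = None


def partition_html_sketch(html: str) -> list[Element]:
    """Turn raw HTML into a flat list of typed elements."""
    parser = SimplePartitioner()
    parser.feed(html)
    parser.close()
    return parser.elements
```

Calling `partition_html_sketch("<h1>Report</h1><p>First paragraph.</p>")` yields one `Title` element followed by one `NarrativeText` element, a shape that is straightforward to serialize or to chunk for a language model.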

DataTrove: Streamlining Large-Scale Data Processing for LLMs

DataTrove is a Python library for processing, filtering, and deduplicating text data at very large scale. It provides a collection of customizable, platform-agnostic pipeline blocks, which makes it well suited to preparing training data for large language models. Because the same pipeline can run in different execution environments, DataTrove replaces ad hoc scripting with efficient, reproducible data workflows.

Analyzed Jan 27, 2026
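
The idea of composable pipeline blocks for filtering and deduplication can be sketched with plain Python generators. This is an illustration of the pattern only, not DataTrove's API: `min_length_filter`, `exact_dedup`, and `run_pipeline` are hypothetical helpers, documents are assumed to be dicts with a `"text"` key, and deduplication here is exact matching on normalized text rather than any fuzzy method.

```python
import hashlib
from typing import Callable, Iterable, Iterator

Doc = dict  # assumed shape: {"text": "..."}
Step = Callable[[Iterable[Doc]], Iterator[Doc]]


def min_length_filter(min_chars: int) -> Step:
    """Build a block that drops documents shorter than min_chars."""
    def step(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return step


def exact_dedup() -> Step:
    """Build a block that keeps the first copy of each normalized text."""
    def step(docs: Iterable[Doc]) -> Iterator[Doc]:
        seen = set()
        for doc in docs:
            # Normalize case and whitespace before hashing.
            normalized = " ".join(doc["text"].lower().split())
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc
    return step


def run_pipeline(docs: Iterable[Doc], steps: list[Step]) -> list[Doc]:
    """Chain the blocks lazily, then materialize the result."""
    for step in steps:
        docs = step(docs)
    return list(docs)
```

Because each block only consumes and yields documents, blocks can be reordered or swapped without touching the others, which is the property that makes such pipelines reproducible across environments.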
