Repository History
Explore all analyzed open source repositories

Unstructured: Open-Source Pre-Processing for Complex Document Data
The `unstructured` library is an open-source ETL tool that converts complex, unstructured documents into clean, structured data. It provides components for ingesting and pre-processing document types such as PDFs, HTML, and Word files, so that raw content can be passed to language models and other AI applications in a usable format.
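
To make "unstructured document in, structured data out" concrete, here is a minimal sketch that uses only Python's standard library. It does not use the `unstructured` API; the `TAG_TO_TYPE` mapping, the `ElementExtractor` class, and the `html_to_elements` function are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to element categories. This is an
# assumption for illustration, not the category set used by `unstructured`.
TAG_TO_TYPE = {"h1": "Title", "h2": "Title", "p": "NarrativeText", "li": "ListItem"}


class ElementExtractor(HTMLParser):
    """Collect text from selected tags as {"type", "text"} records."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current_type = None  # category of the tag being read, if any
        self._buffer = []          # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_TYPE:
            self._current_type = TAG_TO_TYPE[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._current_type is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Only a closing tag of the tracked category ends the element;
        # inline tags such as </b> are ignored.
        if self._current_type is not None and TAG_TO_TYPE.get(tag) == self._current_type:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.elements.append({"type": self._current_type, "text": text})
            self._current_type = None


def html_to_elements(html):
    """Parse an HTML string into a flat list of typed text elements."""
    parser = ElementExtractor()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    page = "<h1>Report</h1><p>Hello <b>world</b>.</p><ul><li>one</li></ul>"
    for element in html_to_elements(page):
        print(element)
```

A real pre-processing library handles many more formats and edge cases; the sketch only shows the shape of the output, a flat list of typed text elements that downstream code can filter or chunk.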

DataTrove: Streamlining Large-Scale Data Processing for LLMs
DataTrove is a Python library for processing, filtering, and deduplicating text data at very large scale. It provides a collection of customizable, platform-agnostic pipeline blocks, which makes it well suited to preparing training data for large language models. With support for several execution environments, DataTrove replaces ad hoc scripting with efficient, reproducible data workflows.
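
The idea of composable pipeline blocks for filtering and deduplication can be sketched with plain Python generators. This is not DataTrove's API; `min_length_filter`, `exact_dedup`, and `run_pipeline` are hypothetical names, and exact-match hashing stands in for whatever deduplication strategy a real pipeline would use.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# A block takes a stream of documents and yields a (possibly smaller) stream.
Block = Callable[[Iterable[str]], Iterator[str]]


def min_length_filter(min_words: int) -> Block:
    """Build a block that drops documents with fewer than `min_words` words."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            if len(doc.split()) >= min_words:
                yield doc
    return block


def exact_dedup() -> Block:
    """Build a block that keeps the first copy of each document.

    Documents are compared after lowercasing and whitespace normalization,
    which is an assumption made for this sketch.
    """
    def block(docs: Iterable[str]) -> Iterator[str]:
        seen = set()
        for doc in docs:
            normalized = " ".join(doc.lower().split())
            key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                yield doc
    return block


def run_pipeline(docs: Iterable[str], blocks: List[Block]) -> List[str]:
    """Chain the blocks lazily, then materialize the surviving documents."""
    stream: Iterable[str] = docs
    for block in blocks:
        stream = block(stream)
    return list(stream)


if __name__ == "__main__":
    corpus = [
        "The quick brown fox",
        "the  quick brown FOX",
        "too short",
        "A different long document",
    ]
    print(run_pipeline(corpus, [min_length_filter(3), exact_dedup()]))
```

Because each block only consumes and yields a stream, blocks can be reordered or swapped without touching the others, which is the property that makes this style of pipeline reproducible.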