Data Prep Kit is an open-source project designed to accelerate unstructured data preparation for GenAI and LLM applications. It provides a set of modules and transforms to cleanse, transform, and enrich data for pre-training, fine-tuning, or instruct-tuning LLMs, or for building Retrieval Augmented Generation (RAG) applications. The kit scales from a single laptop to a data center, with Python, Ray, and Spark runtimes.