Unstructured: Open-Source Pre-Processing for Complex Document Data

Summary
The `unstructured` library is an open-source ETL solution designed to convert complex, unstructured documents into clean, structured data. It streamlines the data processing workflow for language models, offering tools for ingesting and pre-processing various document types like PDFs, HTML, and Word documents. This library simplifies the transformation of raw information into formats suitable for advanced AI applications.
Introduction
The `unstructured` library is an open-source ETL (Extract, Transform, Load) solution for converting complex, unstructured documents into clean, structured data. It provides tools for ingesting and pre-processing many document types, including PDFs, HTML files, and Word documents, so they are ready for use with large language models (LLMs) and other AI applications. Its modular functions and connectors are intended to simplify data ingestion and transformation into structured outputs.
Installation
Getting started with unstructured is straightforward, with several flexible installation options:
- Using Docker: For a containerized environment, you can pull the latest `unstructured` image and run it. This is ideal for quick setup without managing local dependencies.
      docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
      docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
      docker exec -it unstructured bash
- Installing with pip: The Python SDK can be installed to support all document types or specific ones.
  - For all document types:
        pip install "unstructured[all-docs]"
  - For basic text, HTML, XML, JSON, and Emails (no extra dependencies):
        pip install unstructured
  - For specific document types, e.g., Word and PowerPoint:
        pip install "unstructured[docx,pptx]"
  You may also need system dependencies such as `libmagic-dev`, `poppler-utils`, `tesseract-ocr`, and `libreoffice`, depending on the document types you plan to process.
- Local Development: If you plan to contribute or develop locally, `unstructured` uses `uv` for dependency management.
      curl -LsSf https://astral.sh/uv/install.sh | sh
      make install
Refer to the official documentation for detailed instructions and platform-specific guidance.
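As a quick sanity check before processing documents, the following minimal sketch reports which of the optional system tools are visible on the `PATH`. The executable names (`tesseract`, `pdftoppm`, `soffice`) are assumptions about what the `tesseract-ocr`, `poppler-utils`, and `libreoffice` packages typically install. `libmagic-dev` is a library rather than a command, so it is not checked here. The helper `find_system_tools` is hypothetical and not part of `unstructured`.

```python
import shutil

# Assumed mapping from system package to a command it typically provides.
# These executable names are assumptions; adjust them for your platform.
ASSUMED_COMMANDS = {
    "tesseract-ocr": "tesseract",
    "poppler-utils": "pdftoppm",
    "libreoffice": "soffice",
}


def find_system_tools(commands=ASSUMED_COMMANDS):
    """Return {package: resolved path or None} for each command looked up on PATH."""
    return {pkg: shutil.which(cmd) for pkg, cmd in commands.items()}


if __name__ == "__main__":
    for package, path in find_system_tools().items():
        status = path if path else "not found on PATH"
        print(f"{package}: {status}")
```

A missing entry does not necessarily mean a problem: which tools are needed depends on the document types you plan to process.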
Examples
The unstructured library simplifies document parsing with its partition function, which automatically detects the file type and routes it to the appropriate parser. Here's an example of how to partition a PDF document:
    from unstructured.partition.auto import partition
    # Assuming 'example-docs/layout-parser-paper.pdf' is available
    elements = partition("example-docs/layout-parser-paper.pdf")
    print("\n\n".join([str(el) for el in elements]))
This snippet prints the text of each extracted element, separated by blank lines. The `elements` list itself is the structured representation: each entry is a typed element such as a title or a paragraph, which makes the content easier to consume in further processing.
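To make use of the element types, it can help to group the output by category. The sketch below is a minimal illustration that relies only on each element exposing `category` and `text` attributes. The helper `group_by_category` is hypothetical (not part of the library), and the stand-in objects with made-up text merely imitate `partition` output so the snippet runs without the library or any documents.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_category(elements):
    """Group element text by category, keeping document order within each group."""
    groups = defaultdict(list)
    for el in elements:
        groups[el.category].append(el.text)
    return dict(groups)


# Stand-in objects imitating partition() output (made-up text, for illustration only).
sample = [
    SimpleNamespace(category="Title", text="First heading"),
    SimpleNamespace(category="NarrativeText", text="Some body text."),
    SimpleNamespace(category="Title", text="Second heading"),
]

grouped = group_by_category(sample)
print(grouped["Title"])  # ['First heading', 'Second heading']
```

With the library installed, the same function should work on the list returned by `partition(...)`, for example `group_by_category(elements)`.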
Why Use Unstructured?
`unstructured` is useful for anyone working with large volumes of diverse document data, especially in the context of AI and LLMs. Its key advantages include:
- Effortless Data Transformation: Converts complex, unstructured documents into clean, structured formats with minimal effort.
- LLM Optimization: Designed to prepare data for language models, which can improve the quality of downstream results.
- Broad Document Support: Handles a wide array of document types, from PDFs and Word documents to HTML and emails.
- Modular and Adaptable: Offers flexible components that can be integrated into various data pipelines and platforms.
- Open-Source Power: Benefits from community contributions and transparency, which support ongoing improvement.
- Enterprise-Grade Capabilities: Alongside the open-source library, an enterprise platform offers advanced features like chunking, embedding, and image/table enrichment for production-grade workflows.
Links
- GitHub Repository: Unstructured-IO/unstructured
- Official Documentation: docs.unstructured.io
- Company Website: unstructured.io
- Join on Slack: Unstructured Slack Community
- LinkedIn: Unstructured.io on LinkedIn