OmniParse: Ingest, Parse, and Optimize Data for GenAI Frameworks

Summary
OmniParse is a powerful platform designed to ingest, parse, and optimize any unstructured data, from documents to multimedia, into structured, actionable formats. It enhances compatibility with GenAI frameworks, preparing data for applications like RAG and fine-tuning. This tool simplifies the complex process of data preparation for AI, making it accessible and efficient.
Introduction
OmniParse is an innovative platform that streamlines the ingestion and parsing of diverse unstructured data, transforming it into structured, actionable formats optimized for Generative AI (GenAI) and Large Language Model (LLM) applications. Whether you're dealing with documents, tables, images, videos, audio files, or web pages, OmniParse prepares your data to be clean, structured, and ready for advanced AI applications such as Retrieval Augmented Generation (RAG) and fine-tuning.
This project aims to provide a comprehensive solution for data preparation, addressing the inherent challenges of processing data that comes in various shapes and sizes. By offering a unified ingestion and parsing platform, OmniParse ensures that your data is always in the best possible state for your AI workflows.
Installation
Important Note: The OmniParse server currently operates exclusively on Linux-based systems due to specific dependencies and configurations. It is not compatible with Windows or macOS.
To get started with OmniParse, follow these steps:
Clone the Repository:

```shell
git clone https://github.com/adithya-s-k/omniparse
cd omniparse
```

Create a Virtual Environment:

```shell
conda create -n omniparse-venv python=3.10
conda activate omniparse-venv
```

Install Dependencies:

```shell
poetry install
# or
pip install -e .
```
Docker Deployment
For an easier deployment, OmniParse can be run using Docker:
Pull the Docker Image:

```shell
docker pull savatar101/omniparse:0.1
```

Run the Docker Container:

With GPU support:

```shell
docker run --gpus all -p 8000:8000 savatar101/omniparse:0.1
```

Without GPU support:

```shell
docker run -p 8000:8000 savatar101/omniparse:0.1
```
Alternatively, you can build the Docker image locally:
Build the Docker Image:

```shell
docker build -t omniparse .
```

Run the Docker Container:

With GPU support:

```shell
docker run --gpus all -p 8000:8000 omniparse
```

Without GPU support:

```shell
docker run -p 8000:8000 omniparse
```
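For repeatable deployments, the `docker run` variants above can also be captured in a Compose file. The following is a hypothetical sketch (no Compose file ships with the repository); the image tag, port, and GPU reservation mirror the commands shown above:

```yaml
# docker-compose.yml -- hypothetical sketch, not part of the OmniParse repo
services:
  omniparse:
    image: savatar101/omniparse:0.1
    ports:
      - "8000:8000"
    # Delete this deploy block to run without GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

With this file in place, `docker compose up` replaces the manual `docker run` invocation.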
Examples
Once installed, you can run the OmniParse server and interact with its API.
Running the Server
Start the server with desired model support:
```shell
python server.py --host 0.0.0.0 --port 8000 --documents --media --web
```

- --documents: Loads models for parsing documents (Surya OCR and Florence-2).
- --media: Loads the Whisper model for audio and video transcription.
- --web: Sets up a Selenium crawler for web parsing.
Downloading Models (Optional)
You can pre-download models before starting the server:
```shell
python download.py --documents --media --web
```
Supported Data Types
OmniParse supports a wide range of data types:
| Type | Supported Extensions |
|---|---|
| Documents | .doc, .docx, .pdf, .ppt, .pptx |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .heic |
| Video | .mp4, .mkv, .avi, .mov |
| Audio | .mp3, .wav, .aac |
| Web | dynamic webpages, http://<anything>.com |
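Because each media family is handled by a different server endpoint, a client typically routes a file to the right parser by its extension. The helper below is a minimal sketch derived from the table above; `/parse_document` and `/parse_media/image` appear in the API section below, while `/parse_media/video` and `/parse_media/audio` are assumed endpoint names for illustration:

```python
from pathlib import Path

# Extension-to-endpoint routing sketch based on the table above.
# /parse_document and /parse_media/image are documented endpoints;
# /parse_media/video and /parse_media/audio are assumed names.
ENDPOINT_BY_EXTENSION = {
    **{ext: "/parse_document" for ext in (".doc", ".docx", ".pdf", ".ppt", ".pptx")},
    **{ext: "/parse_media/image" for ext in (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic")},
    **{ext: "/parse_media/video" for ext in (".mp4", ".mkv", ".avi", ".mov")},  # assumed
    **{ext: "/parse_media/audio" for ext in (".mp3", ".wav", ".aac")},          # assumed
}

def endpoint_for(filename: str) -> str:
    """Return the parsing endpoint for a file, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    try:
        return ENDPOINT_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext!r}")
```

For example, `endpoint_for("report.pdf")` yields `/parse_document`, so the client knows which upload route to use.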
API Endpoints
OmniParse provides various API endpoints for different parsing tasks. Here are a few examples:
Parse Any Document:

```shell
curl -X POST -F "file=@/path/to/document" http://localhost:8000/parse_document
```

Parse Image:

```shell
curl -X POST -F "file=@/path/to/image.jpg" http://localhost:8000/parse_media/image
```

Process Image with Task (e.g., Captioning):

```shell
curl -X POST -F "image=@/path/to/image.jpg" -F "task=Caption" -F "prompt=Optional prompt" http://localhost:8000/parse_media/process_image
```

Parse Website:

```shell
curl -X POST -H "Content-Type: application/json" -d '{"url": "https://example.com"}' http://localhost:8000/parse_website
```
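The same calls can be made from Python. The sketch below mirrors the `parse_website` curl command using only the standard library; the server address and endpoint path come from this section, and error handling is deliberately minimal:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default host/port used in this section

def parse_website(url: str) -> dict:
    """POST a JSON payload to /parse_website, mirroring the curl example above."""
    payload = json.dumps({"url": url}).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/parse_website",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `parse_website("https://example.com")` against a running OmniParse server (see "Running the Server") returns the server's JSON response.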
Why Use OmniParse?
OmniParse stands out as a crucial tool for anyone working with GenAI applications due to its robust features and clear mission:
- Local Processing: Operates completely locally, eliminating reliance on external APIs and enhancing data privacy and security.
- Resource Efficient: Designed to fit within a T4 GPU, making it accessible for many setups.
- Broad Compatibility: Supports approximately 20 different file types, ensuring versatility across various data sources.
- High-Quality Output: Converts documents, multimedia, and web pages into structured, high-quality markdown, ideal for AI consumption.
- Comprehensive Extraction: Features include advanced table extraction, image extraction and captioning, audio and video transcription, and robust web page crawling capabilities.
- Easy Deployment: Can be deployed easily using Docker and SkyPilot, and runs well in Google Colab environments.
- Interactive UI: Comes with an interactive user interface powered by Gradio for ease of use.
OmniParse addresses the significant challenge of unifying disparate data formats into a single, AI-ready structure, making it an invaluable asset for developing and deploying GenAI solutions.
Links
For more detailed information, contributions, or to explore the codebase, visit the official GitHub repository: https://github.com/adithya-s-k/omniparse