OmniParse: Ingest, Parse, and Optimize Data for GenAI Frameworks

Summary
OmniParse is a powerful platform designed to ingest, parse, and optimize any unstructured data, from documents to multimedia, into structured, actionable formats. It enhances compatibility with GenAI frameworks, preparing data for applications like RAG and fine-tuning. This tool simplifies the complex process of data preparation for AI, making it accessible and efficient.
Introduction
OmniParse is an innovative platform that streamlines the ingestion and parsing of diverse unstructured data, transforming it into structured, actionable formats optimized for Generative AI (GenAI) and Large Language Model (LLM) applications. Whether you're dealing with documents, tables, images, videos, audio files, or web pages, OmniParse prepares your data to be clean, structured, and ready for advanced AI applications such as Retrieval Augmented Generation (RAG) and fine-tuning.
This project aims to provide a comprehensive solution for data preparation, addressing the inherent challenges of processing data that comes in various shapes and sizes. By offering a unified ingestion and parsing platform, OmniParse ensures that your data is always in the best possible state for your AI workflows.
Installation
Important Note: The OmniParse server currently operates exclusively on Linux-based systems due to specific dependencies and configurations. It is not compatible with Windows or macOS.
To get started with OmniParse, follow these steps:
Clone the Repository:

```shell
git clone https://github.com/adithya-s-k/omniparse
cd omniparse
```

Create a Virtual Environment:

```shell
conda create -n omniparse-venv python=3.10
conda activate omniparse-venv
```

Install Dependencies:

```shell
poetry install
# or
pip install -e .
```
Docker Deployment
For an easier deployment, OmniParse can be run using Docker:
Pull the Docker Image:

```shell
docker pull savatar101/omniparse:0.1
```

Run the Docker Container:

With GPU support:

```shell
docker run --gpus all -p 8000:8000 savatar101/omniparse:0.1
```

Without GPU support:

```shell
docker run -p 8000:8000 savatar101/omniparse:0.1
```
Alternatively, you can build the Docker image locally:
Build the Docker Image:

```shell
docker build -t omniparse .
```

Run the Docker Container:

With GPU support:

```shell
docker run --gpus all -p 8000:8000 omniparse
```

Without GPU support:

```shell
docker run -p 8000:8000 omniparse
```
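For repeatable deployments, the `docker run` variants above can also be captured in a Compose file. The following is a hypothetical sketch (no Compose file ships with the repository); the image tag, port, and GPU reservation mirror the commands shown above:

```yaml
# docker-compose.yml -- hypothetical sketch, not part of the OmniParse repo
services:
  omniparse:
    image: savatar101/omniparse:0.1
    ports:
      - "8000:8000"
    # Delete this deploy block to run without GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

With this file in place, `docker compose up` replaces the manual `docker run` invocation.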
Examples
Once installed, you can run the OmniParse server and interact with its API.
Running the Server
Start the server with desired model support:
```shell
python server.py --host 0.0.0.0 --port 8000 --documents --media --web
```

- --documents: Loads models for parsing documents (Surya OCR and Florence-2).
- --media: Loads the Whisper model for audio and video transcription.
- --web: Sets up a Selenium crawler for web parsing.
Downloading Models (Optional)
You can pre-download models before starting the server:
```shell
python download.py --documents --media --web
```
Supported Data Types
OmniParse supports a wide range of data types:
| Type | Supported Extensions |
|---|---|
| Documents | .doc, .docx, .pdf, .ppt, .pptx |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .heic |
| Video | .mp4, .mkv, .avi, .mov |
| Audio | .mp3, .wav, .aac |
| Web | dynamic webpages, http://<anything>.com |
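Because each media family is handled by a different server endpoint, a client typically routes a file to the right parser by its extension. The helper below is a minimal sketch derived from the table above; `/parse_document` and `/parse_media/image` appear in the API section below, while `/parse_media/video` and `/parse_media/audio` are assumed endpoint names for illustration:

```python
from pathlib import Path

# Extension-to-endpoint routing sketch based on the table above.
# /parse_document and /parse_media/image are documented endpoints;
# /parse_media/video and /parse_media/audio are assumed names.
ENDPOINT_BY_EXTENSION = {
    **{ext: "/parse_document" for ext in (".doc", ".docx", ".pdf", ".ppt", ".pptx")},
    **{ext: "/parse_media/image" for ext in (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic")},
    **{ext: "/parse_media/video" for ext in (".mp4", ".mkv", ".avi", ".mov")},  # assumed
    **{ext: "/parse_media/audio" for ext in (".mp3", ".wav", ".aac")},          # assumed
}

def endpoint_for(filename: str) -> str:
    """Return the parsing endpoint for a file, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    try:
        return ENDPOINT_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext!r}")
```

For example, `endpoint_for("report.pdf")` yields `/parse_document`, so the client knows which upload route to use.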
API Endpoints
OmniParse provides various API endpoints for different parsing tasks. Here are a few examples:
Parse Any Document:

```shell
curl -X POST -F "file=@/path/to/document" http://localhost:8000/parse_document
```

Parse Image:

```shell
curl -X POST -F "file=@/path/to/image.jpg" http://localhost:8000/parse_media/image
```

Process Image with Task (e.g., Captioning):

```shell
curl -X POST -F "image=@/path/to/image.jpg" -F "task=Caption" -F "prompt=Optional prompt" http://localhost:8000/parse_media/process_image
```

Parse Website:

```shell
curl -X POST -H "Content-Type: application/json" -d '{"url": "https://example.com"}' http://localhost:8000/parse_website
```
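The same calls can be made from Python. The sketch below mirrors the `parse_website` curl command using only the standard library; the server address and endpoint path come from this section, and error handling is deliberately minimal:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default host/port used in this section

def parse_website(url: str) -> dict:
    """POST a JSON payload to /parse_website, mirroring the curl example above."""
    payload = json.dumps({"url": url}).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/parse_website",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `parse_website("https://example.com")` against a running OmniParse server (see "Running the Server") returns the server's JSON response.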
Why Use OmniParse?
OmniParse stands out as a crucial tool for anyone working with GenAI applications due to its robust features and clear mission:
- Local Processing: Operates completely locally, eliminating reliance on external APIs and enhancing data privacy and security.
- Resource Efficient: Designed to fit within a T4 GPU, making it accessible for many setups.
- Broad Compatibility: Supports approximately 20 different file types, ensuring versatility across various data sources.
- High-Quality Output: Converts documents, multimedia, and web pages into structured, high-quality markdown, ideal for AI consumption.
- Comprehensive Extraction: Features include advanced table extraction, image extraction and captioning, audio and video transcription, and robust web page crawling capabilities.
- Easy Deployment: Can be deployed easily using Docker and SkyPilot, and runs well in Google Colab environments.
- Interactive UI: Comes with an interactive user interface powered by Gradio for ease of use.
OmniParse addresses the significant challenge of unifying disparate data formats into a single, AI-ready structure, making it an invaluable asset for developing and deploying GenAI solutions.
Links
For more detailed information, contributions, or to explore the codebase, visit the official GitHub repository: https://github.com/adithya-s-k/omniparse