E2M: Convert Various File Types to Markdown for RAG and LLM Training

Introduction

E2M, short for "Everything to Markdown," is a Python library that converts many file types into clean, structured Markdown. Its goal is to supply high-quality data for Retrieval-Augmented Generation (RAG) systems and for training or fine-tuning large language models (LLMs).

The project employs a modular parser-converter architecture. Parsers are responsible for extracting raw text or image data from diverse sources like PDF, DOCX, HTML, URLs, and even audio files (MP3, M4A). Following this, converters transform the extracted data into the desired Markdown output, ensuring consistency and readiness for AI applications. E2M supports a wide array of input formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.
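
To make the parser side of this architecture concrete, here is a minimal sketch of how an input could be routed to a parser family based on the supported formats listed above. The table, the family names, and the `pick_parser_kind` helper are hypothetical illustrations, not E2M internals; the library's own dispatch may work differently.

```python
from pathlib import Path

# Hypothetical lookup table for illustration only. The extensions come from
# the supported-format list above; the grouping into "kinds" is an assumption.
PARSER_KIND_BY_EXTENSION = {
    ".doc": "doc",
    ".docx": "docx",
    ".epub": "epub",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".ppt": "ppt",
    ".pptx": "pptx",
    ".mp3": "voice",
    ".m4a": "voice",
}


def pick_parser_kind(source: str) -> str:
    """Return the (hypothetical) parser family that would handle `source`."""
    # Web addresses are identified by scheme, not by file extension,
    # so they are checked before the extension lookup.
    if source.lower().startswith(("http://", "https://")):
        return "url"
    suffix = Path(source).suffix.lower()
    try:
        return PARSER_KIND_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported input: {source!r}") from None
```

A caller would use the returned kind to choose which parser class to construct, for example a PDF parser for `"pdf"` or a voice parser for `"voice"`.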

Installation

Getting started with E2M is straightforward. Follow these steps to set up your environment and install the library:

First, create a dedicated Conda environment and activate it:

conda create -n e2m python=3.10
conda activate e2m

Ensure your pip is up to date:

pip install --upgrade pip

Finally, install E2M. The recommended method is to install directly from the Git repository:

pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple

Alternatively, you can install from PyPI:

pip install --upgrade wisup_e2m
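
After installing, it can be useful to confirm that the package is visible to the active environment. The following is a small sketch using only the Python standard library. It assumes the installed distribution is named `wisup_e2m`, as in the PyPI command above; the `installed_version` helper is a hypothetical name introduced here.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "wisup_e2m") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the active environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"wisup_e2m {version}" if version else "wisup_e2m is not installed")
```

If the script reports that the package is missing, check that the `e2m` Conda environment is activated before running it.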

Examples

E2M offers intuitive APIs for both parsing and converting. Here are a few quick examples to demonstrate its usage:

PDF Parser

from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker") # pdf engines: marker, unstructured, surya_layout
pdf_data = parser.parse(pdf_path)
print(pdf_data.text)

URL Parser

from wisup_e2m import UrlParser

url = "https://www.example.com"
parser = UrlParser(engine="jina") # url engines: jina, firecrawl, unstructured
url_data = parser.parse(url)
print(url_data.text)

Voice Parser

from wisup_e2m import VoiceParser

voice_path = "./test.mp3"
parser = VoiceParser(
    engine="openai_whisper_local",  # voice engines: openai_whisper_api, openai_whisper_local
    model="large",  # available models: https://github.com/openai/whisper#available-models-and-languages
)

voice_data = parser.parse(voice_path)
print(voice_data.text)

Text Converter

from wisup_e2m import TextConverter

text = "Parsed text data from any parser"
converter = TextConverter(
    engine="litellm",  # text engines: litellm
    model="deepseek/deepseek-chat",
    api_key="your api key",
    base_url="your base url",
)
text_data = converter.convert(text)
print(text_data)

For more advanced usage, including an integrated E2MParser and E2MConverter with config.yaml support, refer to the official documentation.
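
The parser and converter examples above can also be chained by hand. The sketch below shows one way to do that. It assumes only what the examples show: a parser exposes `parse(source)` and returns an object with a `.text` attribute, and a converter exposes `convert(text)`. Treating the converter result as a string is an assumption, and `to_markdown` plus the stand-in classes are hypothetical names used so the snippet runs without E2M, model weights, or API keys.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class Parser(Protocol):
    def parse(self, source: str) -> Any: ...


class Converter(Protocol):
    def convert(self, text: str) -> str: ...


def to_markdown(parser: Parser, converter: Converter, source: str) -> str:
    """Parse `source`, then pass the extracted text to the converter."""
    parsed = parser.parse(source)
    return converter.convert(parsed.text)


# Stand-ins for illustration only; they mimic the shape of the real objects.
@dataclass
class FakeParsed:
    text: str


class FakeParser:
    def parse(self, source: str) -> FakeParsed:
        return FakeParsed(text=f"raw text from {source}")


class FakeConverter:
    def convert(self, text: str) -> str:
        return f"# Document\n\n{text}"


markdown = to_markdown(FakeParser(), FakeConverter(), "./test.pdf")
print(markdown)
```

With E2M installed, the stand-ins could be swapped for a `PdfParser` and a configured `TextConverter` as constructed in the examples above.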

Why Use E2M?

E2M is a practical choice for data preparation, especially in AI-driven projects, for several reasons:

Broad File Support: It handles a comprehensive range of document, web, and audio formats, centralizing your data conversion needs.
Optimized for AI: Specifically designed to produce high-quality, clean Markdown output, making it ideal for RAG systems and LLM training datasets.
Flexible Architecture: The distinct parser and converter components allow for easy customization and integration of different engines and models.
Ease of Integration: With simple installation and clear API examples, E2M can be quickly incorporated into existing workflows.
Open Source: Licensed under Apache-2.0, E2M offers a transparent and community-driven solution for your conversion tasks.
