# E2M: Convert Various File Types to Markdown for RAG and LLM Training

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Source: osrepos.com
Repository profile: https://osrepos.com/repo/wisupai-e2m
Generated for open source discovery and AI-assisted research.

E2M is a Python library designed to convert diverse file types, including documents, web pages, and audio, into Markdown format. It features a robust parser-converter architecture, making it highly flexible and easy to integrate. This tool is specifically aimed at generating high-quality data for Retrieval-Augmented Generation (RAG) and large language model training.

GitHub: https://github.com/wisupai/e2m

## Topics

- e2m
- markdown-conversion
- pdf-to-markdown
- llm-data
- rag-data
- python
- document-processing
- text-extraction

## Repository Information

Last analyzed by OSRepos: Wed Dec 24 2025 00:01:01 GMT+0000 (Western European Standard Time)
Detail views: 3
GitHub clicks: 3

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Introduction
E2M, short for "Everything to Markdown," is a Python library that converts a range of file types into clean, structured Markdown. Its goal is to produce high-quality data for Retrieval-Augmented Generation (RAG) systems and for training or fine-tuning large language models (LLMs).

The project uses a modular parser-converter architecture. Parsers extract raw text or image data from sources such as PDF, DOCX, HTML, URLs, and audio files (MP3, M4A). Converters then transform the extracted data into Markdown, so the output is consistent regardless of the input type. Supported input formats are doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.
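
The two-stage split described above can be pictured with a small, self-contained sketch. Every name in it (`ParsedData`, `PlainTextParser`, `HeadingConverter`, `run_pipeline`) is a hypothetical stand-in written for illustration, not E2M's actual class hierarchy, and the toy converter applies a fixed rule where E2M's `TextConverter` example relies on an LLM engine.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ParsedData:
    """Raw content extracted by a parser (hypothetical stand-in)."""
    text: str


class Parser(Protocol):
    def parse(self, source: str) -> ParsedData: ...


class Converter(Protocol):
    def convert(self, text: str) -> str: ...


class PlainTextParser:
    """Toy parser: treats the source string itself as the document."""

    def parse(self, source: str) -> ParsedData:
        # Normalize line endings and trim outer whitespace.
        return ParsedData(text=source.replace("\r\n", "\n").strip())


class HeadingConverter:
    """Toy converter: promotes the first line to a Markdown heading."""

    def convert(self, text: str) -> str:
        first, _, rest = text.partition("\n")
        body = rest.strip()
        heading = f"# {first.strip()}"
        return f"{heading}\n\n{body}" if body else heading


def run_pipeline(parser: Parser, converter: Converter, source: str) -> str:
    """Stage 1 extracts raw text; stage 2 turns it into Markdown."""
    parsed = parser.parse(source)
    return converter.convert(parsed.text)


markdown = run_pipeline(PlainTextParser(), HeadingConverter(), "Title\r\nBody text.\r\n")
print(markdown)
```

Because the pipeline depends only on the `parse` and `convert` methods, either stage can be swapped without touching the other, which is the flexibility the architecture is meant to provide.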

## Installation
Set up an environment and install the library with the following steps.

First, create a dedicated Conda environment and activate it:
```bash
conda create -n e2m python=3.10
conda activate e2m
```


Ensure your `pip` is up to date:
```bash
pip install --upgrade pip
```


Finally, install E2M. The recommended method is to install directly from the GitHub repository:
```bash
pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple
```

Alternatively, you can install from PyPI:
```bash
pip install --upgrade wisup_e2m
```
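
After installing, it can help to confirm that the package is visible from the active environment before running anything heavier. A minimal sketch using only the standard library; the helper name `is_installed` is an illustrative assumption, while `wisup_e2m` is the import name used in the examples in this document.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("wisup_e2m"):
    print("wisup_e2m is importable")
else:
    print("wisup_e2m not found; check that the right environment is active")
```

`find_spec` locates the module without importing it, so the check stays fast even if the package pulls in large dependencies.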


## Examples
E2M exposes separate APIs for parsing and converting. The following examples show basic usage:

### PDF Parser
```python
from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker")  # pdf engines: marker, unstructured, surya_layout
pdf_data = parser.parse(pdf_path)
print(pdf_data.text)
```


### URL Parser
```python
from wisup_e2m import UrlParser

url = "https://www.example.com"
parser = UrlParser(engine="jina")  # url engines: jina, firecrawl, unstructured
url_data = parser.parse(url)
print(url_data.text)
```


### Voice Parser
```python
from wisup_e2m import VoiceParser

voice_path = "./test.mp3"
parser = VoiceParser(
    engine="openai_whisper_local",  # voice engines: openai_whisper_api, openai_whisper_local
    model="large",  # available models: https://github.com/openai/whisper#available-models-and-languages
)

voice_data = parser.parse(voice_path)
print(voice_data.text)
```


### Text Converter
```python
from wisup_e2m import TextConverter

text = "Parsed text data from any parser"
converter = TextConverter(
    engine="litellm",  # text engines: litellm
    model="deepseek/deepseek-chat",
    api_key="your api key",
    base_url="your base url",
)
text_data = converter.convert(text)
print(text_data)
```
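
Hard-coded credentials like the ones in the converter snippet are fine for a quick test, but keys are usually better read from the environment. A minimal sketch, assuming hypothetical variable names `E2M_API_KEY`, `E2M_BASE_URL`, and `E2M_MODEL` (E2M itself does not define them); the returned dictionary reuses the keyword arguments shown in the `TextConverter` call.

```python
import os


def converter_settings(env=None) -> dict:
    """Build TextConverter keyword arguments from environment variables.

    The variable names are illustrative assumptions, not part of E2M.
    """
    env = os.environ if env is None else env
    api_key = env.get("E2M_API_KEY")
    if not api_key:
        raise RuntimeError("E2M_API_KEY is not set")
    settings = {
        "engine": "litellm",
        "model": env.get("E2M_MODEL", "deepseek/deepseek-chat"),
        "api_key": api_key,
    }
    # Only pass base_url when one was provided.
    base_url = env.get("E2M_BASE_URL")
    if base_url:
        settings["base_url"] = base_url
    return settings


# Demo with an explicit mapping so the snippet runs without real credentials.
print(converter_settings({"E2M_API_KEY": "sk-demo"}))
```

With the library installed, the result could be passed as `TextConverter(**converter_settings())`. Whether `base_url` may be omitted is an assumption to check against the E2M documentation.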


For more advanced usage, including an integrated `E2MParser` and `E2MConverter` with `config.yaml` support, refer to the official documentation.
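
Short of the integrated classes, the individual parsers and converters can also be chained by hand. The sketch below relies only on the two calls used in the examples in this section: `parser.parse(source)` returning an object with a `.text` attribute, and `converter.convert(text)`. The helper name `to_markdown` and the stub classes are hypothetical and exist only so the snippet runs without E2M installed.

```python
from types import SimpleNamespace


def to_markdown(parser, converter, source: str):
    """Parse `source`, then hand the extracted text to the converter."""
    parsed = parser.parse(source)
    if not parsed.text:
        raise ValueError(f"no text extracted from {source!r}")
    return converter.convert(parsed.text)


class StubParser:
    """Stand-in for a real parser: returns an object with `.text`."""

    def parse(self, source: str):
        return SimpleNamespace(text=f"raw text from {source}")


class StubConverter:
    """Stand-in for a real converter: wraps the text under a heading."""

    def convert(self, text: str) -> str:
        return f"## Extracted\n\n{text}"


result = to_markdown(StubParser(), StubConverter(), "./test.pdf")
print(result)
```

With E2M installed, the stubs would be replaced by, for instance, `PdfParser(engine="marker")` and a configured `TextConverter`.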

## Why Use E2M?
E2M is aimed at data preparation for AI-driven projects and offers the following:

*   **Broad File Support**: It handles a comprehensive range of document, web, and audio formats, centralizing your data conversion needs.
*   **Optimized for AI**: Specifically designed to produce high-quality, clean Markdown output, making it ideal for RAG systems and LLM training datasets.
*   **Flexible Architecture**: The distinct parser and converter components allow for easy customization and integration of different engines and models.
*   **Ease of Integration**: With simple installation and clear API examples, E2M can be quickly incorporated into existing workflows.
*   **Open Source**: Licensed under Apache-2.0, E2M offers a transparent and community-driven solution for your conversion tasks.

## Links
*   **GitHub Repository**: [https://github.com/wisupai/e2m](https://github.com/wisupai/e2m)
*   **PyPI Package**: [https://pypi.org/project/wisup_e2m/](https://pypi.org/project/wisup_e2m/)
*   **License**: [https://github.com/wisupai/e2m/blob/main/LICENSE](https://github.com/wisupai/e2m/blob/main/LICENSE)