{"name":"E2M: Convert Various File Types to Markdown for RAG and LLM Training","description":"E2M is a Python library designed to convert diverse file types, including documents, web pages, and audio, into Markdown format. It features a robust parser-converter architecture, making it highly flexible and easy to integrate. This tool is specifically aimed at generating high-quality data for Retrieval-Augmented Generation (RAG) and large language model training.","github":"https://github.com/wisupai/e2m","url":"https://osrepos.com/repo/wisupai-e2m","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/wisupai-e2m","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/wisupai-e2m.md","json":"https://osrepos.com/repo/wisupai-e2m.json","topics":["e2m","markdown-conversion","pdf-to-markdown","llm-data","rag-data","python","document-processing","text-extraction"],"keywords":["e2m","markdown-conversion","pdf-to-markdown","llm-data","rag-data","python","document-processing","text-extraction"],"stars":null,"summary":"E2M is a Python library designed to convert diverse file types, including documents, web pages, and audio, into Markdown format. It features a robust parser-converter architecture, making it highly flexible and easy to integrate. This tool is specifically aimed at generating high-quality data for Retrieval-Augmented Generation (RAG) and large language model training.","content":"## Introduction\nE2M, short for \"Everything to Markdown,\" is a powerful Python library designed to streamline the conversion of various file types into a clean, structured Markdown format. 
Its core mission is to provide high-quality data for Retrieval-Augmented Generation (RAG) systems and for training or fine-tuning large language models (LLMs).\n\nThe project uses a modular parser-converter architecture. Parsers extract raw text or image data from sources such as PDF, DOCX, HTML, URLs, and audio files (MP3, M4A). Converters then transform the extracted data into Markdown, giving consistent output that is ready for AI applications. E2M supports the following input formats: doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.\n\n## Installation\nFollow these steps to set up your environment and install the library:\n\nFirst, create a dedicated Conda environment and activate it:\n```bash\nconda create -n e2m python=3.10\nconda activate e2m\n```\n\nEnsure your `pip` is up to date:\n```bash\npip install --upgrade pip\n```\n\nFinally, install E2M. The recommended method is to install directly from the Git repository:\n```bash\npip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple\n```\n\nAlternatively, you can install from PyPI:\n```bash\npip install --upgrade wisup_e2m\n```\n\n## Examples\nE2M offers intuitive APIs for both parsing and converting. 
Here are a few quick examples to demonstrate its usage:\n\n### PDF Parser\n```python\nfrom wisup_e2m import PdfParser\n\npdf_path = \"./test.pdf\"\nparser = PdfParser(engine=\"marker\")  # pdf engines: marker, unstructured, surya_layout\npdf_data = parser.parse(pdf_path)\nprint(pdf_data.text)\n```\n\n### URL Parser\n```python\nfrom wisup_e2m import UrlParser\n\nurl = \"https://www.example.com\"\nparser = UrlParser(engine=\"jina\")  # url engines: jina, firecrawl, unstructured\nurl_data = parser.parse(url)\nprint(url_data.text)\n```\n\n### Voice Parser\n```python\nfrom wisup_e2m import VoiceParser\n\nvoice_path = \"./test.mp3\"\nparser = VoiceParser(\n    engine=\"openai_whisper_local\",  # voice engines: openai_whisper_api, openai_whisper_local\n    model=\"large\",  # available models: https://github.com/openai/whisper#available-models-and-languages\n)\n\nvoice_data = parser.parse(voice_path)\nprint(voice_data.text)\n```\n\n### Text Converter\n```python\nfrom wisup_e2m import TextConverter\n\ntext = \"Parsed text data from any parser\"\nconverter = TextConverter(\n    engine=\"litellm\",  # text engines: litellm\n    model=\"deepseek/deepseek-chat\",\n    api_key=\"your api key\",\n    base_url=\"your base url\",\n)\ntext_data = converter.convert(text)\nprint(text_data)\n```\n\nFor more advanced usage, including an integrated `E2MParser` and `E2MConverter` with `config.yaml` support, refer to the official documentation.\n\n## Why Use E2M?\nE2M is a versatile tool for data preparation, especially for AI-driven projects.\n\n*   **Broad File Support**: It handles a wide range of document, web, and audio formats, so conversion can be handled in one place.\n*   **Optimized for AI**: It is designed to produce clean, high-quality Markdown output suited to RAG systems and LLM training datasets.\n*   **Flexible Architecture**: The separate parser and converter components make it easy to swap in different engines and models.\n*   **Ease of 
Integration**: With simple installation and clear API examples, E2M can be quickly incorporated into existing workflows.\n*   **Open Source**: Licensed under Apache-2.0, E2M offers a transparent and community-driven solution for your conversion tasks.\n\n## Links\n*   **GitHub Repository**: [https://github.com/wisupai/e2m](https://github.com/wisupai/e2m)\n*   **PyPI Package**: [https://pypi.org/project/wisup_e2m/](https://pypi.org/project/wisup_e2m/)\n*   **License**: [https://github.com/wisupai/e2m/blob/main/LICENSE](https://github.com/wisupai/e2m/blob/main/LICENSE)","metrics":{"detailViews":3,"githubClicks":3},"dates":{"published":null,"modified":"2025-12-24T00:01:01.000Z"}}