{"name":"text-extract-api: Advanced Document Extraction, OCR, and PII Removal with LLMs","description":"text-extract-api is a powerful API designed for extracting and parsing text from various document formats, including PDF, Word, and PPTX. It utilizes modern OCRs and Ollama-supported LLMs for highly accurate text extraction, PII removal, and conversion to structured JSON or Markdown, all while maintaining data privacy through its self-hosted architecture.","github":"https://github.com/CatchTheTornado/text-extract-api","url":"https://osrepos.com/repo/catchthetornado-text-extract-api","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/catchthetornado-text-extract-api","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/catchthetornado-text-extract-api.md","json":"https://osrepos.com/repo/catchthetornado-text-extract-api.json","topics":["anonymization","api","document processing","llm","ocr","pdf","pii","python"],"keywords":["anonymization","api","document processing","llm","ocr","pdf","pii","python"],"stars":null,"summary":"text-extract-api is a powerful API designed for extracting and parsing text from various document formats, including PDF, Word, and PPTX. It utilizes modern OCRs and Ollama-supported LLMs for highly accurate text extraction, PII removal, and conversion to structured JSON or Markdown, all while maintaining data privacy through its self-hosted architecture.","content":"## Introduction\n\nThe `text-extract-api` is a robust and privacy-focused solution for advanced document processing. Built with FastAPI, Celery, and Redis, it offers state-of-the-art OCR capabilities combined with Ollama-supported Large Language Models (LLMs) to extract, parse, and transform content from various document types like PDFs, Word, and PPTX files. 
Because it is self-hosted, no data leaves your environment, which makes it well suited to processing sensitive information.\n\n## Installation\n\nGetting `text-extract-api` up and running is straightforward, whether you prefer a native setup or Docker.\n\n### Prerequisites\n\nBefore you begin, ensure you have:\n\n*   [Ollama](https://ollama.com/download)\n*   [Docker](https://www.docker.com/products/docker-desktop/)\n\n### Clone the Repository\n\nStart by cloning the official repository:\n\n```sh\ngit clone https://github.com/CatchTheTornado/text-extract-api.git\ncd text-extract-api\n```\n\n### Local Setup with Makefile\n\nFor a quick local setup, you can use the provided `Makefile`:\n\n```bash\nDISABLE_VENV=1 make install\nDISABLE_VENV=1 make run\n```\n\n### Docker Setup\n\nFor containerized deployment, use Docker Compose. First, copy the example environment file:\n\n```bash\ncp .env.example .env\n```\n\nThen, build and run the containers:\n\n```bash\ndocker-compose up --build\n```\n\nFor GPU support, use:\n\n```bash\ndocker-compose -f docker-compose.gpu.yml -p text-extract-api-gpu up --build\n```\n\nRefer to the official [README](https://github.com/CatchTheTornado/text-extract-api#getting-started) for detailed manual installation steps and specific dependencies for different operating systems.\n\n## Examples\n\n`text-extract-api` ships with a CLI tool for working with its features. Here are a few examples to get you started:\n\n### Pull LLM Models\n\nBefore using LLM features, pull the necessary models:\n\n```bash\npython client/cli.py llm_pull --model llama3.1\npython client/cli.py llm_pull --model llama3.2-vision\n```\n\n### Convert Document to Markdown or JSON\n\nUpload a PDF file for OCR processing and conversion. 
You can specify a prompt for LLM processing and even save the result to disk.\n\n```bash\npython client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en --storage_filename \"reports/{Y}/{file_name}-{Y}-{mm}-{dd}.md\"\n```\n\nThis command processes an MRI report, removes PII, and saves the output as a Markdown file, demonstrating both OCR and LLM capabilities. Screenshots in the repository's README illustrate converting MRI reports to Markdown and JSON, and invoices with PII removal.\n\n### Get OCR Result by Task ID\n\nAfter uploading a file, you receive a task ID. You can retrieve the processing result using this ID:\n\n```bash\npython client/cli.py result --task_id {your_task_id_from_upload_step}\n```\n\n### Online Demo\n\nYou can also try out a hosted version of the application by pointing the CLI tool at the project's cloud instance. Visit the [online demo](https://demo.doctractor.com/) for more details.\n\n## Why Use It\n\nChoosing `text-extract-api` offers several compelling advantages for document processing:\n\n*   **Data Privacy and Security**: Operate entirely on-premise with no external cloud dependencies, ensuring sensitive data remains within your control.\n*   **High Accuracy OCR**: Integrates state-of-the-art OCR engines like EasyOCR, MiniCPM-V, and Llama 3.2 Vision, providing exceptional accuracy for various document types and languages.\n*   **LLM-Enhanced Processing**: Leverage Ollama-supported LLMs to improve OCR results, fix spelling, extract structured JSON, and perform advanced tasks like PII removal.\n*   **Flexible Output Formats**: Convert documents and images into highly accurate Markdown text or structured JSON, adapting to your application's needs.\n*   **Scalable and Robust Architecture**: Built with FastAPI and Celery for asynchronous task processing and Redis for caching, supporting distributed workloads.\n*   **Versatile Storage Options**: Supports various storage strategies, 
including local filesystem, Google Drive, and Amazon S3, for managing your extracted data.\n*   **Developer-Friendly**: Provides a comprehensive CLI tool and API clients (e.g., TypeScript) for easy integration and automation.\n\n## Links\n\nFor more information, examples, and community support, explore these resources:\n\n*   **GitHub Repository**: [CatchTheTornado/text-extract-api](https://github.com/CatchTheTornado/text-extract-api)\n*   **Online Demo**: [demo.doctractor.com](https://demo.doctractor.com/)\n*   **Discord Community**: [Join us on Discord](https://discord.gg/NJzu47Ye3a)\n*   **TypeScript API Client**: [text-extract-api-client](https://github.com/CatchTheTornado/text-extract-api-client)\n*   **Contact**: [info@catchthetornado.com](mailto:info@catchthetornado.com)","metrics":{"detailViews":5,"githubClicks":5},"dates":{"published":null,"modified":"2025-10-12T00:50:44.000Z"}}