# text-extract-api: Advanced Document Extraction, OCR, and PII Removal with LLMs

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Source: osrepos.com
Repository profile: https://osrepos.com/repo/catchthetornado-text-extract-api
Generated for open source discovery and AI-assisted research.

text-extract-api is a powerful API designed for extracting and parsing text from various document formats, including PDF, Word, and PPTX. It utilizes modern OCRs and Ollama-supported LLMs for highly accurate text extraction, PII removal, and conversion to structured JSON or Markdown, all while maintaining data privacy through its self-hosted architecture.

GitHub: https://github.com/CatchTheTornado/text-extract-api

## Topics

- anonymization
- api
- document processing
- llm
- ocr
- pdf
- pii
- python

## Repository Information

Last analyzed by OSRepos: Sun Oct 12 2025 01:50:44 GMT+0100 (Western European Summer Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Introduction

`text-extract-api` is a privacy-focused, self-hosted service for document processing. Built with FastAPI, Celery, and Redis, it combines OCR with Ollama-supported Large Language Models (LLMs) to extract, parse, and transform content from document types such as PDF, Word, and PPTX files. Because it is self-hosted, no data leaves your environment, which makes it well suited to processing sensitive information.

## Installation

Getting `text-extract-api` up and running is straightforward, whether you prefer a native setup or Docker.

### Prerequisites

Before you begin, ensure you have:

*   [Ollama](https://ollama.com/download)
*   [Docker](https://www.docker.com/products/docker-desktop/)
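
As a quick sanity check before continuing, the prerequisites above can be verified from Python. This is a minimal sketch, assuming both tools install executables named `ollama` and `docker` on your `PATH`; the helper name `find_missing_tools` is made up for this example and is not part of the project.

```python
import shutil


def find_missing_tools(tools=("ollama", "docker")):
    """Return the names from `tools` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("Ollama and Docker were both found on PATH.")
```

Finding the executables only shows that the tools are installed; it does not confirm that the Docker daemon or the Ollama server is actually running.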

### Clone the Repository

Start by cloning the official repository:

```sh
git clone https://github.com/CatchTheTornado/text-extract-api.git
cd text-extract-api
```


### Local Setup with Makefile

For a quick local setup, you can use the provided `Makefile`:

```bash
DISABLE_VENV=1 make install
DISABLE_VENV=1 make run
```


### Docker Setup

For containerized deployment, use Docker Compose. First, copy the example environment file:

```bash
cp .env.example .env
```


Then, build and run the containers:

```bash
docker-compose up --build
```


For GPU support, use:

```bash
docker-compose -f docker-compose.gpu.yml -p text-extract-api-gpu up --build
```


Refer to the official [README](https://github.com/CatchTheTornado/text-extract-api#getting-started) for detailed manual installation steps and specific dependencies for different operating systems.

## Examples

`text-extract-api` ships with a CLI client (`client/cli.py`) for interacting with the API. Here are a few examples to get you started:

### Pull LLM Models

Before using LLM features, pull the necessary models:

```bash
python client/cli.py llm_pull --model llama3.1
python client/cli.py llm_pull --model llama3.2-vision
```


### Convert Document to Markdown or JSON

Upload a PDF file for OCR processing and conversion. You can specify a prompt for LLM processing and even save the result to disk.

```bash
python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en --storage_filename "reports/{Y}/{file_name}-{Y}-{mm}-{dd}.md"
```


This command processes an MRI report, removes PII, and saves the output as a Markdown file, demonstrating both OCR and LLM capabilities. Screenshots in the repository's README illustrate converting MRI reports to Markdown and JSON, and invoices with PII removal.
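
To make the `--storage_filename` pattern in the command above more concrete, here is a minimal sketch of how such a template could expand. It is an illustration, not the project's implementation: it assumes `{Y}`, `{mm}`, and `{dd}` stand for the four-digit year, two-digit month, and two-digit day, and that `{file_name}` is the uploaded file's base name without its extension. The function name `expand_storage_filename` is hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def expand_storage_filename(template: str, source_file: str, today: date) -> str:
    """Fill {file_name}, {Y}, {mm}, and {dd} placeholders in a storage path template."""
    values = {
        # Assumption: {file_name} is the base name without the extension.
        "file_name": PurePosixPath(source_file).stem,
        "Y": f"{today.year:04d}",
        "mm": f"{today.month:02d}",
        "dd": f"{today.day:02d}",
    }
    # str.format replaces every occurrence, so {Y} may appear more than once.
    return template.format(**values)


print(expand_storage_filename(
    "reports/{Y}/{file_name}-{Y}-{mm}-{dd}.md",
    "examples/example-mri.pdf",
    date(2025, 10, 12),
))
# prints: reports/2025/example-mri-2025-10-12.md
```

Check the project README for the exact set of supported placeholders before relying on a particular pattern.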

### Get OCR Result by Task ID

After uploading a file, you receive a task ID. You can retrieve the processing result using this ID:

```bash
python client/cli.py result --task_id {your_task_id_from_upload_step}
```
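
Because processing runs asynchronously, a client usually has to poll until the task finishes. The loop below is a minimal sketch of that pattern, not the project's client code. `SUCCESS`, `FAILURE`, and `PENDING` are Celery's built-in state names, but the response shape (the `state`, `result`, and `error` keys) and the `fetch_status` callable are assumptions made for this example; a stand-in function replaces the real HTTP call so the sketch runs on its own.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
) -> Any:
    """Poll fetch_status(task_id) until the task succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(task_id)
        state = status.get("state")
        if state == "SUCCESS":
            return status.get("result")
        if state == "FAILURE":
            raise RuntimeError(f"task {task_id} failed: {status.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout} s")
        time.sleep(interval)


# Stand-in for a real HTTP request: reports PENDING twice, then SUCCESS.
_responses = iter([
    {"state": "PENDING"},
    {"state": "PENDING"},
    {"state": "SUCCESS", "result": "# Extracted text"},
])


def fake_fetch(task_id: str) -> Dict[str, Any]:
    return next(_responses)


print(wait_for_result(fake_fetch, "demo-task", interval=0.01))
# prints: # Extracted text
```

Passing the fetch function in as a parameter keeps the polling logic independent of any particular HTTP library and makes it easy to test.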


### Online Demo

You can also try a hosted version of the application by pointing the CLI tool at the project's cloud instance. Visit the [online demo](https://demo.doctractor.com/) for more details.

## Why Use It

Choosing `text-extract-api` offers several compelling advantages for document processing:

*   **Data Privacy and Security**: Operate entirely on-premise with no external cloud dependencies, ensuring sensitive data remains within your control.
*   **High Accuracy OCR**: Integrates EasyOCR alongside vision models such as MiniCPM-V and Llama 3.2 Vision, so you can choose the engine that works best for your document types and languages.
*   **LLM-Enhanced Processing**: Leverage Ollama-supported LLMs to improve OCR results, fix spelling, extract structured JSON, and perform advanced tasks like PII removal.
*   **Flexible Output Formats**: Convert documents and images into highly accurate Markdown text or structured JSON, adapting to your application's needs.
*   **Scalable and Robust Architecture**: Built with FastAPI and Celery for asynchronous task processing and Redis for caching, supporting distributed workloads.
*   **Versatile Storage Options**: Supports various storage strategies, including local filesystem, Google Drive, and Amazon S3, for managing your extracted data.
*   **Developer-Friendly**: Provides a CLI tool and API clients (e.g., TypeScript) for easy integration and automation.
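
The caching idea mentioned in the list above can be illustrated with a small sketch. It is not taken from the project: a plain dictionary stands in for Redis, and the key layout (OCR strategy, language, and a SHA-256 digest of the file contents) as well as the names `ocr_cache_key` and `cached_ocr` are assumptions chosen for the example.

```python
import hashlib
from typing import Callable, Dict


def ocr_cache_key(file_bytes: bytes, strategy: str, language: str) -> str:
    """Build a cache key from the file contents and the OCR settings."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"ocr:{strategy}:{language}:{digest}"


def cached_ocr(
    file_bytes: bytes,
    strategy: str,
    language: str,
    run_ocr: Callable[[bytes], str],
    cache: Dict[str, str],
) -> str:
    """Return the cached OCR text if present; otherwise run OCR and store the result."""
    key = ocr_cache_key(file_bytes, strategy, language)
    if key not in cache:
        cache[key] = run_ocr(file_bytes)
    return cache[key]


# Demo with a dict in place of Redis and a dummy OCR function that counts its calls.
ocr_calls = []


def dummy_ocr(data: bytes) -> str:
    ocr_calls.append(len(data))
    return "recognized text"


store: Dict[str, str] = {}
cached_ocr(b"same file", "easyocr", "en", dummy_ocr, store)
cached_ocr(b"same file", "easyocr", "en", dummy_ocr, store)
print(len(ocr_calls))
# prints: 1
```

Including the strategy and language in the key keeps results from different OCR settings from overwriting each other for the same file.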

## Links

For more information, examples, and community support, explore these resources:

*   **GitHub Repository**: [CatchTheTornado/text-extract-api](https://github.com/CatchTheTornado/text-extract-api)
*   **Online Demo**: [demo.doctractor.com](https://demo.doctractor.com/)
*   **Discord Community**: [Join us on Discord](https://discord.gg/NJzu47Ye3a)
*   **TypeScript API Client**: [text-extract-api-client](https://github.com/CatchTheTornado/text-extract-api-client)
*   **Contact**: [info@catchthetornado.com](mailto:info@catchthetornado.com)