Kreuzberg: A Polyglot Document Intelligence Framework with a Rust Core

Introduction

Kreuzberg is a polyglot document intelligence framework built on a high-performance Rust core. It extracts text, metadata, and structured information from 56+ file formats, including PDFs, Office documents, and various image types. Native bindings are available for Rust, Python, Ruby, Go, Java, C#, PHP, Elixir, and TypeScript/Node.js. Beyond its library form, it can also be deployed as a CLI tool, a REST API server, or an MCP server. Kreuzberg v4.0.0 is currently at the Release Candidate stage, with development ongoing.

Installation

Kreuzberg provides documentation and an installation guide for each supported language binding, with examples and best practices tailored to each platform.

Installation instructions are available for the following environments:

Rust: Core library
Python: PyPI package
Ruby: RubyGems package
Node.js/TypeScript: @kreuzberg/node (NAPI-RS) and @kreuzberg/wasm (WebAssembly)
Go: Go module
Java: Maven Central
C#: NuGet package
PHP: Composer package
Elixir: Hex package
CLI: Cross-platform binary
Docker: Official images

For embeddings functionality, ensure ONNX Runtime 1.22.x is installed separately, as detailed in the official documentation.
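
As a rough illustration of the version requirement above, the following Python sketch checks whether a version string belongs to the 1.22.x series. The helper names are hypothetical and not part of Kreuzberg. The lookup assumes ONNX Runtime was installed as the `onnxruntime` Python package, which may not match how Kreuzberg locates the runtime on your system; the official documentation remains the authority here.

```python
from importlib import metadata
from typing import Optional, Tuple

# Assumption: "1.22.x" means major version 1, minor version 22, any patch level.
REQUIRED_SERIES: Tuple[int, int] = (1, 22)


def matches_required_series(version: str,
                            required: Tuple[int, int] = REQUIRED_SERIES) -> bool:
    """Return True if `version` starts with the required major.minor pair."""
    parts = version.split(".")
    try:
        major_minor = tuple(int(p) for p in parts[:2])
    except ValueError:
        # Non-numeric components (e.g. "garbage") never match.
        return False
    return major_minor == required


def installed_onnxruntime_version() -> Optional[str]:
    """Look up the installed `onnxruntime` Python package, if there is one.

    Hypothetical helper: it only sees the Python wheel, not a system-wide
    shared library.
    """
    try:
        return metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_onnxruntime_version()` and passing a non-`None` result to `matches_required_series` gives a quick sanity check before enabling embeddings.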

Examples

Kreuzberg can be used as a library inside an application, as a command-line interface (CLI) tool for script-based processing, or as a REST API server in a microservices architecture.

For code examples and usage patterns in each supported programming language, refer to the official documentation, which provides guides and snippets for implementing document intelligence features.

Why Use Kreuzberg?

Several features distinguish Kreuzberg as a document intelligence solution:

Polyglot Nature: Native bindings for nine languages (Rust, Python, Ruby, Go, Java, C#, PHP, Elixir, and TypeScript/Node.js) let Kreuzberg fit into most tech stacks.
High Performance: Its Rust core, leveraging native PDFium, SIMD optimizations, and full parallelism, delivers exceptional speed and efficiency for document processing.
Extensive Format Support: It supports 56+ file formats across 8 categories, including advanced OCR capabilities for images and PDFs, and intelligent table detection.
Extensible Architecture: A plugin system allows for custom OCR backends, validators, post-processors, and document extractors, making it highly adaptable to specific needs.
Flexible Deployment: Kreuzberg can run as a library for direct application integration, as a CLI for automation, or as a REST API or MCP server for scalable services.
Memory Efficiency: Streaming parsers handle multi-GB files while keeping the memory footprint small.
Advanced Features: Includes batch processing, support for password-protected PDFs, automatic language detection, and comprehensive metadata extraction.
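
To make the plugin idea from the feature list more concrete, here is a minimal, self-contained Python sketch of a post-processor registry. It is not Kreuzberg's actual plugin API: `ExtractionResult`, `PluginRegistry`, and both post-processors are hypothetical names invented for illustration, and the real interfaces are described in the official documentation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ExtractionResult:
    """Hypothetical stand-in for the output of a document extraction."""
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A post-processor takes a result and returns a (possibly new) result.
PostProcessor = Callable[[ExtractionResult], ExtractionResult]


class PluginRegistry:
    """Keeps named post-processors and applies them in registration order."""

    def __init__(self) -> None:
        self._post_processors: List[Tuple[str, PostProcessor]] = []

    def register(self, name: str, processor: PostProcessor) -> None:
        # Reject duplicate names so a plugin cannot be registered twice.
        if any(existing == name for existing, _ in self._post_processors):
            raise ValueError(f"post-processor {name!r} is already registered")
        self._post_processors.append((name, processor))

    def run(self, result: ExtractionResult) -> ExtractionResult:
        # Each processor receives the output of the previous one.
        for _, processor in self._post_processors:
            result = processor(result)
        return result


def collapse_whitespace(result: ExtractionResult) -> ExtractionResult:
    """Replace every run of whitespace with a single space."""
    return ExtractionResult(" ".join(result.content.split()),
                            dict(result.metadata))


def add_word_count(result: ExtractionResult) -> ExtractionResult:
    """Record the number of whitespace-separated words in the metadata."""
    metadata = dict(result.metadata)
    metadata["word_count"] = str(len(result.content.split()))
    return ExtractionResult(result.content, metadata)
```

In use, one would create a `PluginRegistry`, register `collapse_whitespace` and `add_word_count`, and call `run` on each extraction result. Keeping processors as plain functions that return a new result makes them easy to test in isolation and to reorder.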
