Kreuzberg: A Polyglot Document Intelligence Framework with a Rust Core

Summary
Kreuzberg is a polyglot document intelligence framework built around a Rust core. It extracts text, metadata, and structured information from more than 50 file formats, including PDFs, Office documents, and images. It can be used from several languages, including Rust, Python, Ruby, Go, and Node.js, or run as a CLI, a REST API server, or an MCP server.
Introduction
Kreuzberg is a polyglot document intelligence framework built around a Rust core. It extracts text, metadata, and structured information from more than 50 file formats, including PDFs, Office documents, and various image types. Native bindings are available for Rust, Python, Ruby, Go, Java, C#, PHP, Elixir, and TypeScript/Node.js, so it fits into a wide range of development environments. Besides its library form, it can be deployed as a CLI tool, a REST API server, or an MCP server. Kreuzberg v4.0.0 is currently at the Release Candidate stage and remains under active development.
Installation
The official documentation includes an installation guide for each supported language binding, with examples and best practices for that platform.
Installation instructions are available for the following environments:
- Rust: Core library
- Python: PyPI package
- Ruby: RubyGems package
- Node.js/TypeScript: @kreuzberg/node (NAPI-RS) and @kreuzberg/wasm (WebAssembly)
- Go: Go module
- Java: Maven Central
- C#: NuGet package
- PHP: Composer package
- Elixir: Hex package
- CLI: Cross-platform binary
- Docker: Official images
The embeddings functionality requires ONNX Runtime 1.22.x, which must be installed separately; the official documentation describes the setup.
Examples
Kreuzberg can be used as a library inside an application, as a command-line interface (CLI) tool for script-based processing, or as a REST API server in a microservices architecture.
For code examples and usage patterns specific to each supported programming language, refer to the official documentation, which provides guides and snippets for each binding.
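As a rough illustration of the library-style integration described above, the sketch below fans a list of files out to an extraction function on a thread pool and collects results alongside per-file errors. It does not use Kreuzberg's API: the `extract` argument is a hypothetical stand-in for whatever single-file extraction call your chosen binding provides, and `extract_many` and `fake_extract` are names invented here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    paths: Iterable[str],
    extract: Callable[[Path], str],
    max_workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extract` over every path; return (texts, errors), both keyed by path."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}

    def run_one(raw: str) -> Tuple[str, str, bool]:
        try:
            return raw, extract(Path(raw)), True
        except Exception as exc:  # keep one bad file from aborting the batch
            return raw, f"{type(exc).__name__}: {exc}", False

    # pool.map yields results in input order, so the output is deterministic.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for raw, payload, ok in pool.map(run_one, list(paths)):
            (texts if ok else errors)[raw] = payload
    return texts, errors


def fake_extract(path: Path) -> str:
    """Hypothetical stand-in extractor (assumption): rejects unknown suffixes."""
    if path.suffix.lower() not in {".pdf", ".docx", ".png"}:
        raise ValueError(f"unsupported format: {path.suffix}")
    return f"text of {path.name}"
```

In real use, `fake_extract` would be replaced by a thin wrapper around the binding's own extraction function, as documented for that language. Collecting errors per file rather than raising keeps a single unreadable document from discarding the rest of a batch.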
Why Use Kreuzberg?
Kreuzberg offers several features that make it well suited to document intelligence work:
- Polyglot Nature: Native bindings for a wide array of languages let Kreuzberg fit into most tech stacks.
- High Performance: The Rust core uses native PDFium, SIMD optimizations, and full parallelism to process documents quickly and efficiently.
- Extensive Format Support: It supports 56+ file formats across 8 categories, including advanced OCR capabilities for images and PDFs, and intelligent table detection.
- Extensible Architecture: A plugin system allows for custom OCR backends, validators, post-processors, and document extractors, making it highly adaptable to specific needs.
- Flexible Deployment: Whether you need a library for direct application integration, a CLI for automation, or a REST API/MCP server for scalable services, Kreuzberg offers versatile deployment options.
- Memory Efficiency: Streaming parsers handle multi-GB files while keeping the memory footprint small.
- Advanced Features: Includes batch processing, support for password-protected PDFs, automatic language detection, and comprehensive metadata extraction.
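To make the plugin idea from the list above more concrete, here is a minimal sketch of a post-processor registry: named text transforms are registered once and then applied in order to extracted text. This shows the general pattern only and is not Kreuzberg's plugin API; `PostProcessorRegistry`, `collapse_whitespace`, and `strip_page_markers` are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

# A post-processor is modeled here as a plain function from text to text.
PostProcessor = Callable[[str], str]


class PostProcessorRegistry:
    """Holds named post-processors and applies them in registration order."""

    def __init__(self) -> None:
        # Dicts preserve insertion order (Python 3.7+), which fixes the pipeline order.
        self._processors: Dict[str, PostProcessor] = {}

    def register(self, name: str, processor: PostProcessor) -> None:
        if name in self._processors:
            raise ValueError(f"post-processor already registered: {name}")
        self._processors[name] = processor

    def names(self) -> List[str]:
        return list(self._processors)

    def apply(self, text: str) -> str:
        for processor in self._processors.values():
            text = processor(text)
        return text


def strip_page_markers(text: str) -> str:
    """Drop lines made up only of digits (a crude page-number filter)."""
    return "\n".join(
        line for line in text.splitlines() if not line.strip().isdigit()
    )


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace, including newlines, with one space."""
    return " ".join(text.split())
```

A caller would register `strip_page_markers` before `collapse_whitespace`, since the first relies on line boundaries that the second removes. Ordering constraints of this kind are one reason to have a registry apply processors in a defined sequence.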
Links
- GitHub Repository: kreuzberg-dev/kreuzberg
- Official Documentation: kreuzberg.dev
- Discord Community: Join Discord