# Infinity: High-Throughput, Low-Latency Serving for Text Embeddings and Reranking

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Repository profile: https://osrepos.com/repo/michaelfeil-infinity
Infinity is a high-throughput, low-latency REST API for serving text-embedding, reranking, and multi-modal models. It deploys embedding, reranking, and CLIP models from HuggingFace on inference backends optimized for a range of accelerators, so developers can serve these models without writing their own inference server.

GitHub: https://github.com/michaelfeil/infinity

## Topics

- Python
- AI
- Machine Learning
- Embeddings
- Reranking
- LLM
- BERT Embeddings
- Inference Engine

## Repository Information

Last analyzed by OSRepos: Tue Mar 17 2026 16:56:30 GMT+0000 (Western European Standard Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Introduction

Infinity is a high-throughput, low-latency REST API designed for serving text embeddings, reranking models, CLIP, CLAP, and ColPali models. Developed by Michael Feil, this Python-based project aims to provide a robust and efficient inference engine for various AI tasks. It is released under the MIT License, ensuring open and flexible usage for developers and researchers.

## Installation

Getting started with Infinity is straightforward, whether you prefer a `pip` installation or using pre-built Docker containers.

### Via pip

First, install the package with all its dependencies:

```bash
# Quote the extras so shells such as zsh do not expand the brackets.
pip install "infinity-emb[all]"
```


Then, you can launch the CLI directly:

```bash
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
```


For more options, check the help command:

```bash
infinity_emb v2 --help
```


### Via Docker (Recommended)

For a more isolated and consistent environment, using Docker is recommended. Ensure you have `nvidia-docker` installed if you plan to use GPUs.

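```bash
port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
# Host directory mounted as the container's model cache.
volume=$PWD/data

# Serve both models from one container; --model-id may be repeated.
docker run -it --gpus all \
  -v "$volume":/app/.cache \
  -p "$port":"$port" \
  michaelf34/infinity:latest \
  v2 \
  --model-id "$model1" \
  --model-id "$model2" \
  --port "$port"
```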
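
Once the server is running, clients talk to it over HTTP. The sketch below builds a request for an embeddings endpoint using only the Python standard library. It assumes the container above listens on `localhost:7997` and exposes an OpenAI-style `POST /embeddings` route that accepts `model` and `input` fields. Neither the route nor the payload shape is stated in this profile, so check the OpenAPI docs served by your deployment before relying on it. `build_embedding_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_embedding_request(
    base_url: str, model: str, texts: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST request for an assumed /embeddings route."""
    # Assumed payload shape: {"model": ..., "input": [...]}.
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; sending it does.
request = build_embedding_request(
    "http://localhost:7997",
    "michaelfeil/bge-small-en-v1.5",
    ["Paris is in France."],
)
```

Calling `send(request)` against a running server should return a JSON body containing the vectors. Because two models were loaded, the `model` field selects which one handles the request.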


## Examples

Infinity provides a flexible Python API for integrating its capabilities directly into your applications.

### Embeddings

```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])

async def embed_text(engine: AsyncEmbeddingEngine):
    # Option 1: the context manager starts and stops the engine.
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
    # Option 2: handle the async start / stop yourself.
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    await engine.astop()

asyncio.run(embed_text(array[0]))
```
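
A common next step is comparing the returned vectors. The following sketch computes cosine similarity in plain Python. It assumes each embedding can be iterated as a sequence of floats (the return type of `engine.embed` is not specified in this profile). `cosine_similarity` is an illustrative helper, not part of `infinity_emb`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With the example above, `cosine_similarity(embeddings[0], embeddings[1])` would score how close the two sentences are. For large batches a vectorized library is the better choice.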


### Reranking

Reranking scores how relevant each document in a list is to a given query.

```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

query = "What is the python package infinity_emb?"
docs = [
    "This is a document not related to the python package infinity_emb, hence...",
    "Paris is in France!",
    "infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!",
]
# A list of EngineArgs belongs to AsyncEngineArray, which is then indexed below.
array = AsyncEngineArray.from_args(
    [EngineArgs(model_name_or_path="mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)

async def rerank(engine: AsyncEmbeddingEngine):
    # Option 1: the context manager starts and stops the engine.
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))
    # Option 2: handle the async start / stop yourself.
    await engine.astart()
    ranking, usage = await engine.rerank(query=query, docs=docs)
    await engine.astop()

asyncio.run(rerank(array[0]))
```
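
To act on the scores, the documents usually get sorted and truncated. The sketch below does that in plain Python. It assumes `ranking` holds one float per document in input order, as the `zip(ranking, docs)` call above suggests; some library versions may return richer result objects instead. `top_k` is a hypothetical helper and the scores are made up for illustration.

```python
from typing import Sequence


def top_k(
    scores: Sequence[float], docs: Sequence[str], k: int = 2
) -> list[tuple[float, str]]:
    """Pair each score with its document and keep the k highest-scoring pairs."""
    if len(scores) != len(docs):
        raise ValueError("scores and docs must have the same length")
    paired = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return paired[:k]


# Made-up scores, one per document, in the same order as the documents.
example_scores = [0.12, 0.03, 0.97]
example_docs = ["unrelated document", "Paris is in France!", "infinity_emb package"]
best = top_k(example_scores, example_docs, k=2)
```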


### Image Embeddings: CLIP models

CLIP models encode images and text into a shared embedding space, so the two can be compared directly.

```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
    model_name_or_path="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M",
    engine="torch",
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    # The same engine embeds both text and images.
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()

# Engines in the array can be looked up by model name.
asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))
```
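
Because text and image vectors live in one space, the text embeddings can serve as candidate labels for an image. The sketch below scores each text by its dot product with the image vector and converts the scores to probabilities with a softmax. It assumes the vectors are already L2-normalized, which this profile does not confirm. `label_probabilities` and its `scale` temperature are hypothetical, and the two-dimensional vectors are toy data.

```python
import math
from typing import Sequence


def softmax(values: Sequence[float]) -> list[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(
    image_vec: Sequence[float],
    text_vecs: Sequence[Sequence[float]],
    scale: float = 100.0,
) -> list[float]:
    """Score each text vector against the image vector, then apply softmax."""
    logits = [
        scale * sum(i * t for i, t in zip(image_vec, text_vec))
        for text_vec in text_vecs
    ]
    return softmax(logits)


# Toy unit vectors: the first text matches the image, the second is orthogonal.
toy_image = [0.6, 0.8]
toy_texts = [[0.6, 0.8], [0.8, -0.6]]
probs = label_probabilities(toy_image, toy_texts, scale=10.0)
```

The same comparison would apply to audio and text vectors from a CLAP model.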


### Audio Embeddings: CLAP models

CLAP models encode audio and text into a shared embedding space.

```python
import asyncio
import requests
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]

# Download a sample WAV file; the engine receives the raw bytes.
url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
raw_bytes = requests.get(url, stream=True).content

audios = [raw_bytes]
engine_args = EngineArgs(
    model_name_or_path="laion/clap-htsat-unfused",
    dtype="float32",
    engine="torch",
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    # Unpack (embeddings, usage), matching the image example.
    embedding_audios, _ = await engine.audio_embed(audios=audios)
    await engine.astop()

asyncio.run(embed(array["laion/clap-htsat-unfused"]))
```


### Text Classification

Infinity's `classify` method runs text classification models, covering tasks such as sentiment analysis and emotion detection.

```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
    model_name_or_path="SamLowe/roberta-base-go_emotions",
    engine="torch",
    model_warmup=True,
)
array = AsyncEngineArray.from_args([engine_args])

async def classifier(engine: AsyncEmbeddingEngine):
    # Option 1: the context manager starts and stops the engine.
    async with engine:
        predictions, usage = await engine.classify(sentences=sentences)
    # Option 2: handle the async start / stop yourself.
    await engine.astart()
    predictions, usage = await engine.classify(sentences=sentences)
    await engine.astop()

asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))
```
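
Classification output typically needs reducing to one label per input. The sketch below assumes `predictions` holds, for each sentence, a list of dictionaries with `label` and `score` keys. That shape is an assumption and is not documented in this profile, so inspect the real return value first. `top_labels` is a hypothetical helper and the scores are made up.

```python
def top_labels(predictions: list[list[dict]]) -> list[str]:
    """Return the highest-scoring label for each sentence's prediction list."""
    return [
        max(per_sentence, key=lambda item: item["score"])["label"]
        for per_sentence in predictions
    ]


# Made-up predictions in the assumed shape, one inner list per sentence.
sample_predictions = [
    [{"label": "admiration", "score": 0.91}, {"label": "neutral", "score": 0.05}],
    [{"label": "neutral", "score": 0.30}, {"label": "disappointment", "score": 0.55}],
]
labels = top_labels(sample_predictions)
```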


## Why Use Infinity?

Infinity offers several compelling features for deploying and managing AI models:

*   **Deploy Any HuggingFace Model**: Easily deploy any embedding, reranking, CLIP, or sentence-transformer model available on HuggingFace.
*   **Fast Inference Backends**: Built on PyTorch, Optimum (ONNX/TensorRT), and CTranslate2, Infinity uses FlashAttention to speed up inference and runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, and Apple MPS accelerators. It also uses dynamic batching and dedicated worker threads.
*   **Multi-modal and Multi-model Support**: Mix and match multiple models, including multi-modal capabilities for image and audio embeddings. Infinity orchestrates their execution seamlessly.
*   **Tested Implementation**: The project boasts unit and end-to-end testing, ensuring accurate and reliable embeddings.
*   **Easy to Use**: Built on FastAPI, Infinity provides a user-friendly CLI and an OpenAPI-aligned API specification, making it simple to integrate and manage.
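
Dynamic batching, mentioned in the list above, groups requests that arrive close together so the model runs once per batch rather than once per request. The sketch below illustrates the general idea with `asyncio`. It is not Infinity's implementation: `MicroBatcher`, `max_batch_size`, `max_wait_s`, and `fake_model` are hypothetical names chosen for this example.

```python
import asyncio
from typing import Awaitable, Callable


class MicroBatcher:
    """Collect queued items into batches bounded by size and waiting time."""

    def __init__(
        self,
        run_batch: Callable[[list[str]], Awaitable[list[int]]],
        max_batch_size: int = 8,
        max_wait_s: float = 0.01,
    ) -> None:
        self._run_batch = run_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: list[int] = []  # recorded for inspection only

    async def submit(self, item: str) -> int:
        """Enqueue one item and wait until its batch has been processed."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def worker(self) -> None:
        """Form batches forever; cancel the task to stop."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until work arrives
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = await self._run_batch([item for item, _ in batch])
            except Exception as exc:
                # Propagate a batch failure to every waiting caller.
                for _, future in batch:
                    future.set_exception(exc)
                continue
            self.batch_sizes.append(len(batch))
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def fake_model(texts: list[str]) -> list[int]:
    """Stand-in for a model call: returns the length of each text."""
    return [len(text) for text in texts]


async def demo() -> tuple[list[int], list[int]]:
    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    texts = ["a", "bb", "ccc", "dddd", "eeeee"]
    results = await asyncio.gather(*(batcher.submit(text) for text in texts))
    worker.cancel()
    return list(results), batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
```

Each caller awaits only its own result, while `max_batch_size` and `max_wait_s` trade throughput against the extra latency a request spends waiting for its batch to fill.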

## Links

*   **GitHub Repository**: [https://github.com/michaelfeil/infinity](https://github.com/michaelfeil/infinity)
*   **Official Documentation**: [https://michaelfeil.github.io/infinity](https://michaelfeil.github.io/infinity)
*   **Python Client**: [https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client](https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client)
*   **Integrations**:
    *   [Runpod Serverless Deployments](https://github.com/runpod-workers/worker-infinity-embedding)
    *   [Truefoundry Cognita](https://github.com/truefoundry/cognita)
    *   [Langchain Example](https://python.langchain.com/docs/integrations/text_embedding/infinity)
    *   [imitater - Unified Language Model Server](https://github.com/the-seeds/imitater)
    *   [Dwarves Foundation: LLM Hosting Examples](https://github.com/dwarvesf/llm-hosting)