Infinity: High-Throughput, Low-Latency Serving for Text Embeddings and Reranking

Summary

Infinity is a powerful, high-throughput, and low-latency REST API designed for serving various AI models, including text embeddings, reranking, and multi-modal models. It supports deploying any model from HuggingFace with fast inference backends optimized for diverse accelerators. This engine simplifies the deployment and usage of advanced AI models for developers.

Repository Info

Updated on March 17, 2026

Introduction

Infinity is a high-throughput, low-latency REST API designed for serving text embeddings, reranking models, CLIP, CLAP, and ColPali models. Developed by Michael Feil, this Python-based project aims to provide a robust and efficient inference engine for various AI tasks. It is released under the MIT License, ensuring open and flexible usage for developers and researchers.

Installation

Getting started with Infinity is straightforward, whether you prefer a pip installation or using pre-built Docker containers.

Via pip

First, install the package with all its dependencies:

pip install "infinity-emb[all]"

Then, you can launch the CLI directly:

infinity_emb v2 --model-id BAAI/bge-small-en-v1.5

For more options, check the help command:

infinity_emb v2 --help

Via Docker (Recommended)

For a more isolated and consistent environment, using Docker is recommended. Ensure you have nvidia-docker installed if you plan to use GPUs.

port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
volume=$PWD/data

docker run -it --gpus all \
 -v $volume:/app/.cache \
 -p $port:$port \
 michaelf34/infinity:latest \
 v2 \
 --model-id $model1 \
 --model-id $model2 \
 --port $port
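Once the container is running, Infinity exposes an OpenAI-compatible REST API on the chosen port. Below is a minimal client-side sketch, assuming the server is reachable at `http://localhost:7997` and serves the first model above; the payload shape follows the OpenAI embeddings schema, and the actual POST (commented out) requires a live server:

```python
# Request body for Infinity's OpenAI-compatible /embeddings endpoint.
# The "model" value must match one of the --model-id values passed at launch.
payload = {
    "model": "michaelfeil/bge-small-en-v1.5",
    "input": ["A sentence to embed."],
}

# With a running server you would send it, for example:
#   import requests
#   resp = requests.post("http://localhost:7997/embeddings", json=payload)
#   vectors = [d["embedding"] for d in resp.json()["data"]]
```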

Examples

Infinity provides a flexible Python API for integrating its capabilities directly into your applications.

Embeddings

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
  EngineArgs(model_name_or_path = "BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])

async def embed_text(engine: AsyncEmbeddingEngine): 
    async with engine: 
        embeddings, usage = await engine.embed(sentences=sentences)
    # Alternatively, handle the async start / stop yourself:
    # await engine.astart()
    # embeddings, usage = await engine.embed(sentences=sentences)
    # await engine.astop()
asyncio.run(embed_text(array[0]))
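The returned embeddings are dense vectors that can be compared with cosine similarity. A minimal sketch with NumPy, using two small made-up vectors in place of real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two embeddings returned by engine.embed(...).
emb_a = np.array([0.1, 0.3, 0.5])
emb_b = np.array([0.2, 0.6, 1.0])  # same direction as emb_a, scaled by 2

score = cosine_similarity(emb_a, emb_b)  # 1.0: identical direction
```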

Reranking

Reranking returns a relevance score for each document with respect to a query.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
query = "What is the python package infinity_emb?"
docs = ["This is a document not related to the python package infinity_emb, hence...", 
    "Paris is in France!",
    "infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!"]
array = AsyncEngineArray.from_args(
  [EngineArgs(model_name_or_path = "mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)

async def rerank(engine: AsyncEmbeddingEngine): 
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))
    # Alternatively, handle the async start / stop yourself:
    # await engine.astart()
    # ranking, usage = await engine.rerank(query=query, docs=docs)
    # await engine.astop()

asyncio.run(rerank(array[0]))
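The scores come back in the same order as the input documents, so producing a ranked list is a matter of pairing and sorting. A small sketch with hypothetical scores (assuming `rerank` yields one float per document, as the `zip` in the example above suggests):

```python
# Hypothetical scores, one per document, as engine.rerank might return them.
docs = ["unrelated doc", "Paris is in France!", "infinity_emb docs"]
ranking = [0.02, 0.01, 0.97]

# Pair each document with its score and sort, best match first.
ranked = sorted(zip(ranking, docs), reverse=True)
best_score, best_doc = ranked[0]
```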

Image Embeddings: CLIP models

CLIP models are able to encode images and text at the same time.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
    model_name_or_path = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M", 
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine): 
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()

asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))
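Because CLIP places text and images in a shared vector space, retrieval reduces to cosine similarity between the two sets of embeddings. A sketch with stand-in vectors (real CLIP embeddings would come from `engine.embed` and `engine.image_embed` above):

```python
import numpy as np

# Stand-ins for CLIP embeddings: two sentences, one image.
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
image_embs = np.array([[0.9, 0.1]])

# Normalize rows, then a matmul yields all text-image cosine similarities.
t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
i = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
sims = t @ i.T                          # shape (2, 1)
best_text = int(np.argmax(sims[:, 0]))  # index of the best-matching sentence
```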

Audio Embeddings: CLAP models

CLAP models are able to encode audio and text at the same time.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import requests
import soundfile as sf
import io

sentences = ["This is awesome.", "I am bored."]

url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
raw_bytes = requests.get(url, stream=True).content

audios = [raw_bytes]
engine_args = EngineArgs(
    model_name_or_path = "laion/clap-htsat-unfused",
    dtype="float32",
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine): 
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embedding_audios, _ = await engine.audio_embed(audios=audios)
    await engine.astop()

asyncio.run(embed(array["laion/clap-htsat-unfused"]))

Text Classification

Use text classification with Infinity's classify feature, which supports sentiment analysis, emotion detection, and other classification tasks.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
    model_name_or_path = "SamLowe/roberta-base-go_emotions", 
    engine="torch", model_warmup=True)
array = AsyncEngineArray.from_args([engine_args])

async def classifier(engine: AsyncEmbeddingEngine): 
    async with engine:
        predictions, usage = await engine.classify(sentences=sentences)
    # Alternatively, handle the async start / stop yourself:
    # await engine.astart()
    # predictions, usage = await engine.classify(sentences=sentences)
    # await engine.astop()
asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))
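Each input sentence yields a set of label/score predictions; picking the top label is a one-liner. The sketch below uses made-up predictions in an assumed shape (a list of `{"label", "score"}` dicts per sentence), so the exact structure should be checked against the real `classify` output:

```python
# Hypothetical classify output for two sentences.
predictions = [
    [{"label": "admiration", "score": 0.92}, {"label": "joy", "score": 0.05}],
    [{"label": "boredom", "score": 0.88}, {"label": "neutral", "score": 0.10}],
]

# Top label per sentence: the entry with the highest score.
top_labels = [max(p, key=lambda d: d["score"])["label"] for p in predictions]
```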

Why Use Infinity?

Infinity offers several compelling features for deploying and managing AI models:

  • Deploy Any HuggingFace Model: Easily deploy any embedding, reranking, CLIP, or sentence-transformer model available on HuggingFace.
  • Fast Inference Backends: Built on PyTorch, Optimum (ONNX/TensorRT), and CTranslate2, Infinity leverages FlashAttention for optimal performance on NVIDIA CUDA, AMD ROCm, CPU, AWS Inf2, or Apple MPS accelerators. It also uses dynamic batching and dedicated worker threads.
  • Multi-modal and Multi-model Support: Mix and match multiple models, including multi-modal capabilities for image and audio embeddings. Infinity orchestrates their execution seamlessly.
  • Tested Implementation: The project boasts unit and end-to-end testing, ensuring accurate and reliable embeddings.
  • Easy to Use: Built on FastAPI, Infinity provides a user-friendly CLI and an OpenAPI-aligned API specification, making it simple to integrate and manage.
