Infinity: High-Throughput, Low-Latency Serving for Text Embeddings and Reranking

Summary
Infinity is a high-throughput, low-latency REST API for serving AI models, including text embedding, reranking, and multi-modal models. It supports deploying any compatible model from HuggingFace with fast inference backends optimized for diverse accelerators, making advanced AI models simple for developers to deploy and use.
Introduction
Infinity is a high-throughput, low-latency REST API designed for serving text embeddings, reranking models, CLIP, CLAP, and ColPali models. Developed by Michael Feil, this Python-based project aims to provide a robust and efficient inference engine for various AI tasks. It is released under the MIT License, ensuring open and flexible usage for developers and researchers.
Installation
Getting started with Infinity is straightforward, whether you prefer a pip installation or using pre-built Docker containers.
Via pip
First, install the package with all optional dependencies (the quotes keep shells like zsh from expanding the brackets):
pip install "infinity-emb[all]"
Then, you can launch the CLI directly:
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
For more options, check the help command:
infinity_emb v2 --help
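With the server running, clients talk to it over plain HTTP. Below is a minimal sketch of the request body for the OpenAI-style /embeddings route; the default port 7997, the route path, and the response shape in the comment are assumptions here, so verify them against the Swagger docs your instance serves at /docs.

```python
import json

# Sketch of a request body for Infinity's OpenAI-style /embeddings route.
# Port 7997 is assumed to be the default; the model id must match the
# --model-id passed on the CLI above.
payload = {
    "model": "BAAI/bge-small-en-v1.5",
    "input": ["Embed this sentence via Infinity.", "Paris is in France."],
}
body = json.dumps(payload)

# With the server up and `requests` installed, the call would look like:
#   import requests
#   data = requests.post("http://localhost:7997/embeddings", json=payload).json()
#   vectors = [item["embedding"] for item in data["data"]]
print(body)
```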
Via Docker (Recommended)
For a more isolated and consistent environment, using Docker is recommended. Ensure you have nvidia-docker installed if you plan to use GPUs.
port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
volume=$PWD/data
docker run -it --gpus all \
 -v $volume:/app/.cache \
 -p $port:$port \
 michaelf34/infinity:latest \
 v2 \
 --model-id $model1 \
 --model-id $model2 \
 --port $port
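Because the container serves two models, a client addresses each one by name. The sketch below builds a request body for reranking with the second model; the /rerank path and the field names are assumptions on my part, so confirm them against the OpenAPI docs the container serves at /docs.

```python
import json

# Hypothetical /rerank request body for the second model served above.
# Field names ("query", "documents") are assumed; check the live
# OpenAPI schema at http://localhost:7997/docs before relying on them.
rerank_payload = {
    "model": "mixedbread-ai/mxbai-rerank-xsmall-v1",
    "query": "Where is Paris?",
    "documents": ["Paris is in France.", "Berlin is in Germany."],
}
body = json.dumps(rerank_payload)

# With the server running:
#   import requests
#   requests.post("http://localhost:7997/rerank", json=rerank_payload)
print(body)
```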
Examples
Infinity provides a flexible Python API for integrating its capabilities directly into your applications.
Embeddings
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])

async def embed_text(engine: AsyncEmbeddingEngine):
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    await engine.astop()
asyncio.run(embed_text(array[0]))
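The embeddings returned above are plain vectors, and a common next step is comparing them with cosine similarity. A dependency-free sketch, where the toy vectors stand in for real model output:

```python
import math

# Cosine similarity between two embedding vectors (toy sketch; the
# vectors below stand in for the `embeddings` returned by engine.embed).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v_paris = [0.8, 0.1, 0.6]   # toy vector for "Paris is in France."
v_query = [0.7, 0.2, 0.6]   # toy vector for a semantically close sentence
similarity = cosine_similarity(v_paris, v_query)
print(round(similarity, 3))
```

Values close to 1.0 indicate semantically similar sentences; values near 0 indicate unrelated ones.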
Reranking
Reranking scores the relevance of multiple documents against a single query, so you can order the documents from most to least relevant.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
query = "What is the python package infinity_emb?"
docs = [
    "This is a document not related to the python package infinity_emb, hence...",
    "Paris is in France!",
    "infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!",
]
array = AsyncEngineArray.from_args(
    [EngineArgs(model_name_or_path="mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)

async def rerank(engine: AsyncEmbeddingEngine):
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))
    # or handle the async start / stop yourself.
    await engine.astart()
    ranking, usage = await engine.rerank(query=query, docs=docs)
    await engine.astop()
asyncio.run(rerank(array[0]))
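As I understand the API, the returned ranking holds one relevance score per document in input order, so sorting documents by score produces the final ordering. A sketch with toy scores standing in for real output:

```python
# Toy scores standing in for the `ranking` returned by engine.rerank,
# assumed to be one score per document in input order.
docs = [
    "This is a document not related to the python package infinity_emb, hence...",
    "Paris is in France!",
    "infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!",
]
scores = [0.12, 0.03, 0.97]

# Pair each score with its document and sort from most to least relevant.
ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
best_score, best_doc = ranked[0]
print(best_doc)
```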
Image Embeddings: CLIP models
CLIP models encode images and text into a shared embedding space, so the two modalities can be compared directly.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
    model_name_or_path="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M",
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()
asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))
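Because CLIP puts both modalities in one space, an image vector can be scored against candidate captions, which is the basis of zero-shot classification. A toy sketch (the vectors stand in for the engine's outputs, and a plain dot product assumes unit-normalized embeddings, which is typical for CLIP):

```python
import math

# Zero-shot labeling sketch: score one image embedding against several
# caption embeddings and softmax the similarities.  All vectors here are
# toy stand-ins for the outputs of engine.embed / engine.image_embed.
image_vec = [0.95, 0.05, 0.3]
captions = {
    "a photo of two cats": [0.9, 0.1, 0.3],
    "a bored person": [0.1, 0.95, 0.2],
}

# Dot-product similarity per caption (assumes unit-normalized vectors).
scores = {c: sum(x * y for x, y in zip(image_vec, v)) for c, v in captions.items()}

# Softmax turns similarities into a probability-like distribution.
z = sum(math.exp(s) for s in scores.values())
probs = {c: math.exp(s) / z for c, s in scores.items()}
best_caption = max(probs, key=probs.get)
print(best_caption)
```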
Audio Embeddings: CLAP models
CLAP models encode audio and text into a shared embedding space.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import requests
import soundfile as sf
import io
sentences = ["This is awesome.", "I am bored."]
url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
raw_bytes = requests.get(url, stream=True).content
audios = [raw_bytes]
engine_args = EngineArgs(
    model_name_or_path="laion/clap-htsat-unfused",
    dtype="float32",
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embedding_audios = await engine.audio_embed(audios=audios)
    await engine.astop()
asyncio.run(embed(array["laion/clap-htsat-unfused"]))
Text Classification
Use text classification with Infinity's classify feature for sentiment analysis, emotion detection, and other classification tasks.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
    model_name_or_path="SamLowe/roberta-base-go_emotions",
    engine="torch", model_warmup=True)
array = AsyncEngineArray.from_args([engine_args])

async def classifier(engine: AsyncEmbeddingEngine):
    async with engine:
        predictions, usage = await engine.classify(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    predictions, usage = await engine.classify(sentences=sentences)
    await engine.astop()
asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))
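A typical follow-up is extracting the dominant label per sentence. The sketch below assumes the predictions follow the common HuggingFace pipeline shape of label/score dicts per input; the exact output format of engine.classify may differ, so treat the field names as illustrative.

```python
# Toy stand-in for the output of engine.classify on `sentences`,
# assuming one list of {"label", "score"} dicts per input sentence
# (the field names are an assumption, mirroring HF pipelines).
predictions = [
    [{"label": "admiration", "score": 0.92}, {"label": "boredom", "score": 0.01}],
    [{"label": "boredom", "score": 0.88}, {"label": "admiration", "score": 0.02}],
]

# Pick the highest-scoring label for each sentence.
top_labels = [max(preds, key=lambda p: p["score"])["label"] for preds in predictions]
print(top_labels)
```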
Why Use Infinity?
Infinity offers several compelling features for deploying and managing AI models:
- Deploy Any HuggingFace Model: Easily deploy any embedding, reranking, CLIP, or sentence-transformer model available on HuggingFace.
- Fast Inference Backends: Built on PyTorch, Optimum (ONNX/TensorRT), and CTranslate2, Infinity leverages FlashAttention for optimal performance on NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia2, or Apple MPS accelerators. It also uses dynamic batching and dedicated worker threads for tokenization.
- Multi-modal and Multi-model Support: Mix and match multiple models, including multi-modal capabilities for image and audio embeddings. Infinity orchestrates their execution seamlessly.
- Tested Implementation: The project boasts unit and end-to-end testing, ensuring accurate and reliable embeddings.
- Easy to Use: Built on FastAPI, Infinity provides a user-friendly CLI and an OpenAPI-aligned API specification, making it simple to integrate and manage.
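The dynamic batching mentioned above can be illustrated with a toy sketch: individual requests accumulate in a queue, and a worker drains them into one model call per batch rather than running each request alone. This is not Infinity's actual implementation, just the general idea.

```python
import asyncio

# Toy illustration (not Infinity's internals) of dynamic batching:
# queued requests are processed in groups of up to `max_batch`,
# amortizing the cost of each "model call" across several inputs.
async def worker(queue, results, max_batch=4):
    while not queue.empty():
        batch = []
        while not queue.empty() and len(batch) < max_batch:
            batch.append(queue.get_nowait())
        # One fake "model call" for the whole batch (embedding = length).
        results.extend(len(s) for s in batch)
        await asyncio.sleep(0)  # yield to the event loop between batches

async def main():
    queue = asyncio.Queue()
    for s in ["a", "bb", "ccc", "dddd", "eeeee"]:
        queue.put_nowait(s)
    results = []
    await worker(queue, results)
    return results

batched = asyncio.run(main())
print(batched)
```

Here the five requests are served in two batches (four items, then one), while callers still receive one result per request.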
Links
- GitHub Repository: https://github.com/michaelfeil/infinity
- Official Documentation: https://michaelfeil.github.io/infinity
- Python Client: https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client
- Integrations: