{"name":"Infinity: High-Throughput, Low-Latency Serving for Text Embeddings and Reranking","description":"Infinity is a powerful, high-throughput, and low-latency REST API designed for serving various AI models, including text embeddings, reranking, and multi-modal models. It supports deploying any model from HuggingFace with fast inference backends optimized for diverse accelerators. This engine simplifies the deployment and usage of advanced AI models for developers.","github":"https://github.com/michaelfeil/infinity","url":"https://osrepos.com/repo/michaelfeil-infinity","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/michaelfeil-infinity","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/michaelfeil-infinity.md","json":"https://osrepos.com/repo/michaelfeil-infinity.json","topics":["Python","AI","Machine Learning","Embeddings","Reranking","LLM","BERT Embeddings","Inference Engine"],"keywords":["Python","AI","Machine Learning","Embeddings","Reranking","LLM","BERT Embeddings","Inference Engine"],"stars":null,"summary":"Infinity is a powerful, high-throughput, and low-latency REST API designed for serving various AI models, including text embeddings, reranking, and multi-modal models. It supports deploying any model from HuggingFace with fast inference backends optimized for diverse accelerators. This engine simplifies the deployment and usage of advanced AI models for developers.","content":"## Introduction\n\nInfinity is a high-throughput, low-latency REST API designed for serving text embeddings, reranking models, CLIP, CLAP, and ColPali models. Developed by Michael Feil, this Python-based project aims to provide a robust and efficient inference engine for various AI tasks. 
It is released under the MIT License, which permits open and flexible use by developers and researchers.\n\n## Installation\n\nGetting started with Infinity is straightforward, whether you prefer a `pip` installation or using pre-built Docker containers.\n\n### Via pip\n\nFirst, install the package with all its dependencies:\n\nbash\npip install infinity-emb[all]\n\n\nThen, you can launch the CLI directly:\n\nbash\ninfinity_emb v2 --model-id BAAI/bge-small-en-v1.5\n\n\nFor more options, check the help command:\n\nbash\ninfinity_emb v2 --help\n\n\n### Via Docker (Recommended)\n\nFor a more isolated and consistent environment, using Docker is recommended. Ensure you have `nvidia-docker` installed if you plan to use GPUs.\n\nbash\nport=7997\nmodel1=michaelfeil/bge-small-en-v1.5\nmodel2=mixedbread-ai/mxbai-rerank-xsmall-v1\nvolume=$PWD/data\n\ndocker run -it --gpus all \\\n -v $volume:/app/.cache \\\n -p $port:$port \\\n michaelf34/infinity:latest \\\n v2 \\\n --model-id $model1 \\\n --model-id $model2 \\\n --port $port\n\n\n## Examples\n\nInfinity provides a flexible Python API for integrating its capabilities directly into your applications.\n\n### Embeddings\n\npython\nimport asyncio\nfrom infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine\n\nsentences = [\"Embed this sentence via Infinity.\", \"Paris is in France.\"]\narray = AsyncEngineArray.from_args([\n  EngineArgs(model_name_or_path = \"BAAI/bge-small-en-v1.5\", engine=\"torch\", embedding_dtype=\"float32\", dtype=\"auto\")\n])\n\nasync def embed_text(engine: AsyncEmbeddingEngine): \n    async with engine: \n        embeddings, usage = await engine.embed(sentences=sentences)\n    # or handle the async start / stop yourself.\n    await engine.astart()\n    embeddings, usage = await engine.embed(sentences=sentences)\n    await engine.astop()\nasyncio.run(embed_text(array[0]))\n\n\n### Reranking\n\nReranking gives you a score for similarity between a query and multiple 
documents.\n\npython\nimport asyncio\nfrom infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine\nquery = \"What is the python package infinity_emb?\"\ndocs = [\"This is a document not related to the python package infinity_emb, hence...\", \n    \"Paris is in France!\",\n    \"infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!\"]\narray = AsyncEngineArray.from_args(\n  [EngineArgs(model_name_or_path = \"mixedbread-ai/mxbai-rerank-xsmall-v1\", engine=\"torch\")]\n)\n\nasync def rerank(engine: AsyncEmbeddingEngine): \n    async with engine:\n        ranking, usage = await engine.rerank(query=query, docs=docs)\n        print(list(zip(ranking, docs)))\n    # or handle the async start / stop yourself.\n    await engine.astart()\n    ranking, usage = await engine.rerank(query=query, docs=docs)\n    await engine.astop()\n\nasyncio.run(rerank(array[0]))\n\n\n### Image Embeddings: CLIP models\n\nCLIP models are able to encode images and text at the same time.\n\npython\nimport asyncio\nfrom infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine\n\nsentences = [\"This is awesome.\", \"I am bored.\"]\nimages = [\"http://images.cocodataset.org/val2017/000000039769.jpg\"]\nengine_args = EngineArgs(\n    model_name_or_path = \"wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M\", \n    engine=\"torch\"\n)\narray = AsyncEngineArray.from_args([engine_args])\n\nasync def embed(engine: AsyncEmbeddingEngine): \n    await engine.astart()\n    embeddings, usage = await engine.embed(sentences=sentences)\n    embeddings_image, _ = await engine.image_embed(images=images)\n    await engine.astop()\n\nasyncio.run(embed(array[\"wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M\"]))\n\n\n### Audio Embeddings: CLAP models\n\nCLAP models are able to encode audio and text at the same time.\n\npython\nimport asyncio\nfrom infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine\nimport requests\nimport soundfile as 
sf\nimport io\n\nsentences = [\"This is awesome.\", \"I am bored.\"]\n\nurl = \"https://bigsoundbank.com/UPLOAD/wav/2380.wav\"\nraw_bytes = requests.get(url, stream=True).content\n\naudios = [raw_bytes]\nengine_args = EngineArgs(\n    model_name_or_path = \"laion/clap-htsat-unfused\",\n    dtype=\"float32\", \n    engine=\"torch\"\n\n)\narray = AsyncEngineArray.from_args([engine_args])\n\nasync def embed(engine: AsyncEmbeddingEngine): \n    await engine.astart()\n    embeddings, usage = await engine.embed(sentences=sentences)\n    embedding_audios = await engine.audio_embed(audios=audios)\n    await engine.astop()\n\nasyncio.run(embed(array[\"laion/clap-htsat-unfused\"]))\n\n\n### Text Classification\n\nUse text classification with Infinity's `classify` feature, which allows for sentiment analysis, emotion detection, and more classification tasks.\n\npython\nimport asyncio\nfrom infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine\n\nsentences = [\"This is awesome.\", \"I am bored.\"]\nengine_args = EngineArgs(\n    model_name_or_path = \"SamLowe/roberta-base-go_emotions\", \n    engine=\"torch\", model_warmup=True)\narray = AsyncEngineArray.from_args([engine_args])\n\nasync def classifier(engine: AsyncEmbeddingEngine): \n    async with engine:\n        predictions, usage = await engine.classify(sentences=sentences)\n    # or handle the async start / stop yourself.\n    await engine.astart()\n    predictions, usage = await engine.classify(sentences=sentences)\n    await engine.astop()\nasyncio.run(classifier(array[\"SamLowe/roberta-base-go_emotions\"]))\n\n\n## Why Use Infinity?\n\nInfinity offers several compelling features for deploying and managing AI models:\n\n*   **Deploy Any HuggingFace Model**: Easily deploy any embedding, reranking, CLIP, or sentence-transformer model available on HuggingFace.\n*   **Fast Inference Backends**: Built on PyTorch, Optimum (ONNX/TensorRT), and CTranslate2, Infinity leverages FlashAttention for optimal 
performance on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, or Apple MPS accelerators. It also uses dynamic batching and dedicated worker threads.\n*   **Multi-modal and Multi-model Support**: Mix and match multiple models, including multi-modal capabilities for image and audio embeddings. Infinity orchestrates their execution seamlessly.\n*   **Tested Implementation**: The project includes unit and end-to-end tests that check the embeddings for correctness.\n*   **Easy to Use**: Built on FastAPI, Infinity provides a user-friendly CLI and an OpenAPI-aligned API specification, making it simple to integrate and manage.\n\n## Links\n\n*   **GitHub Repository**: [https://github.com/michaelfeil/infinity](https://github.com/michaelfeil/infinity)\n*   **Official Documentation**: [https://michaelfeil.github.io/infinity](https://michaelfeil.github.io/infinity)\n*   **Python Client**: [https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client](https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client)\n*   **Integrations**:\n    *   [Runpod Serverless Deployments](https://github.com/runpod-workers/worker-infinity-embedding)\n    *   [Truefoundry Cognita](https://github.com/truefoundry/cognita)\n    *   [Langchain Example](https://python.langchain.com/docs/integrations/text_embedding/infinity)\n    *   [imitater - Unified Language Model Server](https://github.com/the-seeds/imitater)\n    *   [Dwarves Foundation: LLM Hosting Examples](https://github.com/dwarvesf/llm-hosting)","metrics":{"detailViews":2,"githubClicks":4},"dates":{"published":null,"modified":"2026-03-17T16:56:30.000Z"}}