RouteLLM: Optimize LLM Costs and Maintain Quality with Intelligent Routing
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
RouteLLM is a framework for serving and evaluating LLM routers. It routes simpler queries to cheaper models and reserves stronger models for harder ones, cutting costs while keeping response quality close to that of the stronger model. It can be used as a drop-in replacement for an existing OpenAI client or run as an OpenAI-compatible server.
Introduction
RouteLLM is a framework for serving and evaluating Large Language Model (LLM) routers. It addresses a common dilemma in LLM deployment: powerful models such as GPT-4 are expensive, while cheaper alternatives may produce lower-quality responses. RouteLLM routes simpler queries to smaller, more cost-effective models, reducing operating costs while maintaining response quality.
The project reports cost reductions of up to 85% while preserving 95% of GPT-4's performance on widely used benchmarks. It also reports performance comparable to commercial offerings at a substantially lower cost.
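To make the cost argument concrete, here is a minimal arithmetic sketch of how routing changes the average price per query. The function names and the prices are hypothetical placeholders, not figures from RouteLLM or any provider, and the sketch assumes every query consumes the same number of tokens.

```python
def blended_cost(strong_fraction: float, strong_price: float, weak_price: float) -> float:
    """Average price per query when `strong_fraction` of queries go to the strong model.

    Prices are hypothetical and expressed in the same arbitrary unit.
    """
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    return strong_fraction * strong_price + (1.0 - strong_fraction) * weak_price


def savings_vs_strong_only(strong_fraction: float, strong_price: float, weak_price: float) -> float:
    """Fraction of cost saved relative to sending every query to the strong model."""
    routed = blended_cost(strong_fraction, strong_price, weak_price)
    return 1.0 - routed / strong_price


# Hypothetical prices: 10 units per query for the strong model, 1 unit for the weak one.
average_price = blended_cost(0.5, 10.0, 1.0)
saved = savings_vs_strong_only(0.5, 10.0, 1.0)
```

With these made-up prices, sending half of the queries to the weak model saves roughly 45% compared with using the strong model for everything. Actual savings depend on real prices and on how many queries the router can safely hand to the weaker model.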
For more details, you can refer to the official blog post and the research paper.
Installation
Getting started with RouteLLM is straightforward. You can install it via PyPI or directly from the source.
From PyPI

    pip install "routellm[serve,eval]"

From source

    git clone https://github.com/lm-sys/RouteLLM.git
    cd RouteLLM
    pip install -e ".[serve,eval]"
Examples
RouteLLM offers flexible ways to integrate LLM routing into your applications, either by replacing an existing OpenAI client or by launching an OpenAI-compatible server.
Python Client Replacement
Here's a quick walkthrough on how to replace your existing OpenAI client to route queries between LLMs using RouteLLM.
- Initialize the Controller: Replace your OpenAI client by initializing the RouteLLM controller with a router, for example the mf router.

      import os
      from routellm.controller import Controller

      os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
      # Replace with your model provider, we use Anyscale's Mixtral here.
      os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"

      client = Controller(
          routers=["mf"],
          strong_model="gpt-4-1106-preview",
          weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
      )

  You can customize the strong and weak models, as well as their providers.

- Calibrate the Cost Threshold: Each routing request uses a cost threshold to control the tradeoff between cost and quality. Calibrate this threshold based on your specific query types. For instance, to calibrate for 50% GPT-4 calls using Chatbot Arena data:

      python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.5 --config config.example.yaml

  This command will output the recommended threshold value.

- Make a Routed Request: Update the model field in your completion requests to specify the router and the calibrated threshold.

      response = client.chat.completions.create(
          # This tells RouteLLM to use the MF router with a cost threshold of 0.11593
          model="router-mf-0.11593",
          messages=[
              {"role": "user", "content": "Hello!"}
          ]
      )

  This setup ensures requests are routed dynamically, saving costs while maintaining high response quality.
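The model value in the routed request follows the pattern router-, then the router name, then the threshold. As a small sketch, the helpers below build and parse such strings so the threshold is not hand-typed in several places. Both function names are hypothetical and are not part of RouteLLM; the five-decimal formatting is an assumption chosen to match the example value 0.11593.

```python
from typing import Tuple

ROUTER_PREFIX = "router-"


def format_router_model(router: str, threshold: float) -> str:
    """Build a model string such as 'router-mf-0.11593' (hypothetical helper).

    Fixed-point formatting avoids scientific notation, whose '-' sign
    (e.g. '1e-05') would make the string ambiguous to parse.
    """
    return f"{ROUTER_PREFIX}{router}-{threshold:.5f}"


def parse_router_model(model: str) -> Tuple[str, float]:
    """Split a model string back into (router name, threshold)."""
    if not model.startswith(ROUTER_PREFIX):
        raise ValueError(f"not a router model string: {model!r}")
    # Split on the last '-' so the threshold is always the final component.
    router, sep, threshold = model[len(ROUTER_PREFIX):].rpartition("-")
    if not sep or not router:
        raise ValueError(f"missing router name or threshold: {model!r}")
    return router, float(threshold)


model_name = format_router_model("mf", 0.11593)
router_name, threshold_value = parse_router_model(model_name)
```

The resulting string can be passed as the model argument of the completion request shown in the last step.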
Server & Demo
Alternatively, you can launch an OpenAI-compatible server that works with any existing OpenAI client.
- Launch the Server:

      export OPENAI_API_KEY=sk-XXXXXX
      export ANYSCALE_API_KEY=esecret_XXXXXX
      python -m routellm.openai_server --routers mf --strong-model gpt-4-1106-preview --weak-model anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1

  The server will start on http://0.0.0.0:6060.

- Start a Local Router Chatbot Demo:

      python -m examples.router_chat --router mf --threshold 0.11593

  This allows you to interact with the router and observe how different messages are handled.
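Because the server is OpenAI-compatible, any HTTP client can talk to it. The sketch below builds a chat completion request with only the Python standard library. It assumes the server is reachable at http://localhost:6060 and exposes the usual OpenAI-style /v1/chat/completions path; both are assumptions to verify against your deployment, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Prepare (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply; requires a running server."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Assumed base URL; the router and threshold are encoded in the model name.
request = build_chat_request("http://localhost:6060/v1", "router-mf-0.11593", "Hello!")
```

Calling send_chat_request(request) only works once the server from the first step is running. An existing OpenAI client pointed at the same base URL should behave equivalently.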
Why Use RouteLLM
RouteLLM provides compelling advantages for anyone deploying LLMs in production:
- Significant Cost Savings: Reduce costs by up to 85% by routing each query to the cheapest model that can handle it.
- High Performance: Retain 95% of GPT-4's performance on widely used benchmarks.
- OpenAI Client Compatibility: Seamlessly integrate RouteLLM into existing applications as a drop-in replacement for OpenAI's client or by using its OpenAI-compatible server.
- Extensive Model Support: Leverage LiteLLM to support a wide range of open-source and closed models from various providers, including local models via Ollama.
- Pre-trained Routers: Benefit from out-of-the-box trained routers, with the mf router being highly recommended for its strength and lightweight nature. These routers generalize well to different model pairs.
- Customizable Routing Strategies: Easily extend the framework to include new routers and compare their performance across multiple benchmarks.
- Threshold Calibration: Fine-tune the cost-quality tradeoff by calibrating routing thresholds based on your specific dataset and desired strong model call percentage.
- Comprehensive Evaluation Framework: Evaluate different routing strategies on benchmarks like MMLU, GSM8K, and MT-Bench to ensure optimal performance.
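The threshold calibration idea from the list above can be illustrated with a short sketch. It assumes that a router assigns each query a score and sends the query to the strong model when the score is at or above the threshold; under that assumption, calibrating for a target strong-model percentage amounts to picking a quantile of the scores on a sample dataset. The function names and scores are hypothetical, and this is not RouteLLM's own calibration code.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], strong_model_pct: float) -> float:
    """Pick a threshold so that about `strong_model_pct` of scores are >= it."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < strong_model_pct <= 1.0:
        raise ValueError("strong_model_pct must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of sample queries that should reach the strong model.
    strong_count = max(1, round(strong_model_pct * len(ranked)))
    return ranked[strong_count - 1]


def route(score: float, threshold: float) -> str:
    """Route to the strong model when the score reaches the threshold."""
    return "strong" if score >= threshold else "weak"


# Hypothetical router scores for four sample queries.
sample_scores = [0.1, 0.9, 0.4, 0.7]
threshold = calibrate_threshold(sample_scores, strong_model_pct=0.5)
decisions = [route(score, threshold) for score in sample_scores]
```

With tied scores, slightly more than the target share may reach the strong model. A threshold calibrated on one dataset only holds for traffic with a similar score distribution, which is why calibrating on your own query types matters.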
Links
- GitHub Repository: lm-sys/RouteLLM
- Official Blog Post: RouteLLM Blog
- Research Paper: RouteLLM Paper