RouteLLM: Optimize LLM Costs and Maintain Quality with Intelligent Routing

Introduction

RouteLLM is a framework for serving and evaluating Large Language Model (LLM) routers. It addresses a common dilemma in LLM deployment: powerful models like GPT-4 are expensive, while cheaper alternatives may produce lower-quality responses. RouteLLM routes simpler queries to smaller, cheaper models, which reduces operating costs while keeping response quality high.

In reported results, the framework reduces LLM costs by up to 85% while preserving 95% of GPT-4's performance on widely used benchmarks. It also achieves performance comparable to commercial offerings at a substantially lower cost.

For more details, you can refer to the official blog post and the research paper.

Installation

Getting started with RouteLLM is straightforward. You can install it via PyPI or directly from the source.

From PyPI

pip install "routellm[serve,eval]"

From source

git clone https://github.com/lm-sys/RouteLLM.git
cd RouteLLM
pip install -e ".[serve,eval]"

Examples

RouteLLM offers flexible ways to integrate LLM routing into your applications, either by replacing an existing OpenAI client or by launching an OpenAI-compatible server.

Python Client Replacement

Here's a quick walkthrough on how to replace your existing OpenAI client to route queries between LLMs using RouteLLM.

Initialize the Controller: Replace your OpenAI client by initializing the RouteLLM controller with a router, for example, the mf router.

import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
# Replace with your model provider; this example uses Mixtral served by Anyscale.
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"

client = Controller(
  routers=["mf"],
  strong_model="gpt-4-1106-preview",
  weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

You can customize the strong and weak models, as well as their providers.

Calibrate the Cost Threshold: Each routing request uses a cost threshold that controls the tradeoff between cost and quality. Calibrate this threshold on the kinds of queries you expect to receive. For instance, to calibrate the mf router so that about 50% of calls go to GPT-4, using Chatbot Arena data:
```
python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.5 --config config.example.yaml
```
This command will output the recommended threshold value.

Make a Routed Request: Update the model field in your completion requests to specify the router and the calibrated threshold.
```
response = client.chat.completions.create(
  # This tells RouteLLM to use the MF router with a cost threshold of 0.11593
  model="router-mf-0.11593",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)
```
With this setup, each request is routed to either the strong or the weak model according to the router and the threshold, which saves cost while keeping response quality high.
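
The role of the threshold in the steps above can be illustrated with a small sketch. It assumes the router assigns each query a score between 0 and 1 and sends the query to the strong model when the score is at or above the threshold. The function names and sample scores are hypothetical, and this is not RouteLLM's actual calibration code.

```python
from typing import List, Sequence


def calibrate_threshold(scores: Sequence[float], strong_model_pct: float) -> float:
    """Pick a threshold so roughly `strong_model_pct` of `scores` are at or above it."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < strong_model_pct <= 1.0:
        raise ValueError("strong_model_pct must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of queries that should reach the strong model (at least one).
    k = max(1, round(strong_model_pct * len(ranked)))
    return ranked[k - 1]


def route(score: float, threshold: float) -> str:
    """Assumed decision rule: scores at or above the threshold go to the strong model."""
    return "strong" if score >= threshold else "weak"


# Hypothetical router scores for six sample queries.
sample_scores: List[float] = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
threshold = calibrate_threshold(sample_scores, strong_model_pct=0.5)
decisions = [route(s, threshold) for s in sample_scores]
print(threshold, decisions)
```

With these six sample scores and a 50% target, the sketch picks 0.4 as the threshold, so three of the six queries go to the strong model. Raising the threshold sends fewer queries to the strong model and lowers cost; lowering it does the opposite.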

Server & Demo

Alternatively, you can launch an OpenAI-compatible server that works with any existing OpenAI client.

Launch the Server:

export OPENAI_API_KEY=sk-XXXXXX
export ANYSCALE_API_KEY=esecret_XXXXXX
python -m routellm.openai_server --routers mf --strong-model gpt-4-1106-preview --weak-model anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1

The server will start on http://0.0.0.0:6060.
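
Because the server is OpenAI-compatible, a request to it can be sketched with only the Python standard library. The example below assumes the server is reachable at http://localhost:6060 and serves the standard OpenAI chat completions path under /v1. The helper names build_chat_request and send are hypothetical, and the model string follows the router-mf-0.11593 pattern used earlier.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    base_url: str,
    router: str,
    threshold: float,
    messages: List[Dict[str, str]],
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for a routed model."""
    payload = {
        # Same naming pattern as the client example: router-<name>-<threshold>.
        "model": f"router-{router}-{threshold}",
        "messages": messages,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


request = build_chat_request(
    base_url="http://localhost:6060/v1",  # assumed local address and path prefix
    router="mf",
    threshold=0.11593,
    messages=[{"role": "user", "content": "Hello!"}],
)
# reply = send(request)  # uncomment once the server is running
```

Any existing OpenAI client pointed at the same base URL should be able to send the equivalent request.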

Start a Local Router Chatbot Demo:
```
python -m examples.router_chat --router mf --threshold 0.11593
```
This allows you to interact with the router and observe how different messages are handled.

Why Use RouteLLM

RouteLLM provides compelling advantages for anyone deploying LLMs in production:

Significant Cost Savings: Reduce costs by up to 85% by routing each query to the most appropriate model.
High Performance: Preserve 95% of GPT-4's performance on widely used benchmarks.
OpenAI Client Compatibility: Seamlessly integrate RouteLLM into existing applications as a drop-in replacement for OpenAI's client or by using its OpenAI-compatible server.
Extensive Model Support: Leverage LiteLLM to support a wide range of open-source and closed models from various providers, including local models via Ollama.
Pre-trained Routers: Use trained routers out of the box. The mf router is recommended because it is both strong and lightweight. These routers generalize well to different model pairs.
Customizable Routing Strategies: Easily extend the framework to include new routers and compare their performance across multiple benchmarks.
Threshold Calibration: Fine-tune the cost-quality tradeoff by calibrating routing thresholds based on your specific dataset and desired strong model call percentage.
Comprehensive Evaluation Framework: Evaluate different routing strategies on benchmarks like MMLU, GSM8K, and MT-Bench to ensure optimal performance.
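
To make the link between the strong model call percentage and cost concrete, here is a back-of-the-envelope sketch. The prices are hypothetical placeholders, not real provider pricing, and the calculation assumes both models process a similar number of tokens per query.

```python
def blended_cost(strong_pct: float, strong_price: float, weak_price: float) -> float:
    """Average price per unit of traffic when `strong_pct` of calls use the strong model."""
    if not 0.0 <= strong_pct <= 1.0:
        raise ValueError("strong_pct must be in [0, 1]")
    return strong_pct * strong_price + (1.0 - strong_pct) * weak_price


def savings_vs_strong_only(strong_pct: float, strong_price: float, weak_price: float) -> float:
    """Fraction of cost saved compared with sending every call to the strong model."""
    return 1.0 - blended_cost(strong_pct, strong_price, weak_price) / strong_price


# Hypothetical prices per million tokens.
STRONG_PRICE = 10.0
WEAK_PRICE = 0.5

for pct in (0.5, 0.2):
    saved = savings_vs_strong_only(pct, STRONG_PRICE, WEAK_PRICE)
    print(f"strong calls: {pct:.0%}, estimated savings: {saved:.1%}")
```

Under these assumed prices, the savings grow as the calibrated threshold sends a smaller share of calls to the strong model. How much quality is retained at each setting has to be measured separately with the evaluation framework.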