judges: A Python Library for LLM-as-a-Judge Evaluators

Summary
The `judges` library from Databricks provides a concise and powerful way to use and create LLM-as-a-Judge evaluators. It offers a curated set of pre-built judges for various use cases, backed by research, and supports both off-the-shelf usage and custom judge creation. This tool helps developers effectively evaluate the performance and quality of their Large Language Models.
Introduction
The judges library, developed by Databricks, is a compact yet powerful Python tool designed for using and creating LLM-as-a-Judge evaluators. Its primary goal is to provide a curated collection of research-backed LLM evaluators in a low-friction format, suitable for a wide range of use cases. Whether you need off-the-shelf evaluation capabilities or inspiration to build your own custom LLM evaluators, judges offers a robust solution. The library supports two main types of judges: Classifiers, which return boolean values, and Graders, which provide numerical or Likert scale scores. It also features a Jury object for combining multiple judges and AutoJudge for creating custom, task-specific evaluators from datasets.
Installation
Getting started with judges is straightforward. You can install the core library using pip:
pip install judges
If you plan to use the AutoJudge feature for creating custom LLM judges, install it with the auto extra:
pip install "judges[auto]"
Examples
Using a Classifier Judge
judges provides various pre-built classifiers to evaluate model outputs. Here's an example using PollMultihopCorrectness to check if a model's response is factually correct:
from openai import OpenAI
from judges.classifiers.correctness import PollMultihopCorrectness
client = OpenAI()
question = "What is the name of the rabbit in the following story? Respond with 'I don't know' if you don't know."
story = """
Fig was a small, scruffy dog with a big personality. He lived in a quiet little town where everyone knew his name. Fig loved adventures, and every day he would roam the neighborhood, wagging his tail and sniffing out new things to explore.
One day, Fig discovered a mysterious trail of footprints leading into the woods. Curiosity got the best of him, and he followed them deep into the trees. As he trotted along, he heard rustling in the bushes and suddenly, out popped a rabbit! The rabbit looked at Fig with wide eyes and darted off.
But instead of chasing it, Fig barked in excitement, as if saying, "Nice to meet you!" The rabbit stopped, surprised, and came back. They sat together for a moment, sharing the calm of the woods.
From that day on, Fig had a new friend. Every afternoon, the two of them would meet in the same spot, enjoying the quiet companionship of an unlikely friendship. Fig's adventurous heart had found a little peace in the simple joy of being with his new friend.
"""
input_prompt = f'{story}\n\nQuestion:{question}'
expected = "I don't know"
output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {'role': 'user', 'content': input_prompt},
    ],
).choices[0].message.content
correctness = PollMultihopCorrectness(model='openai/gpt-4o-mini')
judgment = correctness.judge(
    input=input_prompt,
    output=output,
    expected=expected,
)
print(judgment.reasoning)
print(judgment.score)
Combining Judges with Jury
For more diverse and robust evaluations, you can combine multiple judges using the Jury object. This allows for averaging or diversifying judgments.
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness
# Assuming 'input_prompt', 'output', and 'expected' are defined as in the previous example
poll = PollMultihopCorrectness(model='openai/gpt-4o')
raft = RAFTCorrectness(model='openai/gpt-4o-mini')
jury = Jury(judges=[poll, raft], voting_method="average")
verdict = jury.vote(
    input=input_prompt,
    output=output,
    expected=expected,
)
print(verdict.score)
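The `voting_method="average"` option aggregates the individual judges' scores into a single verdict. As a rough standalone illustration of what averaging boolean judgments means (this sketch is not the library's internal code), each True/False vote can be treated as 1/0 and the mean taken:

```python
# Standalone sketch of average voting over boolean judgments.
# The Jury class implements its own aggregation; this only illustrates the idea.

def average_vote(scores: list[bool]) -> float:
    """Treat each boolean judgment as 0 or 1 and return the mean."""
    return sum(1 if s else 0 for s in scores) / len(scores)

# Two judges agree the answer is correct, one disagrees:
print(average_vote([True, True, False]))  # 2/3 of the judges voted True
```

A fractional score like this is useful as a confidence signal: unanimous verdicts sit at 0.0 or 1.0, while disagreement among judges lands in between.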
Creating Custom Judges with AutoJudge
AutoJudge simplifies the creation of custom, task-specific LLM judges by leveraging a labeled dataset and a natural language description of the evaluation task.
from judges.classifiers.auto import AutoJudge
dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a good hotel in Tokyo?",
        "output": "Certainly! Hotel Sunroute Plaza Shinjuku is highly rated for its location and amenities. It offers comfortable rooms and excellent service.",
        "label": 1,
        "feedback": "Offers a specific and helpful recommendation.",
    },
]
task = "Evaluate responses for accuracy, clarity, and helpfulness."
autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="openai/gpt-4-turbo-2024-04-09",
)
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."
judgment = autojudge.judge(input=input_, output=output)
print(judgment.reasoning)
print(judgment.score)
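As the example shows, each dataset record carries an `input`, an `output`, a binary `label`, and a `feedback` string. A small hypothetical validator (not part of the `judges` library) can catch malformed records before they reach `from_dataset`:

```python
# Hypothetical helper (not part of the judges library) that checks an
# AutoJudge-style dataset has the fields shown in the example above.

REQUIRED_KEYS = {"input", "output", "label", "feedback"}

def validate_dataset(dataset: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data looks OK."""
    problems = []
    for i, record in enumerate(dataset):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
        elif record["label"] not in (0, 1):
            problems.append(f"record {i}: label must be 0 or 1, got {record['label']!r}")
    return problems

print(validate_dataset([
    {"input": "q", "output": "a", "label": 1, "feedback": "ok"},
]))  # an empty list: the record is well-formed
```

Failing fast on malformed records is cheaper than discovering them after a judge has been generated from the dataset.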
Building a Custom Judge
For complete control, you can define your own judge by inheriting from BaseJudge and implementing the .judge() method. This allows you to specify custom evaluation criteria and prompts.
from textwrap import dedent
from judges.base import BaseJudge, Judgment
class PolitenessJudge(BaseJudge):
    """
    A judge that evaluates the politeness and respectfulness of model responses.
    """
    def judge(
        self,
        input: str,
        output: str = None,
        expected: str = None,
    ) -> Judgment:
        system_prompt = "You are an expert in communication and social etiquette."
        user_prompt = dedent(
            f"""
            Evaluate whether the following response is polite and respectful.
            Original question: {input}
            Response to evaluate: {output}
            Consider factors like:
            - Use of courteous language
            - Respectful tone
            - Appropriate level of formality
            - Absence of rude or dismissive language
            Return "True" if the response is polite and respectful, "False" otherwise.
            """
        )
        reasoning, score = self._judge(
            user_prompt=user_prompt,
            system_prompt=system_prompt,
        )
        return Judgment(reasoning=reasoning, score=score, score_type="boolean")
# Example usage:
politeness_judge = PolitenessJudge(model='openai/gpt-4o-mini')
judgment = politeness_judge.judge(
    input="Can you help me with my homework?",
    output="Sure! I'd be happy to help you with your homework. What subject are you working on?",
)
print(judgment.reasoning)
print(judgment.score)
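In practice a judge is usually run over many examples, not one. Since any judge exposes a `.judge(input=..., output=...)` method returning a `Judgment` with a `score`, a small hypothetical helper can turn a batch of boolean judgments into a pass rate. The stub judge and the `cases` data below are made up for illustration; in real use you would pass an instance like `PolitenessJudge` from above:

```python
# Hypothetical batch-evaluation helper (not part of the judges library).
# Any object with a .judge(input=..., output=...) method returning something
# with a boolean .score attribute will work.

from dataclasses import dataclass

@dataclass
class FakeJudgment:
    """Stand-in for a Judgment, carrying just the fields this sketch needs."""
    reasoning: str
    score: bool

class AlwaysPoliteJudge:
    """Stub judge used here so the sketch runs without any API calls."""
    def judge(self, input: str, output: str) -> FakeJudgment:
        return FakeJudgment(reasoning="stubbed", score=True)

def pass_rate(judge, cases: list[dict]) -> float:
    """Fraction of cases the judge scores as True."""
    results = [judge.judge(input=c["input"], output=c["output"]).score for c in cases]
    return sum(results) / len(results)

cases = [
    {"input": "Can you help?", "output": "Sure, happy to help!"},
    {"input": "What time is it?", "output": "It's 3 PM."},
]
print(pass_rate(AlwaysPoliteJudge(), cases))  # 1.0
```

Tracking a pass rate like this over successive model versions gives a simple regression signal for qualities such as politeness.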
CLI Usage
The judges library also offers a command-line interface for quick evaluations, supporting both single and batch processing.
judges PollMultihopCorrectness -m gpt-4o-mini -i '[
  {
    "input": "What is the capital of France?",
    "output": "The capital of France is Madrid.",
    "expected": "The capital of France is Paris."
  },
  {
    "input": "What is the capital of Germany?",
    "output": "The capital of Germany is Paris.",
    "expected": "The capital of Germany is Berlin."
  }
]'
Why Use judges?
The judges library streamlines the complex task of evaluating Large Language Models. By providing a structured, research-backed approach, it helps developers ensure their LLM applications are performing as expected. Its key advantages include:
- Curated Evaluators: Access to a set of pre-built, research-backed LLM judges for common evaluation tasks like factual correctness, hallucination, and harmfulness.
- Flexibility: Easily use off-the-shelf judges, combine them into a Jury for diversified results, or create highly customized evaluators with AutoJudge or by extending the BaseJudge class.
- Structured Output: Judges return Judgment objects with clear reasoning and scores, facilitating automated analysis and feedback loops.
- Ease of Use: Simple installation and a clear API make it accessible for developers of all experience levels.
- CLI Support: Integrate LLM evaluation into your workflows directly from the command line.
Links
- GitHub Repository: databricks/judges
- Discord Community: Join the Discord