judges: A Python Library for LLM-as-a-Judge Evaluators

Summary
The `judges` library from Databricks provides a concise and powerful way to use and create LLM-as-a-Judge evaluators. It offers a curated set of pre-built judges for various use cases, backed by research, and supports both off-the-shelf usage and custom judge creation. This tool helps developers effectively evaluate the performance and quality of their Large Language Models.
Introduction
The judges library, developed by Databricks, is a compact yet powerful Python tool designed for using and creating LLM-as-a-Judge evaluators. Its primary goal is to provide a curated collection of research-backed LLM evaluators in a low-friction format, suitable for a wide range of use cases. Whether you need off-the-shelf evaluation capabilities or inspiration to build your own custom LLM evaluators, judges offers a robust solution. The library supports two main types of judges: Classifiers, which return boolean values, and Graders, which provide numerical or Likert scale scores. It also features a Jury object for combining multiple judges and AutoJudge for creating custom, task-specific evaluators from datasets.
Installation
Getting started with judges is straightforward. You can install the core library using pip:
pip install judges
If you plan to use the AutoJudge feature for creating custom LLM judges, install it with the auto extra:
pip install "judges[auto]"
Examples
Using a Classifier Judge
judges provides various pre-built classifiers to evaluate model outputs. Here's an example using PollMultihopCorrectness to check if a model's response is factually correct:
from openai import OpenAI
from judges.classifiers.correctness import PollMultihopCorrectness
client = OpenAI()
question = "What is the name of the rabbit in the following story? Respond with 'I don't know' if you don't know."
story = """
Fig was a small, scruffy dog with a big personality. He lived in a quiet little town where everyone knew his name. Fig loved adventures, and every day he would roam the neighborhood, wagging his tail and sniffing out new things to explore.
One day, Fig discovered a mysterious trail of footprints leading into the woods. Curiosity got the best of him, and he followed them deep into the trees. As he trotted along, he heard rustling in the bushes and suddenly, out popped a rabbit! The rabbit looked at Fig with wide eyes and darted off.
But instead of chasing it, Fig barked in excitement, as if saying, "Nice to meet you!" The rabbit stopped, surprised, and came back. They sat together for a moment, sharing the calm of the woods.
From that day on, Fig had a new friend. Every afternoon, the two of them would meet in the same spot, enjoying the quiet companionship of an unlikely friendship. Fig's adventurous heart had found a little peace in the simple joy of being with his new friend.
"""
input_prompt = f'{story}\n\nQuestion:{question}'
expected = "I don't know"
output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {'role': 'user', 'content': input_prompt},
    ],
).choices[0].message.content
correctness = PollMultihopCorrectness(model='openai/gpt-4o-mini')
judgment = correctness.judge(
    input=input_prompt,
    output=output,
    expected=expected,
)
print(judgment.reasoning)
print(judgment.score)
Combining Judges with Jury
For more diverse and robust evaluations, you can combine multiple judges using the Jury object. This allows for averaging or diversifying judgments.
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness
# Assuming 'input_prompt', 'output', and 'expected' are defined as in the previous example
poll = PollMultihopCorrectness(model='openai/gpt-4o')
raft = RAFTCorrectness(model='openai/gpt-4o-mini')
jury = Jury(judges=[poll, raft], voting_method="average")
verdict = jury.vote(
    input=input_prompt,
    output=output,
    expected=expected,
)
print(verdict.score)
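The `voting_method="average"` option aggregates the individual judges' scores into a single verdict. As a rough standalone illustration of what averaging boolean judgments means (this sketch is not the library's internal code), each True/False vote can be treated as 1/0 and the mean taken:

```python
# Standalone sketch of average voting over boolean judgments.
# The Jury class implements its own aggregation; this only illustrates the idea.

def average_vote(scores: list[bool]) -> float:
    """Treat each boolean judgment as 0 or 1 and return the mean."""
    return sum(1 if s else 0 for s in scores) / len(scores)

# Two judges agree the answer is correct, one disagrees:
print(average_vote([True, True, False]))  # 2/3 of the judges voted True
```

A fractional score like this is useful as a confidence signal: unanimous verdicts sit at 0.0 or 1.0, while disagreement among judges lands in between.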
Creating Custom Judges with AutoJudge
AutoJudge simplifies the creation of custom, task-specific LLM judges by leveraging a labeled dataset and a natural language description of the evaluation task.
from judges.classifiers.auto import AutoJudge
dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a good hotel in Tokyo?",
        "output": "Certainly! Hotel Sunroute Plaza Shinjuku is highly rated for its location and amenities. It offers comfortable rooms and excellent service.",
        "label": 1,
        "feedback": "Offers a specific and helpful recommendation.",
    },
]
task = "Evaluate responses for accuracy, clarity, and helpfulness."
autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="openai/gpt-4-turbo-2024-04-09",
)
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."
judgment = autojudge.judge(input=input_, output=output)
print(judgment.reasoning)
print(judgment.score)
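As the example shows, each dataset record carries an `input`, an `output`, a binary `label`, and a `feedback` string. A small hypothetical validator (not part of the `judges` library) can catch malformed records before they reach `from_dataset`:

```python
# Hypothetical helper (not part of the judges library) that checks an
# AutoJudge-style dataset has the fields shown in the example above.

REQUIRED_KEYS = {"input", "output", "label", "feedback"}

def validate_dataset(dataset: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data looks OK."""
    problems = []
    for i, record in enumerate(dataset):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
        elif record["label"] not in (0, 1):
            problems.append(f"record {i}: label must be 0 or 1, got {record['label']!r}")
    return problems

print(validate_dataset([
    {"input": "q", "output": "a", "label": 1, "feedback": "ok"},
]))  # an empty list: the record is well-formed
```

Failing fast on malformed records is cheaper than discovering them after a judge has been generated from the dataset.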
Building a Custom Judge
For complete control, you can define your own judge by inheriting from BaseJudge and implementing the .judge() method. This allows you to specify custom evaluation criteria and prompts.
from textwrap import dedent
from judges.base import BaseJudge, Judgment
class PolitenessJudge(BaseJudge):
    """
    A judge that evaluates the politeness and respectfulness of model responses.
    """
    def judge(
        self,
        input: str,
        output: str = None,
        expected: str = None,
    ) -> Judgment:
        system_prompt = "You are an expert in communication and social etiquette."
        user_prompt = dedent(
            f"""
            Evaluate whether the following response is polite and respectful.
            Original question: {input}
            Response to evaluate: {output}
            Consider factors like:
            - Use of courteous language
            - Respectful tone
            - Appropriate level of formality
            - Absence of rude or dismissive language
            Return "True" if the response is polite and respectful, "False" otherwise.
            """
        )
        reasoning, score = self._judge(
            user_prompt=user_prompt,
            system_prompt=system_prompt,
        )
        return Judgment(reasoning=reasoning, score=score, score_type="boolean")
# Example usage:
politeness_judge = PolitenessJudge(model='openai/gpt-4o-mini')
judgment = politeness_judge.judge(
    input="Can you help me with my homework?",
    output="Sure! I'd be happy to help you with your homework. What subject are you working on?",
)
print(judgment.reasoning)
print(judgment.score)
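In practice a judge is usually run over many examples, not one. Since any judge exposes a `.judge(input=..., output=...)` method returning a `Judgment` with a `score`, a small hypothetical helper can turn a batch of boolean judgments into a pass rate. The stub judge and the `cases` data below are made up for illustration; in real use you would pass an instance like `PolitenessJudge` from above:

```python
# Hypothetical batch-evaluation helper (not part of the judges library).
# Any object with a .judge(input=..., output=...) method returning something
# with a boolean .score attribute will work.

from dataclasses import dataclass

@dataclass
class FakeJudgment:
    """Stand-in for a Judgment, carrying just the fields this sketch needs."""
    reasoning: str
    score: bool

class AlwaysPoliteJudge:
    """Stub judge used here so the sketch runs without any API calls."""
    def judge(self, input: str, output: str) -> FakeJudgment:
        return FakeJudgment(reasoning="stubbed", score=True)

def pass_rate(judge, cases: list[dict]) -> float:
    """Fraction of cases the judge scores as True."""
    results = [judge.judge(input=c["input"], output=c["output"]).score for c in cases]
    return sum(results) / len(results)

cases = [
    {"input": "Can you help?", "output": "Sure, happy to help!"},
    {"input": "What time is it?", "output": "It's 3 PM."},
]
print(pass_rate(AlwaysPoliteJudge(), cases))  # 1.0
```

Tracking a pass rate like this over successive model versions gives a simple regression signal for qualities such as politeness.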
CLI Usage
The judges library also offers a command-line interface for quick evaluations, supporting both single and batch processing.
judges PollMultihopCorrectness -m gpt-4o-mini -i '[
  {
    "input": "What is the capital of France?",
    "output": "The capital of France is Madrid.",
    "expected": "The capital of France is Paris."
  },
  {
    "input": "What is the capital of Germany?",
    "output": "The capital of Germany is Paris.",
    "expected": "The capital of Germany is Berlin."
  }
]'
Why Use judges?
The judges library streamlines the complex task of evaluating Large Language Models. By providing a structured, research-backed approach, it helps developers ensure their LLM applications are performing as expected. Its key advantages include:
- Curated Evaluators: Access to a set of pre-built, research-backed LLM judges for common evaluation tasks like factual correctness, hallucination, and harmfulness.
- Flexibility: Easily use off-the-shelf judges, combine them into a Jury for diversified results, or create highly customized evaluators with AutoJudge or by extending the BaseJudge class.
- Structured Output: Judges return Judgment objects with clear reasoning and scores, facilitating automated analysis and feedback loops.
- Ease of Use: Simple installation and a clear API make it accessible for developers of all experience levels.
- CLI Support: Integrate LLM evaluation into your workflows directly from the command line.
Links
- GitHub Repository: databricks/judges
- Discord Community: Join the Discord