Repository History
Explore all analyzed open source repositories

judges: A Python Library for LLM-as-a-Judge Evaluators
The `judges` library is a small, focused Python toolkit for using and creating LLM-as-a-Judge evaluators. It ships a curated set of research-backed, pre-built judges for common use cases and supports both off-the-shelf usage and custom judge creation, helping developers evaluate the output quality of their Large Language Models.
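To make the LLM-as-a-Judge pattern concrete, here is a minimal, self-contained sketch of a custom correctness judge. This illustrates the general technique only; the class name, prompt, and `complete` callable are hypothetical placeholders, not the `judges` library's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Judgment:
    score: bool      # pass/fail verdict from the judge
    reasoning: str   # the judge's explanation

class CorrectnessJudge:
    """Hypothetical minimal judge: grade an output against an expected answer."""

    PROMPT = (
        "You are an impartial judge. Given a question, a model output, and an "
        "expected answer, decide whether the output is correct.\n"
        "Question: {input}\nOutput: {output}\nExpected: {expected}\n"
        "Reply with exactly two lines:\nREASONING: <why>\nSCORE: PASS or FAIL"
    )

    def __init__(self, complete: Callable[[str], str]):
        # `complete` is any prompt -> text completion function (e.g. an LLM API call)
        self.complete = complete

    def judge(self, input: str, output: str, expected: str) -> Judgment:
        reply = self.complete(
            self.PROMPT.format(input=input, output=output, expected=expected)
        )
        # Parse the "KEY: value" lines of the judge's reply
        fields = {k.strip(): v.strip() for k, _, v in
                  (line.partition(":") for line in reply.splitlines() if ":" in line)}
        return Judgment(score=fields.get("SCORE", "FAIL").upper() == "PASS",
                        reasoning=fields.get("REASONING", ""))

# Stubbed completion for demonstration; a real judge would call an LLM here.
def fake_llm(prompt: str) -> str:
    return "REASONING: The output matches the expected answer.\nSCORE: PASS"

verdict = CorrectnessJudge(fake_llm).judge("What is 2+2?", "4", "4")
```

The key design point, which the pre-built judges encapsulate, is the rubric prompt plus a strict output format that can be parsed into a structured verdict.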

Judgy: Correcting LLM Judge Bias for Reliable AI Model Evaluation
Judgy is a Python package designed to make LLM-as-a-judge evaluations more reliable. It estimates the true success rate of a system by correcting for the judge's bias (estimated from a small set of human-labeled examples) and generates confidence intervals through bootstrapping, yielding more accurate and trustworthy assessments of AI model performance.
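The underlying correction can be sketched as follows. If the judge's true-positive rate (TPR) and true-negative rate (TNR) are estimated on a labeled set, the observed pass rate can be inverted to recover the true rate (a Rogan-Gladen-style estimator), and resampling both datasets gives a bootstrap interval. This is a self-contained illustration of the technique, not judgy's actual API; the function names are hypothetical:

```python
import random

def corrected_rate(pass_rate: float, tpr: float, tnr: float) -> float:
    """Invert the judge's error rates to recover the true success rate.
    observed = theta * TPR + (1 - theta) * (1 - TNR)
    =>  theta = (observed + TNR - 1) / (TPR + TNR - 1)
    """
    return (pass_rate + tnr - 1) / (tpr + tnr - 1)

def bootstrap_ci(labels, judge_on_labeled, judge_on_unlabeled,
                 n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for the corrected success rate.
    labels:            human ground truth (1 = success) on a small labeled set
    judge_on_labeled:  judge verdicts (1 = pass) on that same labeled set
    judge_on_unlabeled: judge verdicts on the full unlabeled evaluation set
    """
    rng = random.Random(seed)
    pairs = list(zip(labels, judge_on_labeled))
    estimates = []
    for _ in range(n_boot):
        # Resample the labeled set to re-estimate the judge's TPR / TNR
        sample = [rng.choice(pairs) for _ in pairs]
        pos = [j for y, j in sample if y == 1]
        neg = [j for y, j in sample if y == 0]
        if not pos or not neg:
            continue  # degenerate resample: skip
        tpr = sum(pos) / len(pos)
        tnr = 1 - sum(neg) / len(neg)
        # Resample the unlabeled verdicts to re-estimate the observed pass rate
        preds = [rng.choice(judge_on_unlabeled) for _ in judge_on_unlabeled]
        theta = corrected_rate(sum(preds) / len(preds), tpr, tnr)
        estimates.append(min(1.0, max(0.0, theta)))  # clamp to [0, 1]
    estimates.sort()
    return estimates[int(0.025 * len(estimates))], estimates[int(0.975 * len(estimates))]
```

For example, a judge with TPR 0.9 and TNR 0.8 that passes 69% of outputs implies a true success rate of 70%, since 0.7 * 0.9 + 0.3 * 0.2 = 0.69.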