Repository History
Explore all analyzed open source repositories

judges: A Python Library for LLM-as-a-Judge Evaluators
The `judges` library is a small, focused Python toolkit for using and creating LLM-as-a-Judge evaluators. It ships a curated set of research-backed, pre-built judges for common use cases and supports both off-the-shelf usage and custom judge creation, helping developers evaluate the output quality of their Large Language Models.
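To make the LLM-as-a-Judge pattern concrete, here is a minimal, self-contained sketch of a custom correctness judge. This illustrates the general technique only; the class name, prompt, and `complete` callable are hypothetical placeholders, not the `judges` library's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Judgment:
    score: bool      # pass/fail verdict from the judge
    reasoning: str   # the judge's explanation

class CorrectnessJudge:
    """Hypothetical minimal judge: grade an output against an expected answer."""

    PROMPT = (
        "You are an impartial judge. Given a question, a model output, and an "
        "expected answer, decide whether the output is correct.\n"
        "Question: {input}\nOutput: {output}\nExpected: {expected}\n"
        "Reply with exactly two lines:\nREASONING: <why>\nSCORE: PASS or FAIL"
    )

    def __init__(self, complete: Callable[[str], str]):
        # `complete` is any prompt -> text completion function (e.g. an LLM API call)
        self.complete = complete

    def judge(self, input: str, output: str, expected: str) -> Judgment:
        reply = self.complete(
            self.PROMPT.format(input=input, output=output, expected=expected)
        )
        # Parse the "KEY: value" lines of the judge's reply
        fields = {k.strip(): v.strip() for k, _, v in
                  (line.partition(":") for line in reply.splitlines() if ":" in line)}
        return Judgment(score=fields.get("SCORE", "FAIL").upper() == "PASS",
                        reasoning=fields.get("REASONING", ""))

# Stubbed completion for demonstration; a real judge would call an LLM here.
def fake_llm(prompt: str) -> str:
    return "REASONING: The output matches the expected answer.\nSCORE: PASS"

verdict = CorrectnessJudge(fake_llm).judge("What is 2+2?", "4", "4")
```

The key design point, which the pre-built judges encapsulate, is the rubric prompt plus a strict output format that can be parsed into a structured verdict.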

Judgy: Correcting LLM Judge Bias for Reliable AI Model Evaluation
Judgy is a Python package designed to make LLM-as-a-judge evaluations more reliable. It estimates the true success rate of a system by correcting for the judge's bias (estimated from a small set of human-labeled examples) and generates confidence intervals through bootstrapping, yielding more accurate and trustworthy assessments of AI model performance.
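The underlying correction can be sketched as follows. If the judge's true-positive rate (TPR) and true-negative rate (TNR) are estimated on a labeled set, the observed pass rate can be inverted to recover the true rate (a Rogan-Gladen-style estimator), and resampling both datasets gives a bootstrap interval. This is a self-contained illustration of the technique, not judgy's actual API; the function names are hypothetical:

```python
import random

def corrected_rate(pass_rate: float, tpr: float, tnr: float) -> float:
    """Invert the judge's error rates to recover the true success rate.
    observed = theta * TPR + (1 - theta) * (1 - TNR)
    =>  theta = (observed + TNR - 1) / (TPR + TNR - 1)
    """
    return (pass_rate + tnr - 1) / (tpr + tnr - 1)

def bootstrap_ci(labels, judge_on_labeled, judge_on_unlabeled,
                 n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for the corrected success rate.
    labels:            human ground truth (1 = success) on a small labeled set
    judge_on_labeled:  judge verdicts (1 = pass) on that same labeled set
    judge_on_unlabeled: judge verdicts on the full unlabeled evaluation set
    """
    rng = random.Random(seed)
    pairs = list(zip(labels, judge_on_labeled))
    estimates = []
    for _ in range(n_boot):
        # Resample the labeled set to re-estimate the judge's TPR / TNR
        sample = [rng.choice(pairs) for _ in pairs]
        pos = [j for y, j in sample if y == 1]
        neg = [j for y, j in sample if y == 0]
        if not pos or not neg:
            continue  # degenerate resample: skip
        tpr = sum(pos) / len(pos)
        tnr = 1 - sum(neg) / len(neg)
        # Resample the unlabeled verdicts to re-estimate the observed pass rate
        preds = [rng.choice(judge_on_unlabeled) for _ in judge_on_unlabeled]
        theta = corrected_rate(sum(preds) / len(preds), tpr, tnr)
        estimates.append(min(1.0, max(0.0, theta)))  # clamp to [0, 1]
    estimates.sort()
    return estimates[int(0.025 * len(estimates))], estimates[int(0.975 * len(estimates))]
```

For example, a judge with TPR 0.9 and TNR 0.8 that passes 69% of outputs implies a true success rate of 70%, since 0.7 * 0.9 + 0.3 * 0.2 = 0.69.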