Evidently: Open-Source ML and LLM Observability Framework

Introduction

Evidently is an open-source Python library for ML and LLM observability. It lets you evaluate, test, and monitor AI-powered systems and data pipelines, from tabular data to generative AI applications. With more than 100 built-in metrics, it supports both offline evaluations and live monitoring, and its modular architecture covers a range of use cases.

Installation

To get started with Evidently, you can install it using pip or Conda.

pip install evidently

Alternatively, for Conda users:

conda install -c conda-forge evidently

To run the Evidently UI with demo projects, you can use uv or a standard virtual environment:

uv run --with evidently evidently ui --demo-projects all

If uv is not installed, set up a virtual environment:

pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install evidently
evidently ui --demo-projects all

Then open http://localhost:8000 in your browser.

Examples

Evidently offers comprehensive tools for both LLM and traditional ML/data evaluations, along with a monitoring dashboard.

LLM Evaluations

Here's a quick example for LLM evaluations, checking sentiment, text length, and specific word presence in responses.

import pandas as pd
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals

# Toy dataset: each row is a question and the model's answer
eval_df = pd.DataFrame(
    [
        ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
        ["Who painted the Mona Lisa?", "Leonardo da Vinci."],
        ["Can you write an essay?", "I'm sorry, but I can't assist with homework."],
    ],
    columns=["question", "answer"],
)

# Wrap the DataFrame in an Evidently Dataset and attach row-level descriptors
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        # Flag answers that contain any of the listed words
        Contains("answer", items=["sorry", "apologize"], mode="any", alias="Denials"),
    ],
)

report = Report([
    TextEvals()
])

my_eval = report.run(eval_dataset)
my_eval
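
To make the descriptors above more concrete, the sketch below approximates two of them, text length and word presence, in plain pandas. It is an illustration under stated assumptions, not Evidently's implementation: it assumes the length descriptor counts characters and that word matching is a case-sensitive substring check. The names sketch_df and denial_words are hypothetical, and sentiment is left out because it depends on a scoring model.

```python
import pandas as pd

# Same toy answers as in the example above
sketch_df = pd.DataFrame(
    {
        "answer": [
            "The capital of Japan is Tokyo.",
            "Leonardo da Vinci.",
            "I'm sorry, but I can't assist with homework.",
        ]
    }
)

# Hypothetical word list mirroring the `items` argument above
denial_words = ["sorry", "apologize"]

# Assumption: "Length" is the number of characters in the answer
sketch_df["Length"] = sketch_df["answer"].str.len()

# Assumption: "Denials" is True when any listed word occurs as a substring
sketch_df["Denials"] = sketch_df["answer"].apply(
    lambda text: any(word in text for word in denial_words)
)

print(sketch_df)
```

Only the last answer is flagged as a denial here, which is the kind of row-level signal the report then summarizes.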

Data and ML Evaluations

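For data and ML evaluations, Evidently can detect data drift using a range of statistical methods. The example below compares two slices of the Iris dataset using the Population Stability Index (method="psi").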

import pandas as pd
from sklearn import datasets

from evidently import Report
from evidently.presets import DataDriftPreset

iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame

report = Report(
    [DataDriftPreset(method="psi")],
    include_tests=True,  # boolean flag, not the string "True"
)

# Compare the first 60 rows against the remaining rows
my_eval = report.run(iris_frame.iloc[:60], iris_frame.iloc[60:])
my_eval

You can also save reports as HTML files using my_eval.save_html("file.html").
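
For readers unfamiliar with the method="psi" option used in the drift example, here is a minimal NumPy sketch of the Population Stability Index. It is not Evidently's implementation: the function name, the n_bins and eps parameters, and the binning strategy (equal-width bins taken from the reference sample) are all assumptions made for illustration.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-4):
    """Illustrative PSI between two 1-D numeric samples (hypothetical helper)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Equal-width bin edges derived from the reference sample
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    # Widen the outer edges so out-of-range current values are still counted
    edges[0] = min(edges[0], current.min())
    edges[-1] = max(edges[-1], current.max())

    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)

    # Convert counts to shares; clip to eps to avoid log(0) and division by zero
    ref_share = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_share = np.clip(cur_counts / cur_counts.sum(), eps, None)

    # PSI = sum over bins of (current - reference) * ln(current / reference)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(seed=0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
same = rng.normal(loc=0.0, scale=1.0, size=5_000)
shifted = rng.normal(loc=1.0, scale=1.0, size=5_000)

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"same distribution: {psi_same:.4f}, shifted mean: {psi_shifted:.4f}")
```

Every term in the sum is non-negative, so the score is zero only when the binned shares match and grows as the two distributions move apart.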

Monitoring Dashboard

Evidently also provides a Monitoring UI service to visualize metrics and test results over time. You can self-host the open-source version or use Evidently Cloud for additional features like dataset management, alerting, and no-code evaluations.

Why Use Evidently

Evidently provides a single toolkit for evaluating many aspects of AI systems, which helps maintain model quality and reliability. It ships with over 100 built-in evaluations and lets you add custom ones. It works with tabular and text data, supports evaluations for predictive and generative tasks, and provides both offline evaluations and live monitoring.

Key evaluation capabilities include:

Text descriptors: Length, sentiment, toxicity, language, special symbols, regular expression matches.
LLM outputs: Semantic similarity, retrieval relevance, summarization quality, using model- and LLM-based evaluations.
Data quality: Missing values, duplicates, min-max ranges, new categorical values, correlations.
Data distribution drift: Over 20 statistical tests and distance metrics to compare shifts in data distribution.
Classification: Accuracy, precision, recall, ROC AUC, confusion matrix, bias.
Regression: MAE, ME, RMSE, error distribution, error normality, error bias.
Ranking (including RAG): NDCG, MAP, MRR, Hit Rate.
Recommendations: Serendipity, novelty, diversity, popularity bias.
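
As a rough guide to the ranking metrics listed above, the sketch below computes Hit Rate, MRR, and NDCG at k from binary relevance labels in pure Python. The function names and input layout (one list of 0/1 labels per query, in ranked order) are assumptions for illustration; Evidently's own definitions and options may differ.

```python
import math


def hit_rate_at_k(ranked_relevance, k):
    """Share of queries with at least one relevant item in the top k."""
    hits = [any(rels[:k]) for rels in ranked_relevance]
    return sum(hits) / len(hits)


def mrr_at_k(ranked_relevance, k):
    """Mean reciprocal rank of the first relevant item in the top k."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(ranked_relevance)


def ndcg_at_k(rels, k):
    """NDCG for one query; the ideal order is built from the given labels only."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Three queries: relevance (1 = relevant) of the returned items, in ranked order
results = [
    [0, 1, 0],  # first relevant item at rank 2
    [1, 0, 0],  # first relevant item at rank 1
    [0, 0, 0],  # nothing relevant retrieved
]

print("Hit Rate@3:", hit_rate_at_k(results, k=3))
print("MRR@3:", mrr_at_k(results, k=3))
print("NDCG@3 per query:", [ndcg_at_k(rels, k=3) for rels in results])
```

The same per-query lists could come from a recommender or from the retrieval step of a RAG pipeline, where each label marks whether a retrieved chunk was relevant.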