AgentEvals: Robust Evaluation Tools for LLM Agent Trajectories

Introduction

Agentic applications give Large Language Models (LLMs) the freedom to control their own flow when solving complex problems. This approach is powerful, but the black-box nature of LLMs makes it hard to predict how a change to one part of an agent will affect its behavior downstream. Robust evaluation is therefore critical.

AgentEvals, developed by LangChain, provides evaluators and utilities for assessing the performance of your agents. Its primary focus is the agent trajectory: the intermediate steps an agent takes during execution. The package is intended as a conceptual starting point for your agent's evaluation strategy. For more general evaluation tools, see its companion package, openevals.

Installation

Getting started with AgentEvals is straightforward. You can install it using pip for Python or npm for TypeScript:

Python

pip install agentevals

TypeScript

npm install agentevals @langchain/core

For LLM-as-judge evaluators, you will also need an LLM client. By default, agentevals uses LangChain chat model integrations and comes with langchain_openai pre-installed. Alternatively, you can install the OpenAI client directly:

Python

pip install openai

TypeScript

npm install openai

It also helps to be familiar with general evaluation concepts and with LangSmith's pytest integration for running evaluations, both of which are covered in the LangSmith documentation.

Examples

AgentEvals provides various evaluators, including trajectory match evaluators and LLM-as-judge evaluators. Here is a quick example demonstrating how to use an LLM-as-judge evaluator to assess the accuracy of an agent's trajectory.

First, set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your_openai_api_key"

Then, you can run your first trajectory evaluator. Agent trajectories are represented as a list of OpenAI-style messages:

from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
import json

trajectory_evaluator = create_trajectory_llm_as_judge(
    prompt=TRAJECTORY_ACCURACY_PROMPT,
    model="openai:o3-mini",
)

# This is a fake trajectory; in practice you would run your agent to get a real one.
outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "SF"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

eval_result = trajectory_evaluator(
    outputs=outputs,
)

print(eval_result)

This will output a result similar to:


{
  'key': 'trajectory_accuracy',
  'reasoning': "The trajectory accurately follows the user's request for weather information in SF. Initially, the assistant recognizes the goal (providing weather details), then it efficiently makes a tool call to get the weather, and finally it communicates the result clearly. All steps demonstrate logical progression and efficiency. Thus, the score should be: true.",
  'score': True
}

AgentEvals also offers various trajectory_match_mode options (strict, unordered, subset, superset) and tool_args_match_modes for fine-grained control over how trajectories are compared against references. Additionally, it includes Graph Trajectory evaluators, designed for frameworks like LangGraph, which allow evaluation based on nodes visited rather than just messages.
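To make the four match modes concrete, here is a minimal pure-Python sketch of what each one could mean when comparing only the tool calls in two trajectories. It is an illustration under simplifying assumptions, not the library's implementation: the helper names `tool_calls` and `trajectories_match` are hypothetical, message content and roles are ignored, and "subset" is assumed to mean the agent's calls are contained in the reference while "superset" is assumed to mean the agent's calls cover the reference.

```python
import json
from collections import Counter


def tool_calls(trajectory):
    """Return the (name, arguments) pair of every tool call, in order."""
    calls = []
    for message in trajectory:
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            # Arguments arrive as a JSON string; normalise key order so that
            # equivalent argument dicts compare equal.
            args = json.dumps(json.loads(fn["arguments"]), sort_keys=True)
            calls.append((fn["name"], args))
    return calls


def trajectories_match(outputs, reference, mode="strict"):
    """Compare tool calls under one of the four modes (simplified semantics)."""
    got, want = tool_calls(outputs), tool_calls(reference)
    if mode == "strict":
        # Same calls, same order.
        return got == want
    got_counts, want_counts = Counter(got), Counter(want)
    if mode == "unordered":
        # Same calls, any order.
        return got_counts == want_counts
    if mode == "subset":
        # Every call the agent made also appears in the reference.
        return all(got_counts[c] <= want_counts[c] for c in got_counts)
    if mode == "superset":
        # The agent made at least every call in the reference.
        return all(want_counts[c] <= got_counts[c] for c in want_counts)
    raise ValueError(f"unknown mode: {mode!r}")


def call(name, **args):
    """Build a hypothetical assistant message carrying a single tool call."""
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"function": {"name": name, "arguments": json.dumps(args)}}],
    }


reference = [call("get_weather", city="SF"), call("get_time", city="SF")]
outputs = [call("get_time", city="SF"), call("get_weather", city="SF")]

print(trajectories_match(outputs, reference, "strict"))     # False
print(trajectories_match(outputs, reference, "unordered"))  # True
```

The two trajectories make the same calls in a different order, so they fail the strict comparison but pass the unordered one. In practice you would use the package's own trajectory match evaluators rather than a hand-rolled comparison like this.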

Why Use AgentEvals

Evaluating LLM agents is crucial for developing reliable and high-performing AI applications. AgentEvals provides a structured approach to this challenge:

Understand Agent Behavior: By focusing on trajectories, you gain insight into the decision-making process and intermediate steps of your agents.
Ensure Reliability: Implement automated checks to verify that agents consistently perform as expected, reducing unexpected behaviors.
Facilitate Debugging: Pinpoint exactly where an agent's trajectory deviates from the desired path, making debugging more efficient.
Improve Performance: Use evaluation results to iterate on agent design, leading to more efficient and accurate agentic applications.
LangSmith Integration: Seamlessly integrate with LangSmith for experiment tracking, detailed tracing, and comprehensive evaluation reporting over time.
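As a rough illustration of the debugging point above, the sketch below walks two OpenAI-style trajectories in parallel and reports the first step at which they differ. It is not part of AgentEvals; `step_summary` and `first_divergence` are hypothetical helpers, and the comparison assumes that only each message's role and tool-call names matter.

```python
from itertools import zip_longest


def step_summary(message):
    """Reduce a message to its role plus the names of any tool calls."""
    if message is None:  # one trajectory ended before the other
        return None
    names = tuple(c["function"]["name"] for c in message.get("tool_calls") or [])
    return (message["role"], names)


def first_divergence(outputs, reference):
    """Return the index of the first differing step, or None if none differ."""
    for i, (got, want) in enumerate(zip_longest(outputs, reference)):
        if step_summary(got) != step_summary(want):
            return i
    return None


reference = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": '{"city": "SF"}'}}
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

# A hypothetical faulty run: the agent answers without calling the tool.
outputs = [
    reference[0],
    {"role": "assistant", "content": "It is probably sunny."},
]

print(first_divergence(outputs, reference))  # 1
```

Here the agent skipped the `get_weather` call, so the trajectories diverge at index 1, the first assistant step.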