# AgentEvals: Robust Evaluation Tools for LLM Agent Trajectories

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Source: osrepos.com
Repository profile: https://osrepos.com/repo/langchain-ai-agentevals
Generated for open source discovery and AI-assisted research.

AgentEvals is a powerful open-source package from LangChain designed to simplify the evaluation of agentic applications. It provides a collection of ready-made evaluators and utilities, with a particular focus on analyzing agent trajectories, the intermediate steps an agent takes to solve problems. This helps developers understand and improve the reliability and performance of their LLM agents.

GitHub: https://github.com/langchain-ai/agentevals

## Topics

- Python
- LLM
- Agents
- Evaluation
- LangChain
- AI
- Testing

## Repository Information

Last analyzed by OSRepos: Tue Jun 30 2026 12:33:50 GMT+0100 (Western European Summer Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Introduction

Agentic applications grant Large Language Models (LLMs) the freedom to control their own flow to solve complex problems. While this approach is incredibly powerful, the inherent black-box nature of LLMs can make it challenging to understand how changes impact an agent's behavior downstream. This makes robust evaluation critically important.

AgentEvals, developed by LangChain, offers a suite of evaluators and utilities designed for assessing the performance of your agents. Its primary focus is the **agent trajectory**: the intermediate steps an agent takes during its execution. The package is intended as a conceptual starting point for your agent's evaluation strategy. For more general evaluation tools, see its companion package, [`openevals`](https://github.com/langchain-ai/openevals).

## Installation

Getting started with AgentEvals is straightforward. You can install it using pip for Python or npm for TypeScript:

**Python**

```bash
pip install agentevals
```


**TypeScript**

```bash
npm install agentevals @langchain/core
```


For LLM-as-judge evaluators, you will also need an LLM client. By default, `agentevals` uses LangChain chat model integrations and comes with `langchain_openai` pre-installed. Alternatively, you can install the OpenAI client directly:

**Python**

```bash
pip install openai
```


**TypeScript**

```bash
npm install openai
```


It is also beneficial to be familiar with [evaluation concepts](https://docs.smith.langchain.com/evaluation/concepts) and LangSmith's pytest integration for running evaluations, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).

## Examples

AgentEvals provides various evaluators, including trajectory match evaluators and LLM-as-judge evaluators. Here is a quick example demonstrating how to use an LLM-as-judge evaluator to assess the accuracy of an agent's trajectory.

First, set your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY="your_openai_api_key"
```


Then, you can run your first trajectory evaluator. Agent trajectories are represented as a list of OpenAI-style messages:

```python
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
import json

trajectory_evaluator = create_trajectory_llm_as_judge(
    prompt=TRAJECTORY_ACCURACY_PROMPT,
    model="openai:o3-mini",
)

# This is a fake trajectory; in reality you would run your agent to get a real one.
outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    # Tool arguments are passed as a JSON-encoded string.
                    "arguments": json.dumps({"city": "SF"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

eval_result = trajectory_evaluator(
    outputs=outputs,
)

print(eval_result)
```


This will output a result similar to:


```
{
  'key': 'trajectory_accuracy',
  'reasoning': "The trajectory accurately follows the user's request for weather information in SF. Initially, the assistant recognizes the goal (providing weather details), then it efficiently makes a tool call to get the weather, and finally it communicates the result clearly. All steps demonstrate logical progression and efficiency. Thus, the score should be: true.",
  'score': True
}
```
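When a judge's verdict is surprising, it can help to read the trajectory step by step. The following is a minimal sketch in plain Python, not part of `agentevals`: the helper name `summarize_trajectory` is hypothetical, and it assumes only the OpenAI-style message shape used in the example above (a `role`, a `content`, and optional `tool_calls` whose `arguments` are JSON-encoded strings).

```python
import json


def summarize_trajectory(messages):
    """Return one readable line per step of an OpenAI-style message list.

    Hypothetical helper for illustration; it is not an agentevals function.
    """
    lines = []
    for message in messages:
        role = message["role"]
        tool_calls = message.get("tool_calls") or []
        if tool_calls:
            for call in tool_calls:
                function = call["function"]
                # Arguments are stored as a JSON-encoded string, so decode them.
                arguments = json.loads(function["arguments"])
                lines.append(f"{role} calls {function['name']} with {arguments}")
        else:
            lines.append(f"{role}: {message['content']}")
    return lines


# Same fake trajectory as in the evaluator example above.
trajectory = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "SF"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

for line in summarize_trajectory(trajectory):
    print(line)
```

Printing a summary like this next to the evaluator's `reasoning` field makes it easier to see which step the judge is commenting on.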


AgentEvals also offers trajectory match evaluators that compare an agent's trajectory against a reference. The `trajectory_match_mode` option (strict, unordered, subset, superset) controls how the sequences of steps are compared, and `tool_args_match_mode` gives fine-grained control over how tool call arguments are compared. Additionally, it includes **Graph Trajectory** evaluators, designed for frameworks like LangGraph, which allow evaluation based on the nodes visited rather than just messages.
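To make the four match modes concrete, here is a simplified sketch in plain Python. It is an illustration of the idea, not the library's implementation: the function names `tool_call_names` and `trajectories_match` are hypothetical, and the comparison looks only at tool names, whereas the real evaluators also take message roles and tool arguments into account.

```python
from collections import Counter


def tool_call_names(trajectory):
    """Collect tool names, in call order, from OpenAI-style messages."""
    names = []
    for message in trajectory:
        for call in message.get("tool_calls") or []:
            names.append(call["function"]["name"])
    return names


def _contains(bigger, smaller):
    """True if every tool in `smaller` appears at least as often in `bigger`."""
    return all(bigger[name] >= count for name, count in smaller.items())


def trajectories_match(outputs, reference_outputs, mode="strict"):
    """Compare tool calls by name only (an assumed simplification)."""
    out_names = tool_call_names(outputs)
    ref_names = tool_call_names(reference_outputs)
    out_counts, ref_counts = Counter(out_names), Counter(ref_names)
    if mode == "strict":      # same tools, same order
        return out_names == ref_names
    if mode == "unordered":   # same tools, any order
        return out_counts == ref_counts
    if mode == "subset":      # no tools beyond those in the reference
        return _contains(ref_counts, out_counts)
    if mode == "superset":    # at least the tools in the reference
        return _contains(out_counts, ref_counts)
    raise ValueError(f"unknown mode: {mode}")


def assistant_call(name):
    """Build a minimal assistant message with one tool call (illustrative)."""
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"function": {"name": name, "arguments": "{}"}}],
    }


reference = [assistant_call("get_weather"), assistant_call("get_time")]
reordered = [assistant_call("get_time"), assistant_call("get_weather")]
partial = [assistant_call("get_weather")]

print(trajectories_match(reordered, reference, mode="strict"))
print(trajectories_match(reordered, reference, mode="unordered"))
print(trajectories_match(partial, reference, mode="subset"))
print(trajectories_match(partial, reference, mode="superset"))
```

In this sketch, a reordered trajectory fails the strict comparison but passes the unordered one, and a trajectory that calls only some of the reference tools passes as a subset but not as a superset.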

## Why Use AgentEvals

Evaluating LLM agents is crucial for developing reliable and high-performing AI applications. AgentEvals provides a structured approach to this challenge:

*   **Understand Agent Behavior**: By focusing on trajectories, you gain insight into the decision-making process and intermediate steps of your agents.
*   **Ensure Reliability**: Implement automated checks to verify that agents consistently perform as expected, reducing unexpected behaviors.
*   **Facilitate Debugging**: Pinpoint exactly where an agent's trajectory deviates from the desired path, making debugging more efficient.
*   **Improve Performance**: Use evaluation results to iterate on agent design, leading to more efficient and accurate agentic applications.
*   **LangSmith Integration**: Seamlessly integrate with LangSmith for experiment tracking, detailed tracing, and comprehensive evaluation reporting over time.

## Links

Explore the AgentEvals repository on GitHub for more details, advanced usage, and contributions:

*   [GitHub Repository](https://github.com/langchain-ai/agentevals)