AgentEvals: Robust Evaluation Tools for LLM Agent Trajectories

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

AgentEvals is an open-source package from LangChain for evaluating agentic applications. It provides ready-made evaluators and utilities, with a particular focus on agent trajectories: the intermediate steps an agent takes to solve a problem. This helps developers understand and improve the reliability and performance of their LLM agents.

Repository Information

Analyzed by OSRepos on June 30, 2026

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

Agentic applications grant Large Language Models (LLMs) the freedom to control their own flow to solve complex problems. While this approach is incredibly powerful, the inherent black-box nature of LLMs can make it challenging to understand how changes impact an agent's behavior downstream. This makes robust evaluation critically important.

AgentEvals, developed by LangChain, offers a suite of evaluators and utilities designed for assessing agent performance. Its primary focus is the agent trajectory: the intermediate steps an agent takes during execution. The package is intended as a conceptual starting point for your agent's evaluation strategy. For more general evaluation tools, see its companion package, openevals.

Installation

Getting started with AgentEvals is straightforward. You can install it using pip for Python or npm for TypeScript:

Python

pip install agentevals

TypeScript

npm install agentevals @langchain/core

For LLM-as-judge evaluators, you will also need an LLM client. By default, agentevals uses LangChain chat model integrations and comes with langchain_openai pre-installed. Alternatively, you can install the OpenAI client directly:

Python

pip install openai

TypeScript

npm install openai

It also helps to be familiar with general evaluation concepts and with LangSmith's pytest integration for running evaluations.

Examples

AgentEvals provides various evaluators, including trajectory match evaluators and LLM-as-judge evaluators. Here is a quick example demonstrating how to use an LLM-as-judge evaluator to assess the accuracy of an agent's trajectory.

First, set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your_openai_api_key"

Then, you can run your first trajectory evaluator. Agent trajectories are represented as a list of OpenAI-style messages:

from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
import json

trajectory_evaluator = create_trajectory_llm_as_judge(
    prompt=TRAJECTORY_ACCURACY_PROMPT,
    model="openai:o3-mini",
)

# This is a fake trajectory; in practice you would run your agent to get a real one
outputs = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "SF"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

eval_result = trajectory_evaluator(
    outputs=outputs,
)

print(eval_result)

This will output a result similar to:

{
  'key': 'trajectory_accuracy',
  'reasoning': "The trajectory accurately follows the user's request for weather information in SF. Initially, the assistant recognizes the goal (providing weather details), then it efficiently makes a tool call to get the weather, and finally it communicates the result clearly. All steps demonstrate logical progression and efficiency. Thus, the score should be: true.",
  'score': True
}
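To make the OpenAI-style message format above more concrete, the following stdlib-only sketch walks a trajectory and lists the tool calls it contains. The helper name `tool_calls` is hypothetical and is not part of agentevals; it only assumes the message shape shown in the example (a `tool_calls` list whose entries carry a `function` with a `name` and JSON-encoded `arguments`).

```python
import json


def tool_calls(trajectory):
    """Return (name, arguments) pairs for every tool call, in order.

    Hypothetical helper for illustration; not an agentevals API.
    """
    calls = []
    for message in trajectory:
        # Messages without tool calls (user, tool, plain assistant) are skipped.
        for call in message.get("tool_calls") or []:
            function = call["function"]
            calls.append((function["name"], json.loads(function["arguments"])))
    return calls


trajectory = [
    {"role": "user", "content": "What is the weather in SF?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "SF"}),
                }
            }
        ],
    },
    {"role": "tool", "content": "It's 80 degrees and sunny in SF."},
    {"role": "assistant", "content": "The weather in SF is 80 degrees and sunny."},
]

print(tool_calls(trajectory))  # [('get_weather', {'city': 'SF'})]
```

A flattened view like this can be handy when inspecting a trajectory by hand before passing it to an evaluator.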

AgentEvals also offers trajectory match evaluators with several trajectory_match_mode options (strict, unordered, subset, superset), plus a tool_args_match_mode setting for fine-grained control over how tool call arguments are compared against a reference. Additionally, it includes graph trajectory evaluators, designed for frameworks like LangGraph, which evaluate an agent based on the nodes it visited rather than only its messages.
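As a rough intuition for the four match modes named above, here is a simplified stdlib-only sketch. It is not the library's implementation: it compares tool call names only and ignores arguments and message content. It also assumes that "subset" means the agent's calls are contained in the reference, and "superset" means the agent's calls contain the reference. The names `tool_names`, `trajectories_match`, and `assistant_calling` are hypothetical.

```python
from collections import Counter


def tool_names(trajectory):
    """Flatten a trajectory into the ordered list of tool names it calls."""
    return [
        call["function"]["name"]
        for message in trajectory
        for call in message.get("tool_calls") or []
    ]


def trajectories_match(outputs, reference, mode="strict"):
    """Compare tool call names under one of four illustrative modes."""
    out, ref = tool_names(outputs), tool_names(reference)
    if mode == "strict":
        return out == ref  # same calls, same order
    out_counts, ref_counts = Counter(out), Counter(ref)
    if mode == "unordered":
        return out_counts == ref_counts  # same calls, any order
    if mode == "subset":
        return not (out_counts - ref_counts)  # no calls beyond the reference
    if mode == "superset":
        return not (ref_counts - out_counts)  # no reference calls missing
    raise ValueError(f"unknown mode: {mode}")


def assistant_calling(*names):
    """Build a minimal assistant message that calls the given tools."""
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"function": {"name": n, "arguments": "{}"}} for n in names],
    }


reference = [assistant_calling("get_weather"), assistant_calling("get_time")]
swapped = [assistant_calling("get_time"), assistant_calling("get_weather")]
partial = [assistant_calling("get_weather")]

print(trajectories_match(swapped, reference, "strict"))     # False
print(trajectories_match(swapped, reference, "unordered"))  # True
print(trajectories_match(partial, reference, "subset"))     # True
print(trajectories_match(partial, reference, "superset"))   # False
```

Using a multiset (`Counter`) rather than a plain set keeps repeated calls to the same tool from being silently collapsed, which seems like the safer default for a sketch of this kind.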

Why Use AgentEvals

Evaluating LLM agents is crucial for developing reliable and high-performing AI applications. AgentEvals provides a structured approach to this challenge:

  • Understand Agent Behavior: By focusing on trajectories, you gain insight into the decision-making process and intermediate steps of your agents.
  • Ensure Reliability: Implement automated checks to verify that agents consistently perform as expected, reducing unexpected behaviors.
  • Facilitate Debugging: Pinpoint exactly where an agent's trajectory deviates from the desired path, making debugging more efficient.
  • Improve Performance: Use evaluation results to iterate on agent design, leading to more efficient and accurate agentic applications.
  • LangSmith Integration: Seamlessly integrate with LangSmith for experiment tracking, detailed tracing, and comprehensive evaluation reporting over time.

Links

Explore the AgentEvals repository on GitHub for more details, advanced usage, and contributions.

