{"name":"AgentEvals: Robust Evaluation Tools for LLM Agent Trajectories","description":"AgentEvals is a powerful open-source package from LangChain designed to simplify the evaluation of agentic applications. It provides a collection of ready-made evaluators and utilities, with a particular focus on analyzing agent trajectories, the intermediate steps an agent takes to solve problems. This helps developers understand and improve the reliability and performance of their LLM agents.","github":"https://github.com/langchain-ai/agentevals","url":"https://osrepos.com/repo/langchain-ai-agentevals","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/langchain-ai-agentevals","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/langchain-ai-agentevals.md","json":"https://osrepos.com/repo/langchain-ai-agentevals.json","topics":["Python","LLM","Agents","Evaluation","LangChain","AI","Testing"],"keywords":["Python","LLM","Agents","Evaluation","LangChain","AI","Testing"],"stars":null,"summary":"AgentEvals is a powerful open-source package from LangChain designed to simplify the evaluation of agentic applications. It provides a collection of ready-made evaluators and utilities, with a particular focus on analyzing agent trajectories, the intermediate steps an agent takes to solve problems. This helps developers understand and improve the reliability and performance of their LLM agents.","content":"## Introduction\n\nAgentic applications grant Large Language Models (LLMs) the freedom to control their own flow to solve complex problems. While this approach is incredibly powerful, the inherent black-box nature of LLMs can make it challenging to understand how changes impact an agent's behavior downstream. 
This makes robust evaluation critically important.\n\nAgentEvals, developed by LangChain, offers a comprehensive suite of evaluators and utilities specifically designed for assessing the performance of your agents. Its primary focus is on **agent trajectory**, examining the intermediate steps an agent takes during its execution. This package serves as an excellent conceptual starting point for your agent's evaluation strategy. For more general evaluation tools, consider checking out its companion package, [`openevals`](https://github.com/langchain-ai/openevals).\n\n## Installation\n\nGetting started with AgentEvals is straightforward. You can install it using pip for Python or npm for TypeScript:\n\n**Python**\n\n```bash\npip install agentevals\n```\n\n**TypeScript**\n\n```bash\nnpm install agentevals @langchain/core\n```\n\nFor LLM-as-judge evaluators, you will also need an LLM client. By default, `agentevals` uses LangChain chat model integrations and comes with `langchain_openai` pre-installed. Alternatively, you can install the OpenAI client directly:\n\n**Python**\n\n```bash\npip install openai\n```\n\n**TypeScript**\n\n```bash\nnpm install openai\n```\n\nIt is also beneficial to be familiar with [evaluation concepts](https://docs.smith.langchain.com/evaluation/concepts) and LangSmith's pytest integration for running evaluations, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).\n\n## Examples\n\nAgentEvals provides various evaluators, including trajectory match evaluators and LLM-as-judge evaluators. Here is a quick example demonstrating how to use an LLM-as-judge evaluator to assess the accuracy of an agent's trajectory.\n\nFirst, set your OpenAI API key as an environment variable:\n\n```bash\nexport OPENAI_API_KEY=\"your_openai_api_key\"\n```\n\nThen, you can run your first trajectory evaluator. 
Agent trajectories are represented as a list of OpenAI-style messages:\n\n```python\nfrom agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT\nimport json\n\ntrajectory_evaluator = create_trajectory_llm_as_judge(\n    prompt=TRAJECTORY_ACCURACY_PROMPT,\n    model=\"openai:o3-mini\",\n)\n\n# This is a fake trajectory; in reality you would run your agent to get a real one\noutputs = [\n    {\"role\": \"user\", \"content\": \"What is the weather in SF?\"},\n    {\n        \"role\": \"assistant\",\n        \"content\": \"\",\n        \"tool_calls\": [\n            {\n                \"function\": {\n                    \"name\": \"get_weather\",\n                    \"arguments\": json.dumps({\"city\": \"SF\"}),\n                }\n            }\n        ],\n    },\n    {\"role\": \"tool\", \"content\": \"It's 80 degrees and sunny in SF.\"},\n    {\"role\": \"assistant\", \"content\": \"The weather in SF is 80 degrees and sunny.\"},\n]\n\neval_result = trajectory_evaluator(\n  outputs=outputs,\n)\n\nprint(eval_result)\n```\n\nThis will output a result similar to:\n\n```\n{\n  'key': 'trajectory_accuracy',\n  'reasoning': 'The trajectory accurately follows the user's request for weather information in SF. Initially, the assistant recognizes the goal (providing weather details), then it efficiently makes a tool call to get the weather, and finally it communicates the result clearly. All steps demonstrate logical progression and efficiency. Thus, the score should be: true.',\n  'score': true\n}\n```\n\nAgentEvals also offers several `trajectory_match_mode` options (strict, unordered, subset, superset) and a `tool_args_match_mode` setting for fine-grained control over how trajectories are compared against references. 
Additionally, it includes **Graph Trajectory** evaluators, designed for frameworks like LangGraph, which allow evaluation based on nodes visited rather than just messages.\n\n## Why Use AgentEvals\n\nEvaluating LLM agents is crucial for developing reliable and high-performing AI applications. AgentEvals provides a structured approach to this challenge:\n\n*   **Understand Agent Behavior**: By focusing on trajectories, you gain insight into the decision-making process and intermediate steps of your agents.\n*   **Ensure Reliability**: Implement automated checks to verify that agents consistently perform as expected, reducing unexpected behaviors.\n*   **Facilitate Debugging**: Pinpoint exactly where an agent's trajectory deviates from the desired path, making debugging more efficient.\n*   **Improve Performance**: Use evaluation results to iterate on agent design, leading to more efficient and accurate agentic applications.\n*   **LangSmith Integration**: Seamlessly integrate with LangSmith for experiment tracking, detailed tracing, and comprehensive evaluation reporting over time.\n\n## Links\n\nExplore the AgentEvals repository on GitHub for more details, advanced usage, and contributions:\n\n*   [GitHub Repository](https://github.com/langchain-ai/agentevals)","metrics":{"detailViews":2,"githubClicks":2},"dates":{"published":null,"modified":"2026-06-30T11:33:50.000Z"}}