Phoenix: AI Observability and Evaluation Platform for LLMs

Introduction

Phoenix is an open-source AI observability platform built by Arize AI for experimenting with, evaluating, and troubleshooting large language model (LLM) applications. It combines tracing, evaluation, dataset and prompt management, and an interactive playground in a single tool.

Key capabilities include:

Tracing: Instrument your LLM application's runtime using OpenTelemetry for detailed visibility.
Evaluation: Benchmark application performance with LLM-powered response and retrieval evaluations.
Datasets: Create versioned datasets for experimentation, evaluation, and fine-tuning.
Experiments: Track and evaluate changes to prompts, LLMs, and retrieval strategies.
Playground: Optimize prompts, compare models, adjust parameters, and replay traced LLM calls.
Prompt Management: Systematically manage and test prompt changes with version control and tagging.
PXI (Phoenix Intelligence): An integrated AI engineering agent for debugging traces and iterating on prompts.
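
To make the Tracing capability more concrete, here is a minimal, standard-library-only sketch of the kind of data a trace records: nested spans that share a trace ID and point to their parent. ToyTracer, Span, and answer are hypothetical names invented for illustration. This is not the OpenTelemetry or Phoenix API, and no real model or retriever is called.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """One timed unit of work, e.g. a retrieval step or an LLM call."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, Any] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class ToyTracer:
    """Hypothetical in-memory tracer; not the OpenTelemetry or Phoenix API."""

    def __init__(self) -> None:
        self.finished: List[Span] = []   # spans in the order they completed
        self._stack: List[Span] = []     # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            name=name,
            # Child spans inherit the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            attributes=dict(attributes),
            start=time.perf_counter(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


def answer(tracer: ToyTracer, question: str) -> str:
    """Stand-in for an LLM application: retrieve, then generate."""
    with tracer.span("answer", question=question):
        with tracer.span("retrieve") as retrieval:
            docs = ["doc-1", "doc-2"]  # placeholder retrieval result
            retrieval.attributes["num_docs"] = len(docs)
        with tracer.span("llm_call", model="hypothetical-model") as call:
            reply = f"stub reply using {len(docs)} documents"  # no model is called
            call.attributes["output"] = reply
    return reply
```

Calling answer(ToyTracer(), "some question") leaves three finished spans on the tracer: "retrieve" and "llm_call" as children of the root "answer" span. An observability backend receives a similar parent/child structure, which is what lets it show where time was spent and what each step produced.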

Phoenix is designed to be vendor and language agnostic, with out-of-the-box support for popular frameworks such as OpenAI Agents SDK, LangChain, LlamaIndex, and DSPy, as well as LLM providers such as OpenAI, Anthropic, and Google GenAI. It can run on your local machine, in a Jupyter notebook, as a containerized deployment, or in the cloud.

Installation

The core package can be installed with pip or conda. With pip:

pip install arize-phoenix

For containerized deployments, Phoenix container images are available via Docker Hub and can be deployed using Docker or Kubernetes. Arize AI also offers cloud instances at app.phoenix.arize.com.

Examples

Phoenix provides integration examples across LLM frameworks and providers. On the Python side, these cover OpenAI Agents SDK, LlamaIndex, LangChain, and DSPy, with tracing and evaluation supported in each. For JavaScript developers, integrations include the OpenAI Node SDK, LangChain.js, and Vercel AI SDK.

Beyond the main platform, Phoenix provides lightweight Python sub-packages: arize-phoenix-otel for OpenTelemetry wrappers, arize-phoenix-client for API interaction, and arize-phoenix-evals for LLM evaluation tooling. Similar TypeScript sub-packages are also available. The repository also includes coding agent skills for platforms like Claude Code and Cursor, which support debugging and evaluation workflows directly within your coding environment.
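
Since the platform and its sub-packages are separate distributions, it can help to check which ones are present in the current Python environment. The sketch below uses only the standard library's importlib.metadata. The function name installed_versions is a hypothetical helper, and the distribution names are the ones given in the text above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Distribution names as listed in this article.
PHOENIX_DISTRIBUTIONS = (
    "arize-phoenix",
    "arize-phoenix-otel",
    "arize-phoenix-client",
    "arize-phoenix-evals",
)


def installed_versions(
    names: Iterable[str] = PHOENIX_DISTRIBUTIONS,
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print a short report for the current environment.
for dist_name, dist_version in installed_versions().items():
    print(f"{dist_name}: {dist_version or 'not installed'}")
```

A lightweight setup might show only one or two of these installed, for example when an application only needs to export traces and does not run the full platform locally.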

Why use Phoenix

Phoenix brings tracing, evaluation, and dataset management for LLM applications into one place, giving visibility into how your models and the code around them behave. Because it is vendor and language agnostic and built on OpenTelemetry, it can integrate with an existing tech stack rather than replace it. Experiments track and evaluate changes to prompts and models, and the playground lets developers iterate on those changes quickly. With flexible deployment options and a strong community, Phoenix is a practical choice for debugging, improving, and maintaining LLM applications.
