Arize Phoenix

Arize Phoenix is an open-source AI observability platform for tracing, evaluating, and troubleshooting LLM applications. With over 9,000 stars on GitHub, it provides end-to-end visibility into AI system behavior using OpenTelemetry-based instrumentation — capturing traces of LLM flows across frameworks like LangChain, LlamaIndex, Haystack, DSPy, and providers like OpenAI and Bedrock.1)

Phoenix combines tracing, evaluation, and dataset management in one tool, purpose-built for LLM-specific issues like prompt drift, hallucinations, tool flakiness, and cost analysis. It runs locally in Jupyter notebooks, self-hosted, or in the cloud — with zero vendor lock-in thanks to OpenTelemetry.2)

How It Works

Phoenix instruments your LLM application using OpenTelemetry (OTEL) and the OpenInference specification for AI-specific telemetry. Every LLM call, retrieval step, tool invocation, and reasoning chain is captured as a span within a trace. These traces are visualized in a web UI that shows the full execution path, latencies, token counts, and costs.
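The span-within-a-trace structure can be illustrated with a schematic sketch. This is not the Phoenix or OpenTelemetry API, just plain dataclasses showing how nested spans carry attributes; the attribute key llm.token_count.total follows the OpenInference naming convention.

from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                      # e.g. "LLM", "RETRIEVER", "TOOL"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def total_tokens(span: Span) -> int:
    """Walk the span tree and sum token counts recorded on LLM spans."""
    count = span.attributes.get("llm.token_count.total", 0)
    return count + sum(total_tokens(c) for c in span.children)

# One trace: a chain span wrapping a retrieval step and an LLM call
trace = Span("rag_query", "CHAIN", children=[
    Span("vector_search", "RETRIEVER", {"retrieval.documents": 4}),
    Span("chat_completion", "LLM", {"llm.token_count.total": 512}),
])

print(total_tokens(trace))  # 512

A real instrumentation library records the same hierarchy automatically; the point here is only that per-span attributes roll up into the trace-level latency, token, and cost figures shown in the UI.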

The platform then enables LLM-powered evaluations — running benchmarks for response quality, retrieval relevance, faithfulness, and toxicity against your traced data. Combined with versioned datasets and a prompt playground, Phoenix creates a complete experimentation loop.
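The "rails" idea behind LLM-as-judge evaluation can be sketched as follows. The judge's free-text verdict is snapped onto a fixed label set so results can be aggregated; judge() below is a crude keyword-overlap stand-in for a real model call, not Phoenix code.

def snap_to_rails(raw: str, rails: list[str]) -> str:
    """Return the first rail found in the judge output, else 'unparseable'."""
    lowered = raw.lower()
    for rail in rails:
        if rail in lowered:
            return rail
    return "unparseable"

def judge(query: str, answer: str) -> str:
    # Stand-in for an LLM judge; a real judge would call a model.
    overlap = set(query.lower().split()) & set(answer.lower().split())
    return "relevant" if overlap else "irrelevant"

# List "irrelevant" first so it is not shadowed by its substring "relevant"
rails = ["irrelevant", "relevant"]
records = [
    {"input": "Explain quantum computing", "output": "Quantum computing uses qubits"},
    {"input": "Explain quantum computing", "output": "Paris is the capital of France"},
]
labels = [snap_to_rails(judge(r["input"], r["output"]), rails) for r in records]
print(labels)  # ['relevant', 'irrelevant']

Phoenix's llm_classify applies the same pattern against a dataframe of traced spans, with a real LLM as the judge.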

Key Features

- OpenTelemetry/OpenInference tracing of LLM calls, retrieval steps, tool invocations, and reasoning chains
- LLM-powered evaluations for relevance, faithfulness, toxicity, and hallucination
- Per-trace latency, token-count, and cost metrics
- Versioned datasets with export for fine-tuning
- Prompt playground for interactive testing
- Embedding cluster visualization

Installation and Usage

# Install Phoenix
# pip install arize-phoenix openinference-instrumentation-openai

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Launch the Phoenix app locally
session = px.launch_app()

# Register an OpenTelemetry tracer provider pointed at Phoenix
tracer_provider = register()

# Auto-instrument OpenAI
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Now all OpenAI calls are automatically traced
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# Trace is captured and visible in the Phoenix UI

# Run evaluations on traced data
from phoenix.evals import OpenAIModel, llm_classify

eval_results = llm_classify(
    dataframe=px.Client().get_spans_dataframe(),
    model=OpenAIModel(model="gpt-4o"),
    template="Is this response relevant to the query? {input} -> {output}",
    rails=["relevant", "irrelevant"]
)

Architecture

%%{init: {'theme': 'dark'}}%%
graph TB
    App([Your LLM App]) -->|Auto-Instrumented| OTEL[OpenTelemetry SDK]
    OTEL -->|Spans + Traces| Collector[Phoenix Collector]
    Collector -->|Store| DB[Trace Database]
    DB -->|Query| UI[Phoenix Web UI]
    UI -->|Trace View| Traces[Trace Explorer]
    UI -->|Metrics| Dashboard[Latency / Cost / Tokens]
    UI -->|Clusters| Embed[Embedding Clusters]
    UI -->|Testing| Playground[Prompt Playground]
    DB -->|Evaluation| Evals[LLM Evaluators]
    Evals -->|Relevance / Faithfulness| Scores[Quality Scores]
    Evals -->|Toxicity / Hallucination| Safety[Safety Checks]
    DB -->|Export| Datasets[Versioned Datasets]
    Datasets -->|Fine-tuning| Training[Model Training]
    subgraph Frameworks[Instrumented Frameworks]
        LC[LangChain]
        LI[LlamaIndex]
        OAI[OpenAI]
        Bed[Bedrock]
        Hay[Haystack]
    end
    App --- Frameworks
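The Latency / Cost / Tokens dashboard in the diagram reduces to simple arithmetic over span attributes. A hypothetical sketch of the cost math, with made-up placeholder prices (not real rates):

PRICES_PER_1K = {  # (prompt, completion) USD per 1K tokens -- placeholder values
    "gpt-4o": (0.005, 0.015),
}

def span_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of one LLM span from its token counts."""
    p_in, p_out = PRICES_PER_1K[model]
    return prompt_tokens / 1000 * p_in + completion_tokens / 1000 * p_out

# 1200 prompt + 400 completion tokens:
# 1.2 * 0.005 + 0.4 * 0.015 = 0.012
print(round(span_cost("gpt-4o", 1200, 400), 4))  # 0.012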

Deployment Options

Mode         Description                                Best For
Notebook     px.launch_app() in Jupyter                 Rapid experimentation
Self-hosted  Docker container with persistent storage   Team collaboration
Cloud        Arize cloud platform                       Production monitoring
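For the self-hosted row, a minimal sketch of a Docker invocation. The image name and port below are assumptions based on common Phoenix documentation; the volume path is hypothetical, so verify both against the current official docs.

# Run Phoenix in a container, exposing the default UI port 6006
# and mounting a named volume for persistent trace storage (path is
# an assumption -- check the official image documentation)
docker run -d \
  -p 6006:6006 \
  -v phoenix_data:/mnt/data \
  arizephoenix/phoenix:latest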

References

1)
Arize Phoenix GitHub Repository. github.com/Arize-ai/phoenix
2)
Official Documentation. phoenix.arize.com
3)
Statsig. “Arize Phoenix AI Observability.” statsig.com