====== Arize Phoenix ======

**Arize Phoenix** is an open-source AI observability platform for tracing, evaluating, and troubleshooting LLM applications. With over **9,000 stars** on GitHub, it provides end-to-end visibility into AI system behavior using **OpenTelemetry-based instrumentation**, capturing traces of LLM flows across frameworks such as LangChain, LlamaIndex, Haystack, and DSPy, and providers such as OpenAI and Bedrock.((Arize Phoenix GitHub Repository. [[https://github.com/Arize-ai/phoenix|github.com/Arize-ai/phoenix]]))

Phoenix combines tracing, evaluation, and dataset management in one tool, purpose-built for LLM-specific issues such as prompt drift, hallucinations, tool flakiness, and cost analysis. It runs locally in Jupyter notebooks, self-hosted, or in the cloud, with zero vendor lock-in thanks to OpenTelemetry.((Official Documentation. [[https://phoenix.arize.com|phoenix.arize.com]]))

===== How It Works =====

Phoenix instruments your LLM application using **OpenTelemetry (OTEL)** and the **OpenInference** specification for AI-specific telemetry. Every LLM call, retrieval step, tool invocation, and reasoning chain is captured as a span within a trace. These traces are visualized in a web UI that shows the full execution path, latencies, token counts, and costs.

The platform then enables **LLM-powered evaluations**, running benchmarks for response quality, retrieval relevance, faithfulness, and toxicity against your traced data. Combined with versioned datasets and a prompt playground, Phoenix forms a complete experimentation loop.
===== Key Features =====

  * **OpenTelemetry tracing**: vendor-agnostic, portable traces across any LLM stack
  * **Auto-instrumentation**: one-line setup for LangChain, LlamaIndex, OpenAI, and Bedrock
  * **LLM evaluations**: quality, relevance, and toxicity benchmarks with human annotations
  * **Prompt Playground**: side-by-side prompt testing and Span Replay for debugging
  * **Embedding clustering**: group similar inputs and responses to isolate issues
  * **Dataset versioning**: track changes across experiments and fine-tuning
  * **Flexible deployment**: Jupyter notebooks, self-hosted, or cloud((Statsig. "Arize Phoenix AI Observability." [[https://www.statsig.com/perspectives/arize-phoenix-ai-observability|statsig.com]]))

===== Installation and Usage =====

<code python>
# Install Phoenix and the OpenAI instrumentation:
#   pip install arize-phoenix openinference-instrumentation-openai

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Launch the Phoenix app (serves the web UI, by default at http://localhost:6006)
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports spans to Phoenix
tracer_provider = register()

# Auto-instrument the OpenAI client
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, all OpenAI calls are traced automatically
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
# The trace is captured and visible in the Phoenix UI

# Run an LLM-powered evaluation over the traced data
from phoenix.evals import OpenAIModel, llm_classify

eval_results = llm_classify(
    dataframe=px.Client().get_spans_dataframe(),
    model=OpenAIModel(model="gpt-4o"),
    template="Is this response relevant to the query?\n{input} -> {output}",
    rails=["relevant", "irrelevant"],
)
</code>

===== Architecture =====

<code>
%%{init: {'theme': 'dark'}}%%
graph TB
    App([Your LLM App]) -->|Auto-Instrumented| OTEL[OpenTelemetry SDK]
    OTEL -->|Spans + Traces| Collector[Phoenix Collector]
    Collector -->|Store| DB[Trace Database]
    DB -->|Query| UI[Phoenix Web UI]
    UI -->|Trace View| Traces[Trace Explorer]
    UI -->|Metrics| Dashboard[Latency / Cost / Tokens]
    UI -->|Clusters| Embed[Embedding Clusters]
    UI -->|Testing| Playground[Prompt Playground]
    DB -->|Evaluation| Evals[LLM Evaluators]
    Evals -->|Relevance / Faithfulness| Scores[Quality Scores]
    Evals -->|Toxicity / Hallucination| Safety[Safety Checks]
    DB -->|Export| Datasets[Versioned Datasets]
    Datasets -->|Fine-tuning| Training[Model Training]
    subgraph Frameworks[Instrumented Frameworks]
        LC[LangChain]
        LI[LlamaIndex]
        OAI[OpenAI]
        Bed[Bedrock]
        Hay[Haystack]
    end
    App --- Frameworks
</code>

===== Deployment Options =====

^ Mode ^ Description ^ Best For ^
| Notebook | ''px.launch_app()'' in Jupyter | Rapid experimentation |
| Self-hosted | Docker container with persistent storage | Team collaboration |
| Cloud | Arize cloud platform | Production monitoring((Arize AI Platform. [[https://arize.com|arize.com]])) |

===== See Also =====

  * [[deepeval|DeepEval — Unit-Test Style LLM Evaluation]]
  * [[promptfoo|Promptfoo — LLM Evaluation and Red Teaming]]
  * [[chainlit|Chainlit — Conversational AI Framework]]
  * [[weaviate|Weaviate — Vector Database]]

===== References =====