====== Arize Phoenix ======
**Arize Phoenix** is an open-source AI observability platform for tracing, evaluating, and troubleshooting LLM applications. With over **9,000 stars** on GitHub, it provides end-to-end visibility into AI system behavior using **OpenTelemetry-based instrumentation** — capturing traces of LLM flows across frameworks like LangChain, LlamaIndex, Haystack, DSPy, and providers like OpenAI and Bedrock.((Arize Phoenix GitHub Repository. [[https://github.com/Arize-ai/phoenix|github.com/Arize-ai/phoenix]]))
Phoenix combines tracing, evaluation, and dataset management in one tool, purpose-built for LLM-specific issues like prompt drift, hallucinations, tool flakiness, and cost analysis. It runs locally in Jupyter notebooks, self-hosted, or in the cloud — with zero vendor lock-in thanks to OpenTelemetry.((Official Documentation. [[https://phoenix.arize.com|phoenix.arize.com]]))
===== How It Works =====
Phoenix instruments your LLM application using **OpenTelemetry (OTEL)** and the **OpenInference** specification for AI-specific telemetry. Every LLM call, retrieval step, tool invocation, and reasoning chain is captured as a span within a trace. These traces are visualized in a web UI that shows the full execution path, latencies, token counts, and costs.
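The span-within-a-trace model can be sketched without any dependencies. The `Span` class and `trace_llm_pipeline()` below are hypothetical illustrations of the hierarchy Phoenix records (a root trace containing one span per LLM call, retrieval step, or tool invocation, each tagged with OpenInference-style attributes); they are not Phoenix or OpenTelemetry APIs.

```python
# Toy model of the span/trace hierarchy Phoenix captures -- illustrative only.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str            # OpenInference span kind: "LLM", "RETRIEVER", "TOOL", ...
    trace_id: str        # shared by every span in the same trace
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    start: float = field(default_factory=time.time)

def trace_llm_pipeline() -> Span:
    """Build one trace: a chain span with a retrieval and an LLM child."""
    trace_id = uuid.uuid4().hex
    root = Span("rag_pipeline", "CHAIN", trace_id)
    root.children.append(Span("vector_search", "RETRIEVER", trace_id,
                              attributes={"retrieval.documents": 3}))
    root.children.append(Span("chat_completion", "LLM", trace_id,
                              attributes={"llm.model_name": "gpt-4o",
                                          "llm.token_count.total": 512}))
    return root

root = trace_llm_pipeline()
print(root.name, [c.kind for c in root.children])
```

In the real system, each span also carries timing and cost data, and the shared `trace_id` is what lets the Phoenix UI reassemble the full execution path.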
The platform then enables **LLM-powered evaluations** — running benchmarks for response quality, retrieval relevance, faithfulness, and toxicity against your traced data. Combined with versioned datasets and a prompt playground, Phoenix creates a complete experimentation loop.
===== Key Features =====
* **OpenTelemetry tracing** — Vendor-agnostic, portable traces across any LLM stack
* **Auto-instrumentation** — One-line setup for LangChain, LlamaIndex, OpenAI, Bedrock
* **LLM evaluations** — Quality, relevance, toxicity benchmarks with human annotations
* **Prompt Playground** — Side-by-side prompt testing and Span Replay for debugging
* **Embedding clustering** — Group similar inputs/responses to isolate issues
* **Dataset versioning** — Track changes across experiments and fine-tuning
* **Flexible deployment** — Jupyter notebooks, self-hosted, or cloud.((Statsig. "Arize Phoenix AI Observability." [[https://www.statsig.com/perspectives/arize-phoenix-ai-observability|statsig.com]]))
===== Installation and Usage =====
<code bash>
pip install arize-phoenix openinference-instrumentation-openai
</code>

<code python>
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Launch Phoenix (UI served at http://localhost:6006)
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports to Phoenix
tracer_provider = register()

# Auto-instrument OpenAI
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Now all OpenAI calls are automatically traced
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
# The trace is captured automatically and visible in the Phoenix UI

# Run evaluations on traced data
from phoenix.evals import OpenAIModel, llm_classify

eval_results = llm_classify(
    dataframe=px.Client().get_spans_dataframe(),
    model=OpenAIModel(model="gpt-4o"),
    template="Is this response relevant to the query? {input} -> {output}",
    rails=["relevant", "irrelevant"],
)
</code>
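The ''rails'' parameter constrains the judge model's free-text verdict to a fixed label set. The mechanism can be sketched as follows; `snap_to_rails` and `toy_judge` are hypothetical stand-ins (the toy judge uses naive keyword overlap in place of a real LLM call), not Phoenix functions:

```python
# Hedged sketch of rails-based classification: the judge's answer is
# snapped to a fixed label set; anything unmatched becomes NOT_PARSABLE.
def snap_to_rails(answer: str, rails: list[str]) -> str:
    normalized = answer.strip().lower()
    # Check longer rails first so "irrelevant" is not matched as "relevant"
    for rail in sorted(rails, key=len, reverse=True):
        if rail in normalized:
            return rail
    return "NOT_PARSABLE"

def toy_judge(query: str, response: str) -> str:
    # Stand-in for an LLM judge: naive keyword overlap.
    overlap = set(query.lower().split()) & set(response.lower().split())
    return "relevant" if overlap else "irrelevant"

rails = ["relevant", "irrelevant"]
verdict = toy_judge("What is quantum computing?",
                    "Quantum computing uses qubits.")
print(snap_to_rails(verdict, rails))  # → relevant
```

Snapping to rails is what makes LLM-as-judge output aggregatable: every evaluated span gets exactly one label from a known set, so scores can be counted and charted in the UI.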
===== Architecture =====
<mermaid>
%%{init: {'theme': 'dark'}}%%
graph TB
    App([Your LLM App]) -->|Auto-Instrumented| OTEL[OpenTelemetry SDK]
    OTEL -->|Spans + Traces| Collector[Phoenix Collector]
    Collector -->|Store| DB[Trace Database]
    DB -->|Query| UI[Phoenix Web UI]
    UI -->|Trace View| Traces[Trace Explorer]
    UI -->|Metrics| Dashboard[Latency / Cost / Tokens]
    UI -->|Clusters| Embed[Embedding Clusters]
    UI -->|Testing| Playground[Prompt Playground]
    DB -->|Evaluation| Evals[LLM Evaluators]
    Evals -->|Relevance / Faithfulness| Scores[Quality Scores]
    Evals -->|Toxicity / Hallucination| Safety[Safety Checks]
    DB -->|Export| Datasets[Versioned Datasets]
    Datasets -->|Fine-tuning| Training[Model Training]
    subgraph Frameworks[Instrumented Frameworks]
        LC[LangChain]
        LI[LlamaIndex]
        OAI[OpenAI]
        Bed[Bedrock]
        Hay[Haystack]
    end
    App --- Frameworks
</mermaid>
===== Deployment Options =====
^ Mode ^ Description ^ Best For ^
| Notebook | ''px.launch_app()'' in Jupyter | Rapid experimentation |
| Self-hosted | Docker container with persistent storage | Team collaboration |
| Cloud | Arize cloud platform((Arize AI Platform. [[https://arize.com|arize.com]])) | Production monitoring |
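For the self-hosted mode, the official Docker image can be run directly. Port 6006 is the Phoenix default for the UI; the OTLP gRPC port and container/volume names below are illustrative and should be adjusted to your environment:

```shell
# Run Phoenix self-hosted: UI on 6006, OTLP gRPC collector on 4317
docker run -d --name phoenix \
  -p 6006:6006 -p 4317:4317 \
  arizephoenix/phoenix:latest
```

Instrumented applications then point their OpenTelemetry exporter at the container's collector endpoint instead of a local ''launch_app()'' session.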
===== See Also =====
* [[deepeval|DeepEval — Unit-Test Style LLM Evaluation]]
* [[promptfoo|Promptfoo — LLM Evaluation and Red Teaming]]
* [[chainlit|Chainlit — Conversational AI Framework]]
* [[weaviate|Weaviate — Vector Database]]
===== References =====