====== Arize Phoenix ======

**Arize Phoenix** is an open-source AI observability platform for tracing, evaluating, and troubleshooting LLM applications. With over **9,000 stars** on GitHub, it provides end-to-end visibility into AI system behavior using **OpenTelemetry-based instrumentation**, capturing traces of LLM flows across frameworks such as LangChain, LlamaIndex, Haystack, and DSPy, and providers such as OpenAI and Bedrock.((Arize Phoenix GitHub Repository. [[https://github.com/Arize-ai/phoenix|github.com/Arize-ai/phoenix]]))

Phoenix combines tracing, evaluation, and dataset management in one tool, purpose-built for LLM-specific issues such as prompt drift, hallucinations, tool flakiness, and cost analysis. It runs locally in Jupyter notebooks, self-hosted, or in the cloud, with zero vendor lock-in thanks to OpenTelemetry.((Official Documentation. [[https://phoenix.arize.com|phoenix.arize.com]]))

===== How It Works =====

Phoenix instruments your LLM application using **OpenTelemetry (OTEL)** and the **OpenInference** specification for AI-specific telemetry. Every LLM call, retrieval step, tool invocation, and reasoning chain is captured as a span within a trace. These traces are visualized in a web UI that shows the full execution path, latencies, token counts, and costs.

The platform then enables **LLM-powered evaluations**, running benchmarks for response quality, retrieval relevance, faithfulness, and toxicity against your traced data. Combined with versioned datasets and a prompt playground, Phoenix forms a complete experimentation loop.
===== Key Features =====

  * **OpenTelemetry tracing**: vendor-agnostic, portable traces across any LLM stack
  * **Auto-instrumentation**: one-line setup for LangChain, LlamaIndex, OpenAI, and Bedrock
  * **LLM evaluations**: quality, relevance, and toxicity benchmarks with human annotations
  * **Prompt Playground**: side-by-side prompt testing and Span Replay for debugging
  * **Embedding clustering**: group similar inputs and responses to isolate issues
  * **Dataset versioning**: track changes across experiments and fine-tuning
  * **Flexible deployment**: Jupyter notebooks, self-hosted, or cloud((Statsig. "Arize Phoenix AI Observability." [[https://www.statsig.com/perspectives/arize-phoenix-ai-observability|statsig.com]]))

===== Installation and Usage =====

<code python>
# Install Phoenix and the OpenAI instrumentation:
#   pip install arize-phoenix openinference-instrumentation-openai

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Launch the Phoenix app (serves the web UI, by default at http://localhost:6006)
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports spans to Phoenix
tracer_provider = register()

# Auto-instrument the OpenAI client
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, all OpenAI calls are traced automatically
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
# The trace is captured and visible in the Phoenix UI

# Run an LLM-powered evaluation over the traced data
from phoenix.evals import OpenAIModel, llm_classify

eval_results = llm_classify(
    dataframe=px.Client().get_spans_dataframe(),
    model=OpenAIModel(model="gpt-4o"),
    template="Is this response relevant to the query?\n{input} -> {output}",
    rails=["relevant", "irrelevant"],
)
</code>

===== Architecture =====

<code>
%%{init: {'theme': 'dark'}}%%
graph TB
    App([Your LLM App]) -->|Auto-Instrumented| OTEL[OpenTelemetry SDK]
    OTEL -->|Spans + Traces| Collector[Phoenix Collector]
    Collector -->|Store| DB[Trace Database]
    DB -->|Query| UI[Phoenix Web UI]
    UI -->|Trace View| Traces[Trace Explorer]
    UI -->|Metrics| Dashboard[Latency / Cost / Tokens]
    UI -->|Clusters| Embed[Embedding Clusters]
    UI -->|Testing| Playground[Prompt Playground]
    DB -->|Evaluation| Evals[LLM Evaluators]
    Evals -->|Relevance / Faithfulness| Scores[Quality Scores]
    Evals -->|Toxicity / Hallucination| Safety[Safety Checks]
    DB -->|Export| Datasets[Versioned Datasets]
    Datasets -->|Fine-tuning| Training[Model Training]
    subgraph Frameworks[Instrumented Frameworks]
        LC[LangChain]
        LI[LlamaIndex]
        OAI[OpenAI]
        Bed[Bedrock]
        Hay[Haystack]
    end
    App --- Frameworks
</code>

===== Deployment Options =====

^ Mode ^ Description ^ Best For ^
| Notebook | ''px.launch_app()'' in Jupyter | Rapid experimentation |
| Self-hosted | Docker container with persistent storage | Team collaboration |
| Cloud | Arize cloud platform | Production monitoring((Arize AI Platform. [[https://arize.com|arize.com]])) |

===== See Also =====

  * [[deepeval|DeepEval — Unit-Test Style LLM Evaluation]]
  * [[promptfoo|Promptfoo — LLM Evaluation and Red Teaming]]
  * [[chainlit|Chainlit — Conversational AI Framework]]
  * [[weaviate|Weaviate — Vector Database]]

===== References =====