====== Langfuse ======

**Langfuse** is an open-source LLM observability platform that provides tracing, evaluation, prompt management, and cost tracking for production LLM applications.(([[https://github.com/langfuse/langfuse|Langfuse GitHub Repository]])) With over **24,000 GitHub stars** and MIT licensing, it has become the leading open-source alternative for monitoring and debugging AI applications in production.

| **Repository** | [[https://github.com/langfuse/langfuse|github.com/langfuse/langfuse]] |
| **License** | MIT |
| **Language** | TypeScript, Python |
| **Stars** | 24K+ |
| **Category** | LLM Observability |

===== Key Features =====

  * **Application Tracing**: Captures the full request lifecycle, including LLM calls, retrieval, [[embeddings|embeddings]], tool calls, and API operations
  * **[[llm_as_judge|LLM-as-a-Judge]]**: Native support for automated evaluation scoring on traces and observations
  * **Prompt Management**: Versioned prompt storage with a UI for management and playground testing
  * **Cost Tracking**: Automatic per-trace and per-span tracking of token usage and model costs
  * **50+ Integrations**: Native support for [[langchain|LangChain]], [[llamaindex|LlamaIndex]], [[openai|OpenAI]], and OpenTelemetry
  * **Self-Hostable**: Full self-hosting via Docker Compose or Kubernetes with no vendor lock-in(([[https://langfuse.com/docs|Langfuse Documentation]]))
  * **Zero-Latency Instrumentation**: Asynchronous background flushing adds no latency to instrumented applications

===== Architecture =====

Langfuse V4 (March 2026) employs an observations-first, immutable data model aligned with OpenTelemetry spans:(([[https://langfuse.com/blog/2026-03-10-simplify-langfuse-for-scale|Langfuse V4 Architecture Blog]]))

  * **PostgreSQL**: Handles transactional data: users, organizations, projects, API keys, prompts, datasets, and evaluation settings
  * **ClickHouse**: Stores immutable tracing data: observations, scores, and traces as correlation IDs
  * **Redis + BullMQ**: Manages event queues for asynchronous processing
  * **Ingestion Pipeline**: Native Python/JS SDKs, 50+ integrations, and OpenTelemetry endpoints; asynchronous batching ensures zero added latency

The V4 architecture shifted to an observations-first model in which traces are correlation IDs (like ''session_id'') rather than top-level entities, with immutable spans ingested via OTel protocols.

<code>
graph TB
    subgraph Apps["Instrumented Applications"]
        App1[Python App + SDK]
        App2[JS/TS App + SDK]
        App3[OpenTelemetry]
        App4[LiteLLM Gateway]
    end
    subgraph Ingestion["Ingestion Layer"]
        Queue[Redis + BullMQ]
        Batch[Micro-Batch Processor]
    end
    subgraph Storage["Storage Layer"]
        PG[(PostgreSQL - Transactional)]
        CH[(ClickHouse - Traces/Spans)]
    end
    subgraph Features["Feature Layer"]
        Trace[Trace Explorer]
        Eval[Evaluation Engine]
        Prompt[Prompt Manager]
        Cost[Cost Dashboard]
        Metrics[Metrics and Analytics]
    end
    subgraph UI["Web Dashboard"]
        Dashboard[Dashboard Views]
        Filters[Saved Filters]
        Graphs[Agent Graphs]
    end
    Apps --> Ingestion
    Queue --> Batch
    Batch --> Storage
    Storage --> Features
    Features --> UI
</code>

===== Tracing Capabilities =====

Langfuse captures the full request lifecycle with rich detail:(([[https://langfuse.com/docs|Langfuse Documentation]]))

  * **LLM Operations**: Inputs, outputs, latency, token usage, and model parameters
  * **Non-LLM Operations**: Retrieval steps, embedding generation, tool calls, and API requests
  * **Session Tracking**: Multi-turn conversations with user identification
  * **Agent Graphs**: Visual representation of agent decision flows
  * **Environment Tagging**: Separate traces by development, staging, and production
  * **Custom Attributes**: Arbitrary metadata for filtering and analysis

===== Evaluation Features =====

Langfuse supports multiple evaluation approaches:

  * **[[llm_as_judge|LLM-as-a-Judge]]**: Automated scoring using LLMs to evaluate trace quality
  * **Dataset Experiments**: Run evaluations against curated datasets
  * **Score Storage**: All scores stored in ClickHouse alongside traces for analysis
  * **Custom Evaluators**: Define custom scoring functions for domain-specific quality metrics

===== Integrations =====

Langfuse provides native integrations with the major LLM frameworks:

  * **[[langchain|LangChain]] / [[langgraph|LangGraph]]**: Automatic tracing via callback handlers
  * **[[llamaindex|LlamaIndex]]**: Native callback integration
  * **[[openai|OpenAI]] SDK**: Direct capture of prompts, completions, and token usage
  * **OpenTelemetry**: Standard OTel protocol support (60% of cloud traffic)
  * **[[lite_llm|LiteLLM]]**: Gateway-level tracing for multi-provider setups

===== Code Example =====

<code python>
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"  # or self-hosted URL
)

client = OpenAI()


@observe()
def retrieve_context(query: str) -> str:
    """Retrieve relevant context for the query."""
    # Your retrieval logic here
    langfuse_context.update_current_observation(
        metadata={"retriever": "hybrid", "top_k": 5}
    )
    return "Retrieved context about the topic..."


@observe()
def generate_answer(query: str) -> str:
    """Full RAG pipeline with automatic tracing."""
    context = retrieve_context(query)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query}
        ]
    )
    # Score the trace
    langfuse_context.score_current_trace(
        name="relevance",
        value=0.9,
        comment="High relevance to query"
    )
    return response.choices[0].message.content


answer = generate_answer("How does RAG work?")
print(answer)

langfuse.flush()  # Ensure all events are sent
</code>

===== See Also =====

  * [[langsmith|LangSmith]]
  * [[arize_phoenix|Arize Phoenix]]
  * [[langchain|LangChain]]
  * [[deepeval|DeepEval]]
  * [[promptfoo|Promptfoo]]

===== References =====
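===== Self-Hosting Sketch =====

The self-hosting option and the storage layers described above (PostgreSQL for transactional data, ClickHouse for traces, Redis for queues) can be sketched as a minimal Docker Compose file. This is an illustrative assumption only, not the official compose file: service names, environment variable names, and image tags here are guesses, so consult the ''docker-compose.yml'' shipped in the Langfuse repository for the supported configuration.

<code yaml>
# Illustrative sketch only -- see the official docker-compose.yml
# in the langfuse repository for the supported setup.
services:
  langfuse:                      # web UI + ingestion API
    image: langfuse/langfuse:latest
    depends_on: [postgres, clickhouse, redis]
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:langfuse@postgres:5432/langfuse
      CLICKHOUSE_URL: http://clickhouse:8123        # assumed variable name
      REDIS_CONNECTION_STRING: redis://redis:6379   # assumed variable name
      NEXTAUTH_SECRET: change-me
      SALT: change-me

  postgres:                      # users, projects, prompts, datasets
    image: postgres:16
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse

  clickhouse:                    # immutable observations, scores, traces
    image: clickhouse/clickhouse-server:latest

  redis:                         # BullMQ event queues
    image: redis:7
</code>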