====== Langfuse ======
**Langfuse** is an open-source LLM observability platform that provides tracing, evaluation, prompt management, and cost tracking for production LLM applications.(([[https://github.com/langfuse/langfuse|Langfuse GitHub Repository]])) With over **24,000 GitHub stars** and an MIT license, it has become the leading open-source tool for monitoring and debugging AI applications in production.
| **Repository** | [[https://github.com/langfuse/langfuse|github.com/langfuse/langfuse]] |
| **License** | MIT |
| **Language** | TypeScript, Python |
| **Stars** | 24K+ |
| **Category** | LLM Observability |
===== Key Features =====
* **Application Tracing**: Captures the full request lifecycle, including LLM calls, retrieval, [[embeddings|embeddings]], tool calls, and API operations
* **[[llm_as_judge|LLM-as-a-Judge]]**: Native support for automated evaluation scoring on traces and observations
* **Prompt Management**: Versioned prompt storage with a UI for management and playground testing
* **Cost Tracking**: Automatic per-trace and per-span tracking of token usage and model costs
* **50+ Integrations**: Native support for [[langchain|LangChain]], [[llamaindex|LlamaIndex]], [[openai|OpenAI]], and OpenTelemetry
* **Self-Hostable**: Full self-hosting via Docker Compose or Kubernetes with no vendor lock-in(([[https://langfuse.com/docs|Langfuse Documentation]]))
* **Low-Overhead Instrumentation**: Asynchronous background flushing avoids adding blocking latency to the application
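The cost-tracking feature reduces to simple arithmetic over the token counts recorded on each LLM span. A minimal stdlib-only sketch of that calculation (the price table and record shapes are illustrative placeholders, not Langfuse's actual rates or schema):

```python
# Illustrative sketch, not Langfuse's internal code: aggregate model cost
# across all LLM spans in a trace. Prices are hypothetical placeholders.
PRICES_PER_1K = {"gpt-4o": {"input": 0.005, "output": 0.015}}

def trace_cost(spans):
    """Sum token-based model cost over all LLM spans in a trace."""
    total = 0.0
    for span in spans:
        price = PRICES_PER_1K[span["model"]]
        total += span["input_tokens"] / 1000 * price["input"]
        total += span["output_tokens"] / 1000 * price["output"]
    return round(total, 6)

spans = [
    {"model": "gpt-4o", "input_tokens": 1200, "output_tokens": 300},
    {"model": "gpt-4o", "input_tokens": 800, "output_tokens": 150},
]
print(trace_cost(spans))  # 0.01675
```

Langfuse performs this aggregation server-side from the usage data reported by the SDKs, so the application only ships raw token counts.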
===== Architecture =====
Langfuse V4 (March 2026) employs an observations-first, immutable data model aligned with OpenTelemetry spans:(([[https://langfuse.com/blog/2026-03-10-simplify-langfuse-for-scale|Langfuse V4 Architecture Blog]]))
* **PostgreSQL**: Transactional data, such as users, organizations, projects, API keys, prompts, datasets, and evaluation settings
* **ClickHouse**: Immutable tracing data, with observations and scores stored as rows and traces existing as correlation IDs
* **Redis + BullMQ**: Event queues for asynchronous processing
* **Ingestion Pipeline**: Native Python/JS SDKs, 50+ integrations, and OpenTelemetry endpoints; asynchronous batching avoids adding latency to the instrumented application
The V4 architecture shifted to an observations-first model where traces are correlation IDs (like session_id) rather than top-level entities, with immutable spans ingested via OTel protocols.
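The observations-first model can be sketched in a few lines: observations are immutable records, and a "trace" is nothing more than the set of observations sharing a correlation ID, rather than a stored top-level entity. Field names below are illustrative, not the actual ClickHouse schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# frozen=True mirrors the append-only, immutable nature of ingested spans
@dataclass(frozen=True)
class Observation:
    id: str
    trace_id: str   # correlation ID, analogous to an OTel trace_id
    name: str
    latency_ms: int

def group_by_trace(observations):
    """Reconstruct 'traces' on read by grouping on the correlation ID."""
    traces = defaultdict(list)
    for obs in observations:
        traces[obs.trace_id].append(obs)
    return dict(traces)

obs = [
    Observation("o1", "t1", "retrieval", 40),
    Observation("o2", "t1", "llm-call", 900),
    Observation("o3", "t2", "llm-call", 700),
]
print(sorted(group_by_trace(obs)))  # ['t1', 't2']
```

Because traces are derived at query time rather than mutated in place, ingestion stays append-only, which suits ClickHouse's storage model.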
<code>
graph TB
    subgraph Apps["Instrumented Applications"]
        App1[Python App + SDK]
        App2[JS/TS App + SDK]
        App3[OpenTelemetry]
        App4[LiteLLM Gateway]
    end
    subgraph Ingestion["Ingestion Layer"]
        Queue[Redis + BullMQ]
        Batch[Micro-Batch Processor]
    end
    subgraph Storage["Storage Layer"]
        PG[(PostgreSQL - Transactional)]
        CH[(ClickHouse - Traces/Spans)]
    end
    subgraph Features["Feature Layer"]
        Trace[Trace Explorer]
        Eval[Evaluation Engine]
        Prompt[Prompt Manager]
        Cost[Cost Dashboard]
        Metrics[Metrics and Analytics]
    end
    subgraph UI["Web Dashboard"]
        Dashboard[Dashboard Views]
        Filters[Saved Filters]
        Graphs[Agent Graphs]
    end
    Apps --> Ingestion
    Queue --> Batch
    Batch --> Storage
    Storage --> Features
    Features --> UI
</code>
===== Tracing Capabilities =====
Langfuse captures the full request lifecycle with rich detail:(([[https://langfuse.com/docs|Langfuse Documentation]]))
* **LLM Operations**: Inputs, outputs, latency, token usage, and model parameters
* **Non-LLM Operations**: Retrieval steps, embedding generation, tool calls, and API requests
* **Session Tracking**: Multi-turn conversations with user identification
* **Agent Graphs**: Visual representation of agent decision flows
* **Environment Tagging**: Separate traces by development, staging, and production environments
* **Custom Attributes**: Arbitrary metadata for filtering and analysis
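Environment tags and custom attributes exist so traces can be sliced on read. A stdlib-only sketch of that filtering idea, assuming a simplified dict shape for trace records (not the SDK's actual format):

```python
# Illustrative only: filter trace records by environment tag and arbitrary
# metadata attributes, as the Langfuse UI does with saved filters.
traces = [
    {"id": "t1", "environment": "production", "metadata": {"user_id": "u1"}},
    {"id": "t2", "environment": "staging",    "metadata": {"user_id": "u2"}},
    {"id": "t3", "environment": "production", "metadata": {"user_id": "u2"}},
]

def filter_traces(traces, environment=None, **metadata):
    """Return IDs of traces matching an environment and metadata attributes."""
    matched = []
    for trace in traces:
        if environment and trace["environment"] != environment:
            continue
        if any(trace["metadata"].get(k) != v for k, v in metadata.items()):
            continue
        matched.append(trace["id"])
    return matched

print(filter_traces(traces, environment="production", user_id="u2"))  # ['t3']
```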
===== Evaluation Features =====
Langfuse supports multiple evaluation approaches:
* **[[llm_as_judge|LLM-as-a-Judge]]**: Automated scoring using LLMs to evaluate trace quality
* **Dataset Experiments**: Run evaluations against curated datasets
* **Score Storage**: All scores stored in ClickHouse alongside traces for analysis
* **Custom Evaluators**: Define custom scoring functions for domain-specific quality metrics
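A custom evaluator of the kind listed above is just a deterministic function that maps an output to a score in [0, 1], which can then be attached to a trace. The keyword-coverage rule below is a made-up example metric, not a Langfuse API:

```python
# Hypothetical domain-specific evaluator: fraction of required keywords
# that appear in the model's answer, as a score between 0 and 1.
def keyword_coverage(answer: str, required_keywords: list[str]) -> float:
    """Return the fraction of required keywords present in the answer."""
    if not required_keywords:
        return 1.0
    answer_lower = answer.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in answer_lower)
    return hits / len(required_keywords)

score = keyword_coverage(
    "RAG retrieves documents and grounds the LLM's answer in them.",
    ["retrieve", "ground", "LLM"],
)
print(score)  # 1.0
```

In practice such a function would run against each trace's output and the resulting score would be stored alongside the trace in ClickHouse.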
===== Integrations =====
Langfuse provides native integrations with the major LLM frameworks:
* **[[langchain|LangChain]] / [[langgraph|LangGraph]]**: Automatic tracing via callback handlers
* **[[llamaindex|LlamaIndex]]**: Native callback integration
* **[[openai|OpenAI]] SDK**: Direct capture of prompts, completions, and token usage
* **OpenTelemetry**: Standard OTel protocol support, reported to account for roughly 60% of Langfuse Cloud traffic
* **[[lite_llm|LiteLLM]]**: Gateway-level tracing for multi-provider setups
===== Code Example =====
<code python>
import os

from langfuse.decorators import langfuse_context, observe
from openai import OpenAI

# The decorator-based SDK reads credentials from environment variables
# (they can also be set in the shell before starting the application).
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or self-hosted URL

client = OpenAI()

@observe()
def retrieve_context(query: str) -> str:
    """Retrieve relevant context for the query."""
    # Your retrieval logic here
    langfuse_context.update_current_observation(
        metadata={"retriever": "hybrid", "top_k": 5}
    )
    return "Retrieved context about the topic..."

@observe()
def generate_answer(query: str) -> str:
    """Full RAG pipeline with automatic tracing."""
    context = retrieve_context(query)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query},
        ],
    )
    # Attach a score to the current trace
    langfuse_context.score_current_trace(
        name="relevance", value=0.9,
        comment="High relevance to query",
    )
    return response.choices[0].message.content

answer = generate_answer("How does RAG work?")
print(answer)

langfuse_context.flush()  # Ensure all buffered events are sent before exit
</code>
===== See Also =====
* [[langsmith|LangSmith]]
* [[arize_phoenix|Arize Phoenix]]
* [[langchain|LangChain]]
* [[deepeval|DeepEval]]
* [[promptfoo|Promptfoo]]
===== References =====