W&B Weave is an open-source observability, evaluation, and monitoring toolkit by Weights & Biases for developing reliable LLM and generative AI applications. It provides full tracing of LLM calls, systematic evaluation frameworks, production monitoring with guardrails, and deep integration with the broader W&B ecosystem for end-to-end AI development.
Weave addresses the fundamental challenges of LLM application development: non-determinism, subjective output quality, and sensitivity to prompt changes. It provides visibility into every LLM call, versions all artifacts (prompts, datasets, models, configs), and enables systematic experimentation and comparison.
Core capabilities:
+-------------------------------------------------+
| Your LLM Application |
| +----------+ +----------+ +--------------+ |
| | @weave.op| | @weave.op| | weave.Model | |
| | (scorer) | | (chain) | | (predict) | |
| +----+-----+ +----+-----+ +------+-------+ |
+-------|--------------|--------------|-----------+
| | |
v v v
+-------------------------------------------------+
| W&B Weave Platform |
| +----------+ +-----------+ +--------------+ |
| | Traces | |Evaluations| | Leaderboards | |
| +----------+ +-----------+ +--------------+ |
| +----------+ +-----------+ +--------------+ |
| | Versions | | Feedback | | Monitoring | |
| +----------+ +-----------+ +--------------+ |
+----------------------+--------------------------+
|
v
+-------------------------------------------------+
| W&B Ecosystem |
| Models | Fine-tuning | Registry | Experiments |
+-------------------------------------------------+
pip install weave
import weave

# Initialize a Weave project
weave.init("my-llm-project")
The @weave.op() decorator is the foundation of Weave. It turns any function into a traceable, versioned operation that logs inputs, outputs, and metadata automatically.
import weave
import openai

weave.init("my-app")
client = openai.OpenAI()

@weave.op()
def generate_answer(question: str) -> str:
    """Generate an answer using an LLM. Automatically traced by Weave."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

@weave.op()
def check_relevance(question: str, answer: str) -> dict:
    """Score whether the answer is relevant to the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": f"Is this answer relevant? "
                                        f"Question: {question} Answer: {answer} "
                                        f"Reply yes or no."}
        ]
    )
    is_relevant = "yes" in response.choices[0].message.content.lower()
    return {"relevant": is_relevant}

# Every call is traced with full I/O logging
answer = generate_answer("What is retrieval-augmented generation?")
relevance = check_relevance("What is RAG?", answer)
Weave's evaluation system benchmarks LLM applications against datasets using custom scorers:
import asyncio
import openai
import weave
from weave import Evaluation, Model

weave.init("eval-project")
client = openai.OpenAI()  # OpenAI client used by predict

# Define a Model with a predict method
class QAModel(Model):
    model_name: str
    temperature: float = 0.0

    @weave.op()
    def predict(self, question: str) -> dict:
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[{"role": "user", "content": question}]
        )
        return {"answer": response.choices[0].message.content}

# Define scorers
@weave.op()
def exact_match(expected: str, output: dict) -> dict:
    return {"match": expected.lower() == output["answer"].lower()}

@weave.op()
def length_check(output: dict) -> dict:
    return {"reasonable_length": 10 < len(output["answer"]) < 2000}

# Create evaluation dataset
dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

# Run evaluation
model = QAModel(model_name="gpt-4o-mini")
evaluation = Evaluation(dataset=dataset, scorers=[exact_match, length_check])
asyncio.run(evaluation.evaluate(model))
# Results appear in Weave UI with aggregated metrics
Weave Leaderboards provide color-coded comparison matrices for ranking models and prompts:
This is particularly valuable for systematic prompt optimization and model selection.
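As a rough illustration of the ranking behind such a matrix (this is plain Python, not Weave's Leaderboard API; the model names and scores below are invented, whereas a real Leaderboard aggregates Evaluation results logged to the platform):

```python
def build_leaderboard(results: dict[str, dict[str, list[float]]]) -> list[tuple[str, float]]:
    """Aggregate per-scorer results into one mean score per model, best first.

    results maps model name -> scorer name -> list of 0/1 (or float) scores.
    """
    rows = []
    for model, scorers in results.items():
        # Average each scorer first, then average across scorers,
        # so every scorer is weighted equally regardless of example count
        per_scorer = [sum(v) / len(v) for v in scorers.values()]
        rows.append((model, sum(per_scorer) / len(per_scorer)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Invented example scores for two hypothetical evaluation runs
results = {
    "gpt-4o":      {"exact_match": [1, 1, 0], "length_check": [1, 1, 1]},
    "gpt-4o-mini": {"exact_match": [1, 0, 0], "length_check": [1, 1, 1]},
}
for model, score in build_leaderboard(results):
    print(f"{model}: {score:.2f}")
```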
Weave includes production-ready scorers for common quality checks.
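Structurally, a scorer is just a function that takes the model output (plus any reference fields) and returns a dict of metrics. The two scorers below are hypothetical sketches of that shape; in a real project each would be decorated with @weave.op() so results are logged alongside the trace, omitted here to keep the snippet dependency-free:

```python
import json

def valid_json(output: dict) -> dict:
    """Check that the model's answer parses as JSON."""
    try:
        json.loads(output["answer"])
        return {"valid_json": True}
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"valid_json": False}

def no_refusal(output: dict) -> dict:
    """Flag boilerplate refusals so they can be monitored over time."""
    phrases = ("i can't", "i cannot", "as an ai")
    answer = output.get("answer", "").lower()
    return {"refused": any(p in answer for p in phrases)}

print(valid_json({"answer": '{"city": "Paris"}'}))        # {'valid_json': True}
print(no_refusal({"answer": "I cannot help with that."}))  # {'refused': True}
```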
Weave applies evaluation scorers to live production traffic.
Weave connects with the broader Weights & Biases platform, enabling end-to-end workflows from model development through production monitoring.
| | W&B Weave | LangSmith |
|---|---|---|
| Focus | Full AI lifecycle with W&B | LLM app observability |
| Tracing | @weave.op decorator | Auto (LangChain) or SDK |
| Evaluation | Datasets + custom scorers | Datasets + LLM judges |
| Unique | Leaderboards, W&B ecosystem | LangGraph integration |
| Monitoring | Production guardrails | Dashboard metrics |
| Best for | W&B-native ML teams | LangChain users |