====== W&B Weave ======
**W&B Weave** is an open-source observability, evaluation, and monitoring toolkit by [[https://wandb.ai|Weights & Biases]] for developing reliable LLM and generative AI applications. It provides full tracing of LLM calls, systematic evaluation frameworks, production monitoring with guardrails, and deep integration with the broader W&B ecosystem for end-to-end AI development.
===== Overview =====
Weave addresses the fundamental challenges of LLM application development: non-determinism, subjective output quality, and sensitivity to prompt changes. It provides visibility into every LLM call, versions all artifacts (prompts, datasets, models, configs), and enables systematic experimentation and comparison.
Core capabilities:
* **Tracing** -- Full visibility into every LLM call, input, output, latency, and token cost
* **Evaluation** -- Systematic benchmarking with custom scorers, datasets, and aggregated metrics
* **Monitoring** -- Production guardrails for toxicity, hallucination, and quality issues on live traffic
* **Versioning** -- Automatic versioning of prompts, datasets, models, and configurations
* **Feedback** -- Human annotation collection via UI or API
* **Leaderboards** -- Color-coded comparison matrices for ranking models and prompts
===== Architecture =====
<code>
graph TD
    subgraph App["Your LLM Application"]
        A["@weave.op: scorer"]
        B["@weave.op: chain"]
        C["weave.Model: predict"]
    end
    App --> Platform
    subgraph Platform["W&B Weave Platform"]
        D[Traces]
        E[Evaluations]
        F[Leaderboards]
        G[Versions]
        H[Feedback]
        I[Monitoring]
    end
    Platform --> Ecosystem
    subgraph Ecosystem["W&B Ecosystem"]
        J[Models / Fine-tuning / Registry / Experiments]
    end
</code>
===== Getting Started =====
<code bash>
pip install weave
</code>

<code python>
import weave

# Initialize a Weave project
weave.init("my-llm-project")
</code>
===== Core Concepts: @weave.op =====
The @weave.op() decorator is the foundation of Weave. It turns any function into a traceable, versioned operation that logs inputs, outputs, and metadata automatically.
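To build intuition for what gets captured, here is a toy stand-in (illustrative only, not Weave's actual implementation) showing the kind of record a traced operation produces: the operation name, inputs, output, and latency per call.

<code python>
import functools
import time

def toy_op(fn):
    """Toy stand-in for @weave.op(): appends one trace entry per call."""
    traces = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        output = fn(*args, **kwargs)
        traces.append({
            "op": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.time() - start,
        })
        return output

    wrapper.traces = traces
    return wrapper

@toy_op
def add(a, b):
    return a + b

add(2, 3)
print(add.traces[0]["op"], add.traces[0]["output"])  # add 5
</code>

The real decorator additionally versions the function's code and ships the records to the Weave backend, as the example below shows.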
<code python>
import weave
import openai

weave.init("my-app")
client = openai.OpenAI()

@weave.op()
def generate_answer(question: str) -> str:
    """Generate an answer using an LLM. Automatically traced by Weave."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

@weave.op()
def check_relevance(question: str, answer: str) -> dict:
    """Score whether the answer is relevant to the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user",
             "content": f"Is this answer relevant? "
                        f"Question: {question} Answer: {answer} "
                        f"Reply yes or no."}
        ],
    )
    is_relevant = "yes" in response.choices[0].message.content.lower()
    return {"relevant": is_relevant}

# Every call is traced with full I/O logging
answer = generate_answer("What is retrieval-augmented generation?")
relevance = check_relevance("What is RAG?", answer)
</code>
===== Evaluation Framework =====
Weave's evaluation system benchmarks LLM applications against datasets using custom scorers:
<code python>
import asyncio
import openai
import weave
from weave import Evaluation, Model

weave.init("eval-project")
client = openai.OpenAI()

# Define a Model with a predict method
class QAModel(Model):
    model_name: str
    temperature: float = 0.0

    @weave.op()
    def predict(self, question: str) -> dict:
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[{"role": "user", "content": question}],
        )
        return {"answer": response.choices[0].message.content}

# Define scorers
@weave.op()
def exact_match(expected: str, output: dict) -> dict:
    return {"match": expected.lower() == output["answer"].lower()}

@weave.op()
def length_check(output: dict) -> dict:
    return {"reasonable_length": 10 < len(output["answer"]) < 2000}

# Create evaluation dataset
dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

# Run evaluation
model = QAModel(model_name="gpt-4o-mini")
evaluation = Evaluation(dataset=dataset, scorers=[exact_match, length_check])
asyncio.run(evaluation.evaluate(model))

# Results appear in the Weave UI with aggregated metrics
</code>
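The aggregated metrics shown in the UI are, conceptually, per-scorer summaries over the dataset. A minimal sketch of that aggregation for boolean scorer outputs like the ones above (this mimics the idea, not Weave's exact summary code):

<code python>
def aggregate_scores(rows):
    """Collapse per-example scorer outputs into mean true-fractions,
    keyed as "<scorer>.<metric>". Boolean metrics only, for illustration."""
    totals, counts = {}, {}
    for row in rows:
        for scorer, result in row.items():
            for key, value in result.items():
                name = f"{scorer}.{key}"
                totals[name] = totals.get(name, 0) + bool(value)
                counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}

rows = [
    {"exact_match": {"match": True},  "length_check": {"reasonable_length": True}},
    {"exact_match": {"match": False}, "length_check": {"reasonable_length": True}},
    {"exact_match": {"match": True},  "length_check": {"reasonable_length": False}},
]
print(aggregate_scores(rows))  # both metrics come out to roughly 0.67
</code>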
===== Leaderboards =====
Weave Leaderboards provide color-coded comparison matrices for ranking models and prompts:
- Run evaluations across multiple model variants
- Select evaluation runs in the UI and click Compare
- View performance matrices with baseline highlighting
- Identify regressions and improvements at a glance
This is particularly valuable for systematic prompt optimization and model selection.
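The comparison matrix a leaderboard renders boils down to ranking variants by their scores and highlighting deltas against a chosen baseline. An illustrative, non-Weave sketch of that comparison (metric names and values here are made up):

<code python>
def rank_against_baseline(results, baseline):
    """Rank variants by mean metric value, attaching per-metric deltas
    versus the baseline so regressions stand out.
    `results` maps variant name -> {metric: value}."""
    base = results[baseline]
    table = []
    for name, metrics in results.items():
        delta = {m: metrics[m] - base[m] for m in base}
        mean = sum(metrics.values()) / len(metrics)
        table.append((name, mean, delta))
    return sorted(table, key=lambda row: row[1], reverse=True)

results = {
    "gpt-4o-mini": {"exact_match": 0.66, "reasonable_length": 1.0},
    "gpt-4o":      {"exact_match": 1.00, "reasonable_length": 1.0},
}
best = rank_against_baseline(results, "gpt-4o-mini")[0]
print(best[0])  # gpt-4o
</code>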
===== Pre-built Scorers =====
Weave includes production-ready scorers for common quality checks:
* **Hallucination detection** -- Checks outputs against source context
* **Toxicity/moderation** -- Flags harmful or inappropriate content
* **Context precision** -- Measures retrieval relevance in RAG pipelines
* **Token cost tracking** -- Monitors spending across calls
* **Latency measurement** -- Tracks response times
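Custom scorers follow the same shape as the pre-built ones: a function (in practice wrapped with @weave.op()) that receives the model output plus any dataset columns it needs and returns a dict of metrics. A toy keyword-based moderation scorer as an illustration (the word list and logic are made up, not Weave's built-in moderation):

<code python>
TOXIC_MARKERS = {"idiot", "stupid", "hate"}  # illustrative word list only

def toxicity_scorer(output: dict) -> dict:
    """Flag outputs containing any marker word. A real moderation scorer
    would call a moderation model; decorate with @weave.op() so the
    score attaches to the trace."""
    words = set(output["answer"].lower().split())
    hits = words & TOXIC_MARKERS
    return {"flagged": bool(hits), "matches": sorted(hits)}

print(toxicity_scorer({"answer": "That is a stupid idea"}))
# {'flagged': True, 'matches': ['stupid']}
</code>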
===== Production Monitoring =====
Weave applies evaluation scorers to live production traffic:
* Set guardrails that alert on quality degradation
* Monitor token costs and latency in real-time
* Collect human feedback through the UI or API
* Compare production behavior against evaluation baselines
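As a sketch of the guardrail idea, the same scorer functions can be applied to live outputs, with an alert firing when a rolling pass rate drops below a threshold (window size and threshold here are illustrative, not a Weave API):

<code python>
from collections import deque

class Guardrail:
    """Toy rolling-window guardrail: alert when the fraction of passing
    scores in the last `window` calls drops below `threshold`."""

    def __init__(self, window: int = 50, threshold: float = 0.9):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, passed: bool) -> bool:
        """Record one scorer result; return True if an alert should fire."""
        self.scores.append(passed)
        pass_rate = sum(self.scores) / len(self.scores)
        return pass_rate < self.threshold

guard = Guardrail(window=4, threshold=0.75)
for ok in [True, True, False, False]:
    alert = guard.observe(ok)
print(alert)  # True: pass rate fell to 0.5 over the window
</code>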
===== W&B Ecosystem Integration =====
Weave connects with the broader Weights & Biases platform:
* **W&B Models** -- Track model training and fine-tuning
* **W&B Registry** -- Version datasets, models, prompts, and code
* **Experiments** -- Traditional ML experiment tracking
* **Serverless fine-tuning** -- RL-based model improvement
This enables end-to-end workflows from model development through production monitoring.
===== Comparison to LangSmith =====
^ ^ W&B Weave ^ LangSmith ^
| **Focus** | Full AI lifecycle with W&B | LLM app observability |
| **Tracing** | @weave.op decorator | Auto (LangChain) or SDK |
| **Evaluation** | Datasets + custom scorers | Datasets + LLM judges |
| **Unique** | Leaderboards, W&B ecosystem | LangGraph integration |
| **Monitoring** | Production guardrails | Dashboard metrics |
| **Best for** | W&B-native ML teams | LangChain users |
===== References =====
* [[https://github.com/wandb/weave|Weave on GitHub]]
* [[https://docs.wandb.ai/weave/|Weave Documentation]]
* [[https://wandb.ai/site/weave/|Weave Product Page]]
* [[https://wandb.ai/site/|Weights & Biases]]
===== See Also =====
* [[langsmith|LangSmith]] -- LangChain's observability platform
* [[langfuse|Langfuse]] -- Open-source LLM observability
* [[mlflow|MLflow]] -- Open-source ML lifecycle management
* [[wandb|Weights & Biases]] -- ML experiment tracking platform