====== W&B Weave ======

**W&B Weave** is an open-source observability, evaluation, and monitoring toolkit by [[https://wandb.ai|Weights & Biases]] for developing reliable LLM and generative AI applications. It provides full tracing of LLM calls, systematic evaluation frameworks, production monitoring with guardrails, and deep integration with the broader W&B ecosystem for end-to-end AI development.

===== Overview =====

Weave addresses the fundamental challenges of LLM application development: non-determinism, subjective output quality, and sensitivity to prompt changes. It provides visibility into every LLM call, versions all artifacts (prompts, datasets, models, configs), and enables systematic experimentation and comparison.

Core capabilities:

  * **Tracing** -- Full visibility into every LLM call, input, output, latency, and token cost
  * **Evaluation** -- Systematic benchmarking with custom scorers, datasets, and aggregated metrics
  * **Monitoring** -- Production guardrails for toxicity, hallucination, and quality issues on live traffic
  * **Versioning** -- Automatic versioning of prompts, datasets, models, and configurations
  * **Feedback** -- Human annotation collection via UI or API
  * **Leaderboards** -- Color-coded comparison matrices for ranking models and prompts

===== Architecture =====

<code>
graph TD
    subgraph App["Your LLM Application"]
        A["@weave.op: scorer"]
        B["@weave.op: chain"]
        C["weave.Model: predict"]
    end
    App --> Platform
    subgraph Platform["W&B Weave Platform"]
        D[Traces]
        E[Evaluations]
        F[Leaderboards]
        G[Versions]
        H[Feedback]
        I[Monitoring]
    end
    Platform --> Ecosystem
    subgraph Ecosystem["W&B Ecosystem"]
        J[Models / Fine-tuning / Registry / Experiments]
    end
</code>

===== Getting Started =====

<code bash>
pip install weave
</code>

<code python>
import weave

# Initialize a Weave project
weave.init("my-llm-project")
</code>

===== Core Concepts: @weave.op =====

The @weave.op() decorator is the foundation of Weave.
It turns any function into a traceable, versioned operation that automatically logs inputs, outputs, and metadata.

<code python>
import weave
import openai

weave.init("my-app")
client = openai.OpenAI()

@weave.op()
def generate_answer(question: str) -> str:
    """Generate an answer using an LLM. Automatically traced by Weave."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

@weave.op()
def check_relevance(question: str, answer: str) -> dict:
    """Score whether the answer is relevant to the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": f"Is this answer relevant? "
                                        f"Question: {question} Answer: {answer} "
                                        f"Reply yes or no."},
        ],
    )
    is_relevant = "yes" in response.choices[0].message.content.lower()
    return {"relevant": is_relevant}

# Every call is traced with full I/O logging
answer = generate_answer("What is retrieval-augmented generation?")
relevance = check_relevance("What is RAG?", answer)
</code>

===== Evaluation Framework =====

Weave's evaluation system benchmarks LLM applications against datasets using custom scorers:

<code python>
import asyncio
import openai
import weave
from weave import Evaluation, Model

weave.init("eval-project")
client = openai.OpenAI()

# Define a Model with a predict method
class QAModel(Model):
    model_name: str
    temperature: float = 0.0

    @weave.op()
    def predict(self, question: str) -> dict:
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[{"role": "user", "content": question}],
        )
        return {"answer": response.choices[0].message.content}

# Define scorers
@weave.op()
def exact_match(expected: str, output: dict) -> dict:
    return {"match": expected.lower() == output["answer"].lower()}

@weave.op()
def length_check(output: dict) -> dict:
    return {"reasonable_length": 10 < len(output["answer"]) < 2000}

# Create evaluation dataset
dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

# Run evaluation
model = QAModel(model_name="gpt-4o-mini")
evaluation = Evaluation(dataset=dataset, scorers=[exact_match, length_check])
asyncio.run(evaluation.evaluate(model))

# Results appear in the Weave UI with aggregated metrics
</code>

===== Leaderboards =====

Weave Leaderboards provide color-coded comparison matrices for ranking models and prompts:

  - Run evaluations across multiple model variants
  - Select evaluation runs in the UI and click Compare
  - View performance matrices with baseline highlighting
  - Identify regressions and improvements at a glance

This is particularly valuable for systematic prompt optimization and model selection.

===== Pre-built Scorers =====

Weave includes production-ready scorers for common quality checks:

  * **Hallucination detection** -- Checks outputs against source context
  * **Toxicity/moderation** -- Flags harmful or inappropriate content
  * **Context precision** -- Measures retrieval relevance in RAG pipelines
  * **Token cost tracking** -- Monitors spending across calls
  * **Latency measurement** -- Tracks response times

===== Production Monitoring =====

Weave applies evaluation scorers to live production traffic:

  * Set guardrails that alert on quality degradation
  * Monitor token costs and latency in real time
  * Collect human feedback through the UI or API
  * Compare production behavior against evaluation baselines

===== W&B Ecosystem Integration =====

Weave connects with the broader Weights & Biases platform:

  * **W&B Models** -- Track model training and fine-tuning
  * **W&B Registry** -- Version datasets, models, prompts, and code
  * **Experiments** -- Traditional ML experiment tracking
  * **Serverless fine-tuning** -- RL-based model improvement

This enables end-to-end workflows from model development through production monitoring.
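The custom-scorer pattern described above (a function that receives the model output and returns a dict of metrics) is also what underlies production guardrails. The sketch below shows a minimal keyword-based guardrail; the blocklist, threshold logic, and function name are illustrative assumptions, not Weave built-ins, and in a real project the function would be decorated with @weave.op() so every invocation is traced.

<code python>
# Hypothetical flagged terms -- an illustrative assumption, not a Weave built-in
BLOCKLIST = {"idiot", "stupid", "hate"}

def toxicity_guardrail(output: dict) -> dict:
    """Flag outputs containing blocklisted terms.

    Follows the same convention as the custom scorers above: take the
    model output, return a dict of metrics. In a Weave project this
    function would be decorated with @weave.op() and applied to live
    traffic as a production guardrail.
    """
    words = output["answer"].lower().split()
    hits = [w.strip(".,!?") for w in words if w.strip(".,!?") in BLOCKLIST]
    return {
        "flagged": len(hits) > 0,
        "flagged_terms": hits,
    }

# Usage on a sample model output
print(toxicity_guardrail({"answer": "You are an idiot."}))
# {'flagged': True, 'flagged_terms': ['idiot']}
</code>

Because the scorer is a plain function returning a dict, the same logic can be reused in offline evaluations and attached to live traffic without modification.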
===== Comparison to LangSmith =====

| ^ W&B Weave ^ LangSmith ^
| **Focus** | Full AI lifecycle with W&B | LLM app observability |
| **Tracing** | @weave.op decorator | Auto (LangChain) or SDK |
| **Evaluation** | Datasets + custom scorers | Datasets + LLM judges |
| **Unique** | Leaderboards, W&B ecosystem | LangGraph integration |
| **Monitoring** | Production guardrails | Dashboard metrics |
| **Best for** | W&B-native ML teams | LangChain users |

===== References =====

  * [[https://github.com/wandb/weave|Weave on GitHub]]
  * [[https://docs.wandb.ai/weave/|Weave Documentation]]
  * [[https://wandb.ai/site/weave/|Weave Product Page]]
  * [[https://wandb.ai/site/|Weights & Biases]]

===== See Also =====

  * [[langsmith|LangSmith]] -- LangChain's observability platform
  * [[langfuse|Langfuse]] -- Open-source LLM observability
  * [[mlflow|MLflow]] -- Open-source ML lifecycle management
  * [[wandb|Weights & Biases]] -- ML experiment tracking platform