Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
World of Workflows (WoW) is a realistic enterprise agent benchmark introduced by Gupta et al. from Skyfall AI (arXiv:2601.22130) that evaluates LLM agents in a ServiceNow-based environment incorporating over 4,000 business rules and 55 active workflows. The benchmark reveals a critical gap between consumer-grade agentic task completion and enterprise-grade reliability: frontier LLMs suffer from “dynamics blindness,” consistently failing to predict invisible cascading side effects of their actions. GPT-5.1 achieves only 2% constrained success rate and Gemini-3-Pro only 6%, demonstrating that current agents are fundamentally unprepared for enterprise deployment.
WoW simulates a production-grade ServiceNow instance with realistic complexity:
WoW-bench comprises 234 tasks across four categories:
Tasks are generated via Tool-Dependency Graph Sampling, creating realistic multi-hop trajectories that test connected data flows rather than isolated API calls.
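A minimal sketch of how graph-based trajectory sampling might work. The tool names and edges below are illustrative stand-ins, not the benchmark's actual dependency graph:

```python
import random

# Hypothetical tool-dependency graph: an edge A -> B means tool B consumes
# data produced by tool A. Names are illustrative, not WoW's real tools.
TOOL_GRAPH = {
    "create_incident": ["assign_incident"],
    "assign_incident": ["update_incident"],
    "update_incident": ["resolve_incident", "escalate_incident"],
    "resolve_incident": [],
    "escalate_incident": ["update_change_request"],
    "update_change_request": [],
}

def sample_trajectory(graph, start, max_hops=4, rng=random):
    """Walk the dependency graph to produce a connected multi-hop task,
    so each step genuinely depends on the previous step's output."""
    path = [start]
    while len(path) < max_hops:
        successors = graph[path[-1]]
        if not successors:
            break
        path.append(rng.choice(successors))
    return path

trajectory = sample_trajectory(TOOL_GRAPH, "create_incident")
print(trajectory)
```

Because every hop follows a data-flow edge, a sampled task cannot be solved by isolated API calls; the agent must thread outputs from one tool into the next.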
```python
# Illustration of the dynamics blindness problem in enterprise agents
class EnterpriseAgent:
    def update_incident(self, incident_id: str, new_state: str):
        # Agent sees: simple state update
        self.api.update("incident", incident_id, {"state": new_state})
        # Agent does NOT see the cascading effects:
        # 1. Business rule auto-sets resolution_code if state=resolved
        # 2. Workflow triggers approval chain for priority=1 incidents
        # 3. Business rule updates parent change_request status
        # 4. Notification workflow emails assignment group
        # 5. SLA business rule recalculates response metrics
        # Result: 5+ hidden database mutations from one API call


class WoWEvaluator:
    def evaluate(self, agent_actions: list, ground_truth: dict) -> dict:
        tsr = self.compute_task_success_rate(agent_actions, ground_truth)
        tsruc = self.compute_constrained_success(agent_actions, ground_truth)
        return {"TSR": tsr, "TSRUC": tsruc}
```
Frontier LLMs consistently fail to predict the invisible, cascading side effects triggered by workflows and business rules. Even when an agent completes the primary task correctly (high TSR), it unknowingly violates constraints imposed by hidden system dynamics (near-zero TSRUC).
| Model | TSR (Goal) | TSRUC (Constrained) | TSRUC with Audits |
|---|---|---|---|
| GPT-5.1 | 32% | 2% | 14% |
| Gemini-3-Pro | 42% | 6% | 16% |
The massive gap between TSR and TSRUC quantifies dynamics blindness: agents can perform surface-level tasks but cannot reason about hidden state transitions.
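The two metrics can be sketched directly from their definitions: TSR counts tasks where the goal was met, TSRUC counts tasks where the goal was met *and* no hidden constraint was violated. The per-task field names below are hypothetical, chosen only for illustration:

```python
def score(results):
    """Compute TSR, TSRUC, and the observability gap over task results.
    Each result is a dict with booleans 'goal_met' and 'constraints_met'
    (hypothetical field names, not the benchmark's actual schema)."""
    n = len(results)
    tsr = sum(r["goal_met"] for r in results) / n
    tsruc = sum(r["goal_met"] and r["constraints_met"] for r in results) / n
    return {"TSR": tsr, "TSRUC": tsruc, "gap": tsr - tsruc}

# Toy run mirroring the reported pattern: many goal completions,
# few that also respect the hidden constraints.
toy = (
    [{"goal_met": True, "constraints_met": False}] * 3
    + [{"goal_met": True, "constraints_met": True}] * 1
    + [{"goal_met": False, "constraints_met": False}] * 6
)
print(score(toy))  # TSR = 0.4, TSRUC = 0.1
```

A large positive gap means the agent "succeeds" only under an evaluator that ignores side effects, which is exactly the dynamics-blindness signature.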
<latex>
\text{Observability Gap} = \text{TSR} - \text{TSRUC} = P(\text{goal} \mid a) - P(\text{goal} \wedge \text{constraints} \mid a)
</latex>
When agents receive audit-level feedback (detailed database state change logs), TSRUC improves from 2% to 14% for GPT-5.1, confirming that the information deficit is a primary failure mode.
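The audit-feedback setup can be sketched as a loop that feeds row-level mutation logs back into the agent's observation instead of only the API response. The environment interface below is a hypothetical mock, not the benchmark's actual API:

```python
class MockEnv:
    """Toy environment whose step() returns both the API response and an
    audit log of cascading mutations (hypothetical interface)."""
    def __init__(self):
        self.done = False

    def reset(self, task):
        self.done = False
        return {"task": task}

    def step(self, action):
        # One visible update plus hidden cascades from business rules.
        audit_log = [
            {"table": "incident", "field": "state", "new": action},
            {"table": "incident", "field": "resolution_code", "new": "auto"},
            {"table": "change_request", "field": "status", "new": "review"},
        ]
        self.done = True
        return {"ok": True}, audit_log

def act_with_audit(decide, env, task, max_steps=10):
    """Audit-feedback loop: after each action the agent observes the full
    list of database mutations it caused, not just the API response."""
    observation = env.reset(task)
    history = []
    for _ in range(max_steps):
        action = decide(observation)
        result, audit_log = env.step(action)
        observation = {"api_response": result, "db_mutations": audit_log}
        history.append(observation)
        if env.done:
            break
    return history

history = act_with_audit(lambda obs: "resolved", MockEnv(), "close incident")
print(len(history[0]["db_mutations"]))  # 3 mutations from one action
```

Surfacing the mutations turns the hidden dynamics into observable state, which is the mechanism the reported TSRUC improvement attributes the gains to.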
WoW motivates a new paradigm for enterprise agents: grounded world modeling. Rather than relying solely on API responses, agents must:
This is analogous to how experienced enterprise administrators develop intuition for system behavior through years of operational exposure.
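One way to picture grounded world modeling is an agent that predicts the cascade an action would trigger *before* executing it, and aborts when a predicted side effect violates a constraint. The rule encoding below is a toy assumption, not ServiceNow's actual rule engine:

```python
# Hypothetical world model: business rules encoded as (condition, effect)
# pairs. Real rules are far richer; this only illustrates the idea.
BUSINESS_RULES = [
    (lambda u: u.get("state") == "resolved",
     {"field": "resolution_code", "value": "auto-set"}),
    (lambda u: u.get("priority") == 1,
     {"field": "approval", "value": "chain-started"}),
]

def predict_cascade(update):
    """Enumerate the hidden mutations an update would trigger."""
    return [effect for cond, effect in BUSINESS_RULES if cond(update)]

def safe_update(update, forbidden_fields):
    """Act only if no predicted side effect touches a forbidden field."""
    cascade = predict_cascade(update)
    violations = [e for e in cascade if e["field"] in forbidden_fields]
    return ("apply", cascade) if not violations else ("abort", violations)

decision, effects = safe_update({"state": "resolved"}, {"approval"})
print(decision)  # 'apply' -- the resolution cascade touches no forbidden field
```

The design choice is the same one a seasoned administrator makes implicitly: simulate the system's reaction first, act second.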
Existing agent benchmarks (WebArena, SWE-bench, OSWorld) test surface-level task completion similar to consumer applications. WoW introduces enterprise-specific challenges: