World of Workflows Benchmark

World of Workflows (WoW) is a realistic enterprise agent benchmark introduced by Gupta et al. from Skyfall AI (arXiv:2601.22130) that evaluates LLM agents in a ServiceNow-based environment incorporating over 4,000 business rules and 55 active workflows.¹⁾ The benchmark reveals a critical gap between consumer-grade agentic task completion and enterprise-grade reliability: frontier LLMs suffer from “dynamics blindness,” consistently failing to predict invisible cascading side effects of their actions. GPT-5.1 achieves only 2% constrained success rate and Gemini-3-Pro only 6%, demonstrating that current agents are fundamentally unprepared for enterprise deployment.

Enterprise Environment Design

WoW simulates a production-grade ServiceNow instance with realistic complexity:²⁾

4,000+ business rules: Conditional logic governing data validation, field auto-population, access control, and cross-table consistency
55 active workflows: Multi-step processes triggered by record state changes, involving approvals, notifications, escalations, and database mutations
Interconnected databases: Relational tables where changes cascade through foreign key relationships, workflow triggers, and business rule chains
Limited observability: Agents interact through standard ServiceNow APIs without visibility into the underlying workflow engine state

WoW-Bench: Task Taxonomy

WoW-bench comprises 234 tasks across four categories:

Autonomous Task Completion: Standard agentic tasks (create, update, query records) evaluated by goal achievement
Data-Level Constraint Understanding: Tasks requiring knowledge of field validation, mandatory fields, and referential integrity
Dynamics Prediction (Forward): Predicting what database state changes will occur after an action (audit prediction)
Tool Prediction (Inverse Dynamics): Determining what action sequence would produce a desired state change

Tasks are generated via Tool-Dependency Graph Sampling, creating realistic multi-hop trajectories that test connected data flows rather than isolated API calls.

# Illustration of the dynamics blindness problem in enterprise agents
class EnterpriseAgent:
    def update_incident(self, incident_id: str, new_state: str):
        # Agent sees: simple state update
        self.api.update("incident", incident_id, {"state": new_state})
 
        # Agent does NOT see the cascading effects:
        # 1. Business rule auto-sets resolution_code if state=resolved
        # 2. Workflow triggers approval chain for priority=1 incidents
        # 3. Business rule updates parent change_request status
        # 4. Notification workflow emails assignment group
        # 5. SLA business rule recalculates response metrics
        # Result: 5+ hidden database mutations from one API call
 
class WoWEvaluator:
    def evaluate(self, agent_actions: list, ground_truth: dict) -> dict:
        tsr = self.compute_task_success_rate(agent_actions, ground_truth)
        tsruc = self.compute_constrained_success(agent_actions, ground_truth)
        return {"TSR": tsr, "TSRUC": tsruc}

Key Findings

Dynamics Blindness

Frontier LLMs consistently fail to predict the invisible, cascading side effects triggered by workflows and business rules. Even when an agent completes the primary task correctly (high TSR), it unknowingly violates constraints imposed by hidden system dynamics (near-zero TSRUC).

Performance Results

Model	TSR (Goal)	TSRUC (Constrained)	TSRUC with Audits
GPT-5.1	32%	2%	14%
Gemini-3-Pro	42%	6%	16%

The massive gap between TSR and TSRUC quantifies dynamics blindness: agents can perform surface-level tasks but cannot reason about hidden state transitions.

The Observability Gap

<latex> \text{Observability Gap} = \text{TSR} - \text{TSRUC} = P(\text{goal} \mid a) - P(\text{goal} \wedge \text{constraints} \mid a) </latex>

When agents receive audit-level feedback (detailed database state change logs), TSRUC improves from 2% to 14% for GPT-5.1, confirming that the information deficit is a primary failure mode.³⁾

Why Grounded World Modeling is Needed

WoW motivates a new paradigm for enterprise agents: grounded world modeling. Rather than relying solely on API responses, agents must:

Mentally simulate hidden state transitions triggered by their actions
Maintain internal models of business rule chains and workflow triggers
Predict cascading effects before committing actions
Bridge the observability gap when high-fidelity feedback is unavailable

This is analogous to how experienced enterprise administrators develop intuition for system behavior through years of operational exposure.

Enterprise vs. Consumer Benchmarks

Existing agent benchmarks (WebArena, SWE-bench, OSWorld) test surface-level task completion similar to consumer applications. WoW introduces enterprise-specific challenges:

Hidden workflows: Automated processes invisible to the agent
Cascading side effects: Single actions trigger chains of database mutations
Large database state: Thousands of interrelated records with complex schemas
Constraint density: Hundreds of business rules affecting each table

References

¹⁾ , ³⁾

Gupta et al., "World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems," arXiv:2601.22130, 2026

²⁾

WoW GitHub Repository

AI Agent Knowledge Base

Sidebar

Table of Contents

World of Workflows Benchmark

Enterprise Environment Design

WoW-Bench: Task Taxonomy

Key Findings

Dynamics Blindness

Performance Results

The Observability Gap

Why Grounded World Modeling is Needed

Enterprise vs. Consumer Benchmarks

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

World of Workflows Benchmark

Enterprise Environment Design

WoW-Bench: Task Taxonomy

Key Findings

Dynamics Blindness

Performance Results

The Observability Gap

Why Grounded World Modeling is Needed

Enterprise vs. Consumer Benchmarks

See Also

References

Page Tools