Synthetic Data Generation Agents

Synthetic Data Generation Agents are agentic AI pipelines that autonomously create high-quality training datasets by decomposing complex data generation tasks into manageable subtasks executed by specialized LLM-based agents. The AgentSynth framework, published as a conference paper at ICLR 2026, demonstrates this approach for generating diverse computer-use task trajectories at scale.

Overview

Training capable AI agents requires large volumes of high-quality, diverse task data with corresponding trajectories. Human annotation is expensive (often hundreds of dollars per trajectory) and difficult to scale. Agentic synthetic data generation addresses this by leveraging information asymmetry — the principle that executing a task step-by-step is significantly easier than reasoning about the complete solution at once.

By decomposing generation into forward-execution subtasks, agentic pipelines produce datasets that are simple to create but challenging to solve, providing both training data and discriminative benchmarks.
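A toy sketch of this asymmetry (the operations and function names here are hypothetical, not from AgentSynth): composing a task forward costs one LLM-free step per subtask, while recovering a solution from only the start and goal states requires searching over exponentially many action sequences.

```python
import random
from itertools import product

# Three toy "actions"; a real pipeline would use computer-use tools instead.
OPS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2, "-1": lambda x: x - 1}

def generate_task(start: int, k: int, rng: random.Random) -> dict:
    """Forward generation: apply k random operations; cost is linear in k,
    and the generator trivially knows the full solution trajectory."""
    ops = [rng.choice(list(OPS)) for _ in range(k)]
    value = start
    for op in ops:
        value = OPS[op](value)
    return {"start": start, "goal": value, "solution": ops}

def solve_task(start: int, goal: int, k: int):
    """Solving: given only (start, goal), brute-force over |OPS|**k sequences."""
    for seq in product(OPS, repeat=k):
        value = start
        for op in seq:
            value = OPS[op](value)
        if value == goal:
            return list(seq)
    return None

rng = random.Random(0)
task = generate_task(start=5, k=6, rng=rng)            # cheap: 6 forward steps
found = solve_task(task["start"], task["goal"], k=6)   # expensive: up to 3**6 checks
assert found is not None
```

The solver may return a different valid sequence than the generator used; what matters is that generation scales linearly while naive solving scales exponentially in the number of composed steps.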

AgentSynth Framework

AgentSynth is a scalable, cost-efficient pipeline for automatically synthesizing task and trajectory datasets for generalist computer-use agents. It was developed at UC Berkeley by Jingxu Xie, Dylan Xu, Xuandong Zhao, and Dawn Song.

Multi-Agent Architecture

AgentSynth deploys six distinct LLM-based agents in a coordinated pipeline, covering stages such as subtask proposal, execution, verification, and task composition.

Difficulty Modulation

A key innovation is precise control over task complexity by varying the number of composed subtasks. Each individual subtask is straightforward, but chaining them creates increasingly challenging long-horizon tasks:

Difficulty Level    Subtasks    Agent Success Rate
Level 1             1           18%
Level 2             2           12%
Level 3             3            8%
Level 6             6            4%

This steep performance degradation demonstrates the benchmark's discriminative power and highlights substantial room for agent improvement.
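As a rough intuition for why success rates fall with task length, consider a simplified model (an assumption for illustration, not the paper's analysis) in which each subtask succeeds independently with probability p; the composed task of k subtasks then succeeds with probability p**k, which decays exponentially:

```python
# Illustrative only: assume each subtask succeeds independently with
# probability p, so a chain of k subtasks succeeds with probability p**k.
def chained_success_rate(p: float, k: int) -> float:
    return p ** k

for k in (1, 2, 3, 6):
    print(f"{k} subtasks: {chained_success_rate(0.7, k):.1%}")
# 1 subtasks: 70.0%
# 2 subtasks: 49.0%
# 3 subtasks: 34.3%
# 6 subtasks: 11.8%
```

Note that the rates reported in the table above decay more gently than a pure independence model would predict, so this sketch is intuition for the trend, not a fit to the benchmark numbers.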

Code Example

Simplified agentic synthetic data generation pipeline:

from dataclasses import dataclass
 
@dataclass
class SubTask:
    description: str
    tools_required: list[str]
    trajectory: list[dict]
    verified: bool = False
 
class AgentSynthPipeline:
    def __init__(self, llm_client, environment):
        self.llm = llm_client
        self.env = environment
 
    def propose_subtask(self, context: dict) -> SubTask:
        """Ask the LLM to propose one simple subtask using the available tools."""
        prompt = f"Propose a simple computer task using: {context['available_tools']}"
        response = self.llm.generate(prompt)
        return SubTask(
            description=response.text,
            tools_required=response.tools,
            trajectory=[],
        )
 
    def execute_subtask(self, subtask: SubTask) -> SubTask:
        """Roll the environment forward, letting the LLM choose each action."""
        trajectory = []
        state = self.env.reset()
        for _ in range(self.env.max_steps):
            response = self.llm.generate(
                f"Task: {subtask.description}\nState: {state}\nNext action:"
            )
            action = response.text
            state, done = self.env.step(action)
            trajectory.append({"state": state, "action": action})
            if done:
                break
        subtask.trajectory = trajectory
        return subtask
 
    def verify_subtask(self, subtask: SubTask) -> bool:
        """Use the LLM as a judge of whether the trajectory completed the task."""
        verification = self.llm.generate(
            f"Did this trajectory complete the task?\n"
            f"Task: {subtask.description}\n"
            f"Trajectory: {subtask.trajectory}"
        )
        subtask.verified = verification.text.lower().startswith("yes")
        return subtask.verified
 
    def compose_tasks(self, subtasks: list[SubTask], difficulty: int) -> dict:
        """Chain the first `difficulty` subtasks into one long-horizon task."""
        selected = subtasks[:difficulty]
        composed_description = " Then, ".join(s.description for s in selected)
        composed_trajectory = []
        for s in selected:
            composed_trajectory.extend(s.trajectory)
        return {
            "description": composed_description,
            "difficulty": difficulty,
            "trajectory": composed_trajectory,
            "num_subtasks": len(selected),
        }

Cost Efficiency

AgentSynth achieves an average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Over 6,000 diverse and realistic tasks were generated using the pipeline, integrated with the OSWorld environment for authentic computer tool interactions.
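A back-of-envelope comparison makes the gap concrete. The $0.60-per-trajectory figure is from the source; the $300 human-annotation cost below is an assumed midpoint of the "hundreds of dollars" cited earlier and is purely illustrative:

```python
# Trajectories obtainable per fixed budget.
AGENTSYNTH_COST = 0.60   # dollars per trajectory (reported)
HUMAN_COST = 300.00      # dollars per trajectory (assumed, for illustration)

budget = 3600.0  # dollars; at $0.60 each this matches ~6,000 trajectories
print(round(budget / AGENTSYNTH_COST))  # 6000 synthetic trajectories
print(round(budget / HUMAN_COST))       # 12 human-annotated trajectories
```

Under these assumptions the same budget buys roughly 500x more synthetic trajectories than human-annotated ones.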

Broader Ecosystem

Several other frameworks complement AgentSynth in the synthetic data generation space.
