Synthetic Data Generation Agents

Synthetic Data Generation Agents are agentic AI pipelines that autonomously create high-quality training datasets by decomposing complex data generation tasks into manageable subtasks executed by specialized LLM-based agents. The AgentSynth framework, published as a conference paper at ICLR 2026, demonstrates this approach for generating diverse computer-use task trajectories at scale.

Overview

Training capable AI agents requires large volumes of high-quality, diverse task data with corresponding trajectories. Human annotation is expensive (often hundreds of dollars per trajectory) and difficult to scale. Agentic synthetic data generation addresses this by leveraging information asymmetry — the principle that executing a task step-by-step is significantly easier than reasoning about the complete solution at once.

By decomposing generation into forward-execution subtasks, agentic pipelines produce datasets that are simple to create but challenging to solve, providing both training data and discriminative benchmarks.
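A toy sketch of this asymmetry (the operations and function names here are hypothetical, not from AgentSynth): composing a task forward costs one LLM-free step per subtask, while recovering a solution from only the start and goal states requires searching over exponentially many action sequences.

```python
import random
from itertools import product

# Three toy "actions"; a real pipeline would use computer-use tools instead.
OPS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2, "-1": lambda x: x - 1}

def generate_task(start: int, k: int, rng: random.Random) -> dict:
    """Forward generation: apply k random operations; cost is linear in k,
    and the generator trivially knows the full solution trajectory."""
    ops = [rng.choice(list(OPS)) for _ in range(k)]
    value = start
    for op in ops:
        value = OPS[op](value)
    return {"start": start, "goal": value, "solution": ops}

def solve_task(start: int, goal: int, k: int):
    """Solving: given only (start, goal), brute-force over |OPS|**k sequences."""
    for seq in product(OPS, repeat=k):
        value = start
        for op in seq:
            value = OPS[op](value)
        if value == goal:
            return list(seq)
    return None

rng = random.Random(0)
task = generate_task(start=5, k=6, rng=rng)            # cheap: 6 forward steps
found = solve_task(task["start"], task["goal"], k=6)   # expensive: up to 3**6 checks
assert found is not None
```

The solver may return a different valid sequence than the generator used; what matters is that generation scales linearly while naive solving scales exponentially in the number of composed steps.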

AgentSynth Framework

AgentSynth is a scalable, cost-efficient pipeline for automatically synthesizing task and trajectory datasets for generalist computer-use agents. It was developed at UC Berkeley by Jingxu Xie, Dylan Xu, Xuandong Zhao, and Dawn Song.

Multi-Agent Architecture

AgentSynth deploys six distinct LLM-based agents in a coordinated pipeline, covering stages such as subtask proposal, execution, verification, and task composition.

Difficulty Modulation

A key innovation is precise control over task complexity by varying the number of composed subtasks. Each individual subtask is straightforward, but chaining them creates increasingly challenging long-horizon tasks:

Difficulty Level    Subtasks    Agent Success Rate
Level 1             1           18%
Level 2             2           12%
Level 3             3            8%
Level 6             6            4%

This steep performance degradation demonstrates the benchmark's discriminative power and highlights substantial room for agent improvement.
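As a rough intuition for why success rates fall with task length, consider a simplified model (an assumption for illustration, not the paper's analysis) in which each subtask succeeds independently with probability p; the composed task of k subtasks then succeeds with probability p**k, which decays exponentially:

```python
# Illustrative only: assume each subtask succeeds independently with
# probability p, so a chain of k subtasks succeeds with probability p**k.
def chained_success_rate(p: float, k: int) -> float:
    return p ** k

for k in (1, 2, 3, 6):
    print(f"{k} subtasks: {chained_success_rate(0.7, k):.1%}")
# 1 subtasks: 70.0%
# 2 subtasks: 49.0%
# 3 subtasks: 34.3%
# 6 subtasks: 11.8%
```

Note that the rates reported in the table above decay more gently than a pure independence model would predict, so this sketch is intuition for the trend, not a fit to the benchmark numbers.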

Code Example

Simplified agentic synthetic data generation pipeline:

from dataclasses import dataclass
 
@dataclass
class SubTask:
    description: str
    tools_required: list[str]
    trajectory: list[dict]
    verified: bool = False
 
class AgentSynthPipeline:
    def __init__(self, llm_client, environment):
        self.llm = llm_client
        self.env = environment
 
    def propose_subtask(self, context: dict) -> SubTask:
        """Ask the LLM to propose one simple subtask using the available tools."""
        prompt = f"Propose a simple computer task using: {context['available_tools']}"
        response = self.llm.generate(prompt)
        return SubTask(
            description=response.text,
            tools_required=response.tools,
            trajectory=[],
        )
 
    def execute_subtask(self, subtask: SubTask) -> SubTask:
        """Roll the environment forward, letting the LLM choose each action."""
        trajectory = []
        state = self.env.reset()
        for _ in range(self.env.max_steps):
            response = self.llm.generate(
                f"Task: {subtask.description}\nState: {state}\nNext action:"
            )
            action = response.text
            state, done = self.env.step(action)
            trajectory.append({"state": state, "action": action})
            if done:
                break
        subtask.trajectory = trajectory
        return subtask
 
    def verify_subtask(self, subtask: SubTask) -> bool:
        """Use the LLM as a judge of whether the trajectory completed the task."""
        verification = self.llm.generate(
            f"Did this trajectory complete the task?\n"
            f"Task: {subtask.description}\n"
            f"Trajectory: {subtask.trajectory}"
        )
        subtask.verified = verification.text.lower().startswith("yes")
        return subtask.verified
 
    def compose_tasks(self, subtasks: list[SubTask], difficulty: int) -> dict:
        """Chain the first `difficulty` subtasks into one long-horizon task."""
        selected = subtasks[:difficulty]
        composed_description = " Then, ".join(s.description for s in selected)
        composed_trajectory = []
        for s in selected:
            composed_trajectory.extend(s.trajectory)
        return {
            "description": composed_description,
            "difficulty": difficulty,
            "trajectory": composed_trajectory,
            "num_subtasks": len(selected),
        }

Cost Efficiency

AgentSynth achieves an average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Over 6,000 diverse and realistic tasks were generated using the pipeline, integrated with the OSWorld environment for authentic computer tool interactions.
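A back-of-envelope comparison makes the gap concrete. The $0.60-per-trajectory figure is from the source; the $300 human-annotation cost below is an assumed midpoint of the "hundreds of dollars" cited earlier and is purely illustrative:

```python
# Trajectories obtainable per fixed budget.
AGENTSYNTH_COST = 0.60   # dollars per trajectory (reported)
HUMAN_COST = 300.00      # dollars per trajectory (assumed, for illustration)

budget = 3600.0  # dollars; at $0.60 each this matches ~6,000 trajectories
print(round(budget / AGENTSYNTH_COST))  # 6000 synthetic trajectories
print(round(budget / HUMAN_COST))       # 12 human-annotated trajectories
```

Under these assumptions the same budget buys roughly 500x more synthetic trajectories than human-annotated ones.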

Broader Ecosystem

Several other frameworks complement AgentSynth in the synthetic data generation space.
