====== Synthetic Data Generation Agents ======

**Synthetic Data Generation Agents** are [[agentic_ai|agentic AI]] pipelines that autonomously create high-quality training datasets by decomposing complex data generation tasks into manageable subtasks executed by specialized LLM-based agents. The **AgentSynth** framework, published as a conference paper at **ICLR 2026**(([[https://openreview.net/forum?id=CoBxmXThM6|OpenReview — AgentSynth (ICLR 2026)]])), demonstrates this approach for generating diverse computer-use task trajectories at scale.

===== Overview =====

Training capable AI agents requires large volumes of high-quality, diverse task data with corresponding trajectories. Human annotation is expensive (often hundreds of dollars per trajectory) and difficult to scale. Agentic synthetic data generation addresses this by leveraging **information asymmetry** — the principle that executing a task step by step is significantly easier than reasoning about the complete solution at once. By decomposing generation into forward-execution subtasks, agentic pipelines produce datasets that are simple to create but challenging to solve, providing both training data and discriminative benchmarks.

Recent research demonstrates that using AI agents as data scientists to automatically create training and evaluation examples through self-instruct loops produces higher-quality, more discriminative examples than passive synthetic data pipelines, with 34-point performance gaps observed on computer science question-answering tasks(([[https://news.smol.ai/issues/26-05-04-not-much/|AI News (smol.ai) — Agentic Data Generation (2026)]])).

===== AgentSynth Framework =====

AgentSynth is a scalable, cost-efficient pipeline for automatically synthesizing task and trajectory datasets for generalist computer-use agents. It was developed at UC Berkeley by Jingxu Xie, Dylan Xu, Xuandong Zhao, and Dawn Song.

=== Multi-Agent Architecture ===

AgentSynth deploys six distinct LLM-based agents in a coordinated pipeline:

  * **Task Proposer** — Generates candidate task descriptions based on available tools and environments
  * **Task Executor** — Attempts to execute proposed tasks, producing action trajectories
  * **Task Verifier** — Validates that executed trajectories correctly complete the proposed task
  * **Task Reviser** — Refines failed or ambiguous tasks based on execution feedback
  * **Follow-up Task Proposer** — Generates compositional follow-up tasks that build on completed subtasks
  * **Task Summarizer** — Produces clean task descriptions and metadata for the final dataset

A minimal control-flow sketch of how these agents coordinate appears after the difficulty table below.

=== Difficulty Modulation ===

A key innovation is precise control over task complexity by varying the number of composed subtasks. Each individual subtask is straightforward, but chaining them creates increasingly challenging long-horizon tasks:

^ Difficulty Level ^ Subtasks ^ Agent Success Rate ^
| Level 1 | 1 | 18% |
| Level 2 | 2 | 12% |
| Level 3 | 3 | 8% |
| Level 6 | 6 | 4% |

This steep performance degradation demonstrates the benchmark's discriminative power and highlights substantial room for agent improvement.
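The coordination among the six agents can be viewed as a propose-execute-verify-revise loop, with the Follow-up Task Proposer extending a verified subtask chain until a target difficulty is reached. The sketch below is an illustrative reading of that control flow, not the paper's implementation; the agent callables (''propose'', ''execute'', ''verify'', ''revise'', ''propose_followup'', ''summarize'') are hypothetical stand-ins for LLM-backed agents.

<code python>
# Minimal sketch of the six-agent coordination loop (illustrative only).
# Each agent is modeled as a plain callable; in practice each would wrap
# an LLM call, as in the pipeline example below.
from typing import Callable

def generate_task(
    propose: Callable[[], str],                    # Task Proposer
    execute: Callable[[str], list],                # Task Executor -> trajectory
    verify: Callable[[str, list], bool],           # Task Verifier
    revise: Callable[[str, list], str],            # Task Reviser
    propose_followup: Callable[[list[str]], str],  # Follow-up Task Proposer
    summarize: Callable[[list[str]], str],         # Task Summarizer
    target_difficulty: int,
    max_revisions: int = 2,
) -> dict:
    """Chain verified subtasks until the target difficulty is reached."""
    subtasks: list[str] = []
    trajectory: list = []
    while len(subtasks) < target_difficulty:
        # The first subtask comes from the proposer; later ones
        # build compositionally on the verified history.
        task = propose() if not subtasks else propose_followup(subtasks)
        for _ in range(max_revisions + 1):
            steps = execute(task)
            if verify(task, steps):
                subtasks.append(task)
                trajectory.extend(steps)
                break
            # Failed or ambiguous: let the reviser refine the task text.
            task = revise(task, steps)
        else:
            break  # abandon the chain if revisions keep failing
    return {
        "description": summarize(subtasks),
        "difficulty": len(subtasks),
        "trajectory": trajectory,
    }
</code>

Because a failed revision cycle ends the chain early, every emitted task consists only of verified subtasks, mirroring the verifier-gated composition described above.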
===== Code Example =====

Simplified agentic synthetic data generation pipeline:

<code python>
from dataclasses import dataclass

@dataclass
class SubTask:
    description: str
    tools_required: list[str]
    trajectory: list[dict]
    verified: bool = False

class AgentSynthPipeline:
    def __init__(self, llm_client, environment):
        self.llm = llm_client
        self.env = environment

    def propose_subtask(self, context: dict) -> SubTask:
        # Task Proposer: draft a simple task from the available tools.
        prompt = f"Propose a simple computer task using: {context['available_tools']}"
        response = self.llm.generate(prompt)
        return SubTask(
            description=response.text,
            tools_required=response.tools,
            trajectory=[]
        )

    def execute_subtask(self, subtask: SubTask) -> SubTask:
        # Task Executor: roll out the task step by step in the environment.
        trajectory = []
        state = self.env.reset()
        for step in range(self.env.max_steps):
            action = self.llm.generate(
                f"Task: {subtask.description}\nState: {state}\nNext action:"
            )
            state, done = self.env.step(action)
            trajectory.append({"state": state, "action": action})
            if done:
                break
        subtask.trajectory = trajectory
        return subtask

    def verify_subtask(self, subtask: SubTask) -> bool:
        # Task Verifier: ask the LLM whether the trajectory completed the task.
        verification = self.llm.generate(
            f"Did this trajectory complete the task?\n"
            f"Task: {subtask.description}\n"
            f"Trajectory: {subtask.trajectory}"
        )
        subtask.verified = verification.text.lower().startswith("yes")
        return subtask.verified

    def compose_tasks(self, subtasks: list[SubTask], difficulty: int) -> dict:
        # Difficulty modulation: chain `difficulty` verified subtasks
        # into one long-horizon task.
        selected = subtasks[:difficulty]
        composed_description = " Then, ".join(s.description for s in selected)
        composed_trajectory = []
        for s in selected:
            composed_trajectory.extend(s.trajectory)
        return {
            "description": composed_description,
            "difficulty": difficulty,
            "trajectory": composed_trajectory,
            "num_subtasks": len(selected)
        }
</code>

===== Cost Efficiency =====

AgentSynth achieves an average cost of **$0.60 per trajectory**, orders of magnitude cheaper than human annotation(([[https://arxiv.org/abs/2506.14205|arXiv:2506.14205 — AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents]])). Over 6,000 diverse and realistic tasks were generated using the pipeline, integrated with the OSWorld environment for authentic computer tool interactions(([[https://github.com/sunblaze-ucb/AgentSynth|AgentSynth GitHub Repository]]))(([[https://iclr.cc/virtual/2026/poster/10010827|ICLR 2026 Poster — AgentSynth]])).

===== Broader Ecosystem =====

Other frameworks complement AgentSynth in the synthetic data generation space:

  * **[[nvidia|NVIDIA]] NeMo** — Infrastructure for configuring seed datasets, column structures, and LLM-prompted generation with quality evaluation
  * **Tonic Fabricate** — Conversational agentic interface for natural-language dataset specification with real-time generation
  * **Schema-Aware Generation** — Maintaining referential integrity across related tables while generating statistically consistent synthetic records

===== See Also =====

  * [[agentic_data_engineering|Agentic Data Engineering]]
  * [[synthetic_training_data|Synthetic Training Data]]
  * [[data_science_agents|Data Science Agents: DatawiseAgent]]
  * [[agentic_software|Agentic Software]]
  * [[databricks_week_of_agents|Databricks Week of Agents]]

===== References =====