Image Editing Agents

LLM-powered agents for image editing orchestrate diffusion models through multimodal reasoning, enabling iterative, dialogue-driven editing workflows that maintain coherence across multi-turn interactions.¹⁾²⁾

Overview

Traditional image editing with diffusion models requires users to craft precise prompts for each edit. LLM agents transform this into a conversational process where the model reasons about user intent, decomposes complex edits into sequential operations, and orchestrates diffusion model sampling through attention modulation. MIRA introduces multimodal iterative reasoning for image editing, while Talk2Image deploys a multi-agent architecture to prevent intention drift in multi-turn editing sessions.

MIRA: Multimodal Iterative Reasoning Agent

MIRA (Multimodal Iterative Reasoning Agent) applies agentic reasoning to image editing by combining visual understanding with language-guided edit planning. The system processes user instructions through an iterative loop of:

Instruction parsing: Understanding the user's editing intent from natural language
Visual grounding: Identifying relevant regions and objects in the current image
Edit planning: Decomposing complex edits into atomic operations
Execution: Guiding diffusion model sampling with attention constraints
Verification: Evaluating output quality and re-planning if needed

Talk2Image: Multi-Agent Dialogue Editing

Talk2Image employs four specialized agents to handle multi-turn image editing:

Intention Parser Agent: Processes dialogue history to extract coherent editing objectives
Task Decomposer Agent: Breaks edits into ordered sub-operations
Execution Agents: Specialized agents for different edit types (removal, insertion, style transfer)
Multi-View Evaluator: Assesses output against multiple criteria to provide quality feedback
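The division of labor above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not Talk2Image's implementation: the keyword table, the `SubTask` type, and the function names are hypothetical, the LLM-backed agents are replaced by string heuristics, and the "image" is modeled as a list of applied edits.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubTask:
    kind: str    # "removal", "insertion", or "style_transfer"
    target: str  # what the sub-operation applies to

# Hypothetical keyword table standing in for LLM-based classification.
KEYWORDS = {"remove": "removal", "add": "insertion", "make": "style_transfer"}

def parse_intention(dialogue: list[str]) -> str:
    """Intention Parser stand-in: merge all turns into one objective."""
    return " then ".join(turn.strip().lower() for turn in dialogue)

def decompose(objective: str) -> list[SubTask]:
    """Task Decomposer stand-in: one ordered sub-task per clause."""
    tasks = []
    for clause in objective.split(" then "):
        verb, _, target = clause.partition(" ")
        if verb not in KEYWORDS:
            raise ValueError(f"unrecognized edit verb: {verb!r}")
        tasks.append(SubTask(KEYWORDS[verb], target))
    return tasks

# Execution agents, one per edit type; each appends a record of its edit.
EXECUTORS: dict[str, Callable[[list[str], SubTask], list[str]]] = {
    kind: (lambda image, task: image + [f"{task.kind}:{task.target}"])
    for kind in ("removal", "insertion", "style_transfer")
}

def evaluate(image: list[str], tasks: list[SubTask]) -> float:
    """Multi-View Evaluator stand-in: average of two toy criteria."""
    completeness = len(image) / len(tasks)
    order_ok = float(image == [f"{t.kind}:{t.target}" for t in tasks])
    return (completeness + order_ok) / 2

def run_session(dialogue: list[str]) -> tuple[list[str], float]:
    """Run parser -> decomposer -> executors -> evaluator over a dialogue."""
    tasks = decompose(parse_intention(dialogue))
    image: list[str] = []
    for task in tasks:
        image = EXECUTORS[task.kind](image, task)
    return image, evaluate(image, tasks)

image, score = run_session(["Remove the car", "Add a red bicycle"])
```

Routing through a dictionary keyed by edit type keeps the execution agents independent of one another, so a new edit type only needs a new entry rather than a change to the control flow.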

Diffusion Model Orchestration

The LLM orchestrates diffusion models by generating structured edit operations. For a scene, the LLM outputs the objects to remove and the objects to generate, along with masks $m_i^{rm}$ for removal and bounding boxes $b_i^{gen}$ for generation.
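One way to consume such structured output is to have the LLM emit JSON and validate it before any diffusion call. The sketch below assumes a hypothetical schema (`op`, `label`, `bbox` fields) that is not taken from either paper. As a simplification, it rasterizes a removal mask from a box, whereas a real system would more likely obtain masks from a segmentation model.

```python
import json
from dataclasses import dataclass

@dataclass
class EditOp:
    op_type: str                     # "remove" or "insert"
    label: str                       # object name from the LLM plan
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1/y1 exclusive

def parse_plan(plan_json: str, width: int, height: int) -> list[EditOp]:
    """Parse a JSON edit plan and clamp every box to the image bounds."""
    ops = []
    for item in json.loads(plan_json):
        if item["op"] not in ("remove", "insert"):
            raise ValueError(f"unknown op: {item['op']!r}")
        x0, y0, x1, y1 = item["bbox"]
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        if x0 >= x1 or y0 >= y1:
            raise ValueError(f"empty box for {item['label']!r}")
        ops.append(EditOp(item["op"], item["label"], (x0, y0, x1, y1)))
    return ops

def box_to_mask(bbox: tuple[int, int, int, int],
                width: int, height: int) -> list[list[int]]:
    """Rasterize a box into a binary mask (1 = inside the box)."""
    x0, y0, x1, y1 = bbox
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
            for y in range(height)]

plan = ('[{"op": "remove", "label": "car", "bbox": [1, 0, 3, 2]},'
        ' {"op": "insert", "label": "bicycle", "bbox": [2, 2, 9, 9]}]')
ops = parse_plan(plan, width=4, height=4)
removal_mask = box_to_mask(ops[0].bbox, width=4, height=4)
```

Validating and clamping at parse time means malformed or out-of-bounds LLM output fails early, before it reaches the sampler.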

Object Removal via Attention Modulation: During diffusion denoising, self-attention is modulated to erase masked regions, operating on the feature maps $X$ and self-attention keys $K_s$.

This ensures unmasked regions seamlessly fill erased areas across all sampling steps.
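A common way to realize this kind of modulation is to exclude masked key positions before the softmax, so that every query draws only on unmasked content. The sketch below shows that generic technique on plain Python lists. It is an assumption about the mechanism, not the exact formula used by the cited systems, and the function name is hypothetical.

```python
import math

def masked_attention(scores: list[list[float]],
                     erased: list[bool]) -> list[list[float]]:
    """Softmax each row of `scores` over keys, skipping erased key positions.

    scores[i][j] is the similarity of query i to key j.
    erased[j] is True where key j lies inside the removal mask.
    """
    if all(erased):
        raise ValueError("at least one key must remain unmasked")
    weights = []
    for row in scores:
        # Subtract the largest kept score for numerical stability.
        peak = max(s for s, e in zip(row, erased) if not e)
        exps = [0.0 if e else math.exp(s - peak) for s, e in zip(row, erased)]
        total = sum(exps)
        weights.append([x / total for x in exps])
    return weights

# Key 1 is inside the removal mask, so it receives zero weight in every row.
weights = masked_attention([[1.0, 5.0, 1.0], [0.0, 9.0, 2.0]],
                           erased=[False, True, False])
```

Masking before normalization, rather than zeroing weights afterwards, keeps each row a valid probability distribution even when raw scores are negative.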

Object Insertion: Objects are pre-generated with spatial attention enhancement from noise level $t = 1.0$ down to $T_n = 0.6$, then blended with the background using multi-sampler coordination up to $T_m = 0.8$.

The iterative denoising process is conditioned on $c_{edit}$, which encodes the LLM-generated editing instructions.
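For reference, if a standard DDPM sampler is assumed (the cited systems may use a different sampler), a single denoising step conditioned on $c_{edit}$ takes the form:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c_{edit})\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

Here $\epsilon_\theta$ is the noise-prediction network, $\alpha_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$ come from the noise schedule, and $\sigma_t$ is the per-step noise scale. These symbols are standard DDPM notation rather than notation taken from the source.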

Code Example

from dataclasses import dataclass
import torch
 
@dataclass
class EditOperation:
    op_type: str  # "remove", "insert", "modify"
    region: torch.Tensor  # mask or bounding box
    description: str
 
class ImageEditAgent:
    def __init__(self, llm, diffusion_model, vlm):
        self.llm = llm
        self.diffusion = diffusion_model
        self.vlm = vlm
 
    def parse_intent(self, dialogue_history: list[str],
                     current_image: torch.Tensor) -> list[EditOperation]:
        scene_description = self.vlm.describe(current_image)
        plan = self.llm.generate(
            f"Dialogue: {dialogue_history}\n"
            f"Scene: {scene_description}\n"
            f"Plan sequential edit operations:"
        )
        return self.parse_operations(plan, current_image)
 
    def execute_removal(self, image: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
        # Hook that suppresses attention to keys inside the removal mask.
        def attn_mod(q, k):
            return self.modulate_attention(q, k, mask)

        # `n_steps` and `denoise_step` are assumed to be provided by the
        # diffusion wrapper passed to the constructor.
        for t in reversed(range(self.diffusion.n_steps)):
            image = self.diffusion.denoise_step(
                image, t, attention_hook=attn_mod
            )
        return image
 
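    def modulate_attention(self, queries, keys, mask):
        # Scaled dot-product scores between every query and key position.
        scores = queries @ keys.transpose(-2, -1) / queries.shape[-1] ** 0.5
        # Exclude keys inside the removal mask, then normalize with softmax
        # so each query attends only to unmasked positions.
        scores = scores.masked_fill(mask.flatten().bool(), float("-inf"))
        return torch.softmax(scores, dim=-1)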
 
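    def iterative_edit(self, image, instructions, max_rounds=3):
        # `parse_operations` and `execute_insertion` are not defined in this
        # listing and must be supplied by the implementer.
        for _ in range(max_rounds):
            ops = self.parse_intent(instructions, image)
            for op in ops:
                if op.op_type == "remove":
                    image = self.execute_removal(image, op.region)
                elif op.op_type == "insert":
                    image = self.execute_insertion(image, op)
                else:
                    # "modify" is declared on EditOperation but has no executor here.
                    raise NotImplementedError(f"unsupported op_type: {op.op_type}")
            # Stop early once the VLM judges the result good enough.
            quality = self.vlm.evaluate(image, instructions[-1])
            if quality > 0.85:
                break
        return image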

Architecture

graph TD
    A[User Instruction] --> B[Dialogue History]
    B --> C[Intention Parser Agent]
    C --> D[Task Decomposer]
    D --> E[Edit Operation Queue]
    E --> F{Operation Type}
    F -->|Remove| G[Attention Modulation]
    F -->|Insert| H[Spatial Attention Enhancement]
    F -->|Modify| I[Conditional Denoising]
    G --> J[Diffusion Model]
    H --> J
    I --> J
    J --> K[Edited Image]
    K --> L[Multi-View Evaluator]
    L --> M{Quality Threshold?}
    M -->|Pass| N[Output to User]
    M -->|Fail| O[Feedback to Decomposer]
    O --> D
    N --> P[Update Dialogue History]
    P --> B

References

¹⁾ "MIRA: Multimodal Iterative Reasoning Agent for Image Editing" (2025)

²⁾ "Talk2Image: Multi-Agent Dialogue-Driven Image Editing" (2025)
