Image Editing Agents

LLM-powered agents for image editing orchestrate diffusion models through multimodal reasoning, enabling iterative, dialogue-driven editing workflows that maintain coherence across multi-turn interactions.1)2)

Overview

Traditional image editing with diffusion models requires users to craft precise prompts for each edit. LLM agents transform this into a conversational process where the model reasons about user intent, decomposes complex edits into sequential operations, and orchestrates diffusion model sampling through attention modulation. MIRA introduces multimodal iterative reasoning for image editing, while Talk2Image deploys a multi-agent architecture to prevent intention drift in multi-turn editing sessions.

MIRA: Multimodal Iterative Reasoning Agent

MIRA (Multimodal Iterative Reasoning Agent) applies agentic reasoning to image editing by combining visual understanding with language-guided edit planning. The system processes user instructions through an iterative loop of:

Talk2Image: Multi-Agent Dialogue Editing

Talk2Image employs four specialized agents to handle multi-turn image editing:

Diffusion Model Orchestration

The LLM orchestrates diffusion models by generating structured edit operations. For a scene, the LLM outputs:

<latex>O^{rm} = \{o_i^{rm}\}_{i=1}^{n^{rm}}, \quad O^{gen} = \{o_i^{gen}\}_{i=1}^{n^{gen}}</latex>

along with masks $m_i^{rm}$ for removal and bounding boxes $b_i^{gen}$ for generation.
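As a rough illustration, the two operation sets and their spatial annotations could be held in typed containers like the ones below. The class names (`RemovalOp`, `GenerationOp`, `EditPlan`), the `parse_plan` helper, and the JSON layout are all assumptions made for this sketch; the source does not specify an output format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RemovalOp:
    """One element o_i^rm with its mask m_i^rm (hypothetical layout)."""
    target: str
    mask: list[list[int]]  # binary mask, 1 = pixel to erase


@dataclass
class GenerationOp:
    """One element o_i^gen with its bounding box b_i^gen (hypothetical layout)."""
    target: str
    bbox: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class EditPlan:
    removals: list[RemovalOp] = field(default_factory=list)        # O^rm
    generations: list[GenerationOp] = field(default_factory=list)  # O^gen


def parse_plan(llm_output: str) -> EditPlan:
    """Parse an assumed JSON plan emitted by the LLM into typed operations."""
    raw = json.loads(llm_output)
    plan = EditPlan()
    for item in raw.get("remove", []):
        plan.removals.append(RemovalOp(item["target"], item["mask"]))
    for item in raw.get("generate", []):
        x0, y0, x1, y1 = item["bbox"]
        # Reject boxes with zero or negative area before they reach the sampler.
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"degenerate bounding box for {item['target']!r}")
        plan.generations.append(GenerationOp(item["target"], (x0, y0, x1, y1)))
    return plan
```

Validating the plan at parse time keeps malformed regions out of the diffusion stage, where they would be harder to diagnose.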

Object Removal via Attention Modulation: During diffusion denoising, self-attention is modulated to erase masked regions. For feature maps $X$ and keys $K_s$:

<latex>\text{Attn}(Q, K_s)[j] = 0 \quad \text{if} \quad M_{rm}[j] = 1</latex>

Because no query can attend to a masked position, erased areas are filled using only content from unmasked regions, and the constraint is applied at every sampling step.
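One common way to obtain exactly zero attention on masked keys is to set their logits to negative infinity before the softmax, so the remaining weights still sum to one. The following dependency-free sketch shows this for a single query; the function name and the list-based inputs are illustrative assumptions, and the source does not state that the systems implement the masking this way.

```python
import math


def masked_attention_weights(query, keys, removal_mask):
    """Softmax attention weights of one query over keys, skipping masked keys.

    query: list of floats; keys: list of float lists; removal_mask: list of
    0/1 flags, where 1 marks a key position inside the region to erase.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        -math.inf if masked else scale * sum(q * k for q, k in zip(query, key))
        for key, masked in zip(keys, removal_mask)
    ]
    finite = [v for v in logits if v != -math.inf]
    if not finite:
        raise ValueError("every key is masked; nothing left to attend to")
    peak = max(finite)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_attention_weights(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    removal_mask=[0, 1, 0],
)
```

Zeroing the weights after normalization instead would leave each row summing to less than one, which is why the mask is applied to the logits.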

Object Insertion: Objects are pre-generated with spatial attention enhancement from noise level $t = 1.0$ to $T_n = 0.6$, then blended with background using multi-sampler coordination up to $T_m = 0.8$.

The iterative denoising process follows:

<latex>x_{t-1} \sim p_\theta(x_{t-1} | x_t, c_{edit})</latex>

where $c_{edit}$ encodes the LLM-generated editing instructions.
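To make the sampling recursion concrete, here is a one-dimensional toy in which each step draws the next state from a Gaussian whose mean depends on the current state, the step index, and the condition. The mean and noise schedules are invented stand-ins for a learned $p_\theta$, and `toy_reverse_process` is a hypothetical name; nothing here reflects the actual samplers used by the systems described.

```python
import random


def toy_reverse_process(x_T, c_edit, n_steps=50, seed=0):
    """Sample x_{t-1} ~ N(mu(x_t, t, c_edit), sigma_t^2) for t = n_steps..1.

    Returns the full trajectory [x_T, ..., x_0]. Both mu and sigma_t are toy
    choices for illustration, not a trained model.
    """
    rng = random.Random(seed)
    x = x_T
    trajectory = [x]
    for t in range(n_steps, 0, -1):
        mean = x + (c_edit - x) / t      # toy mean: move 1/t of the way to the condition
        sigma = 0.1 * (t - 1) / n_steps  # noise shrinks to zero at the final step
        x = rng.gauss(mean, sigma)
        trajectory.append(x)
    return trajectory


path = toy_reverse_process(x_T=5.0, c_edit=-2.0)
```

Intermediate states differ from seed to seed, while the conditioning term steers every run toward the same target; in a real editor the condition would be an embedding of the edit instruction rather than a scalar.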

Code Example

from dataclasses import dataclass
import torch
 
@dataclass
class EditOperation:
    op_type: str  # "remove", "insert", "modify"
    region: torch.Tensor  # mask or bounding box
    description: str
 
class ImageEditAgent:
    def __init__(self, llm, diffusion_model, vlm):
        self.llm = llm
        self.diffusion = diffusion_model
        self.vlm = vlm
 
    def parse_intent(self, dialogue_history: list[str],
                     current_image: torch.Tensor) -> list[EditOperation]:
        scene_description = self.vlm.describe(current_image)
        plan = self.llm.generate(
            f"Dialogue: {dialogue_history}\n"
            f"Scene: {scene_description}\n"
            f"Plan sequential edit operations:"
        )
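        # parse_operations is not shown here; it is assumed to turn the plan
        # text into EditOperation objects with masks or boxes for current_image.
        return self.parse_operations(plan, current_image)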
 
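    def execute_removal(self, image: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
        # Bind the mask once; the hook blocks attention to masked keys.
        def attn_mod(q, k):
            return self.modulate_attention(q, k, mask)

        for t in reversed(range(self.diffusion.n_steps)):
            image = self.diffusion.denoise_step(
                image, t, attention_hook=attn_mod
            )
        return image
 
    def modulate_attention(self, queries, keys, mask):
        # Scaled dot-product logits over key positions.
        logits = torch.matmul(queries, keys.transpose(-2, -1))
        logits = logits / keys.shape[-1] ** 0.5
        # Masked keys get -inf so softmax assigns them exactly zero weight
        # while the remaining weights in each row still sum to one.
        logits = logits.masked_fill(mask.flatten().bool(), float("-inf"))
        return torch.softmax(logits, dim=-1)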
 
    def iterative_edit(self, image, instructions, max_rounds=3):
        for round_idx in range(max_rounds):
            ops = self.parse_intent(instructions, image)
            for op in ops:
                if op.op_type == "remove":
                    image = self.execute_removal(image, op.region)
                elif op.op_type == "insert":
                    # execute_insertion is not shown here.
                    image = self.execute_insertion(image, op)
                # "modify" operations are not handled in this sketch.
            quality = self.vlm.evaluate(image, instructions[-1])
            if quality > 0.85:
                break
        return image

Architecture

graph TD
    A[User Instruction] --> B[Dialogue History]
    B --> C[Intention Parser Agent]
    C --> D[Task Decomposer]
    D --> E[Edit Operation Queue]
    E --> F{Operation Type}
    F -->|Remove| G[Attention Modulation]
    F -->|Insert| H[Spatial Attention Enhancement]
    F -->|Modify| I[Conditional Denoising]
    G --> J[Diffusion Model]
    H --> J
    I --> J
    J --> K[Edited Image]
    K --> L[Multi-View Evaluator]
    L --> M{Quality Threshold?}
    M -->|Pass| N[Output to User]
    M -->|Fail| O[Feedback to Decomposer]
    O --> D
    N --> P[Update Dialogue History]
    P --> B
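The feedback path in the diagram (a failed evaluation returning to the decomposer) can be sketched as a small control loop. The function name, the callable signatures, the 0.85 threshold, and the choice to retry from the original image on each round are assumptions made for illustration; the diagram does not fix any of them.

```python
def run_edit_loop(image, instruction, history, decompose, execute, evaluate,
                  threshold=0.85, max_rounds=3):
    """Decompose, execute, and evaluate; pass evaluator feedback back on failure.

    decompose(history, feedback) -> list of operations
    execute(image, operations)   -> edited image
    evaluate(image, instruction) -> (score, feedback)
    Returns (image, updated_history, passed).
    """
    history = [*history, instruction]  # update dialogue history without mutating input
    feedback = None
    candidate = image
    for _ in range(max_rounds):
        operations = decompose(history, feedback)
        candidate = execute(image, operations)  # assumed: each retry starts from the original
        score, feedback = evaluate(candidate, instruction)
        if score >= threshold:
            return candidate, history, True
    return candidate, history, False
```

Injecting the three stages as callables keeps the loop independent of any particular LLM, diffusion backend, or evaluator, so each stage can be stubbed out in tests.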

See Also

References