LLM-powered agents for image editing orchestrate diffusion models through multimodal reasoning, enabling iterative, dialogue-driven editing workflows that maintain coherence across multi-turn interactions.1)2)
Traditional image editing with diffusion models requires users to craft precise prompts for each edit. LLM agents transform this into a conversational process where the model reasons about user intent, decomposes complex edits into sequential operations, and orchestrates diffusion model sampling through attention modulation. MIRA introduces multimodal iterative reasoning for image editing, while Talk2Image deploys a multi-agent architecture to prevent intention drift in multi-turn editing sessions.
MIRA (Multimodal Iterative Reasoning Agent) applies agentic reasoning to image editing by combining visual understanding with language-guided edit planning. The system processes user instructions through an iterative loop of visual grounding, edit planning, edit execution, and self-evaluation, repeating until the edit satisfies the user's intent.
Talk2Image employs four specialized agents to handle multi-turn image editing, coordinating their roles so that the user's original intention is preserved across successive edits rather than drifting turn by turn.
The LLM orchestrates diffusion models by generating structured edit operations. For a scene, the LLM outputs:
<latex>O^{rm} = \{o_i^{rm}\}_{i=1}^{n^{rm}}, \quad O^{gen} = \{o_i^{gen}\}_{i=1}^{n^{gen}}</latex>
along with masks $m_i^{rm}$ for removal and bounding boxes $b_i^{gen}$ for generation.
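The structured output above can be modeled as a small schema that the LLM populates and the executor consumes. A minimal sketch (the class and field names are illustrative, not taken from either paper):

```python
from dataclasses import dataclass

@dataclass
class RemovalOp:
    # o_i^rm: an object to erase, with its segmentation mask m_i^rm
    label: str
    mask: list  # binary mask over the image grid (rows of 0/1)

@dataclass
class GenerationOp:
    # o_i^gen: an object to synthesize, placed via bounding box b_i^gen
    label: str
    bbox: tuple  # (x0, y0, x1, y1) in pixel coordinates

@dataclass
class EditPlan:
    # O^rm and O^gen from the equation above
    removals: list
    generations: list

# Example plan for "remove the lamp and add a plant"
plan = EditPlan(
    removals=[RemovalOp("lamp", [[0, 1], [0, 1]])],
    generations=[GenerationOp("plant", (10, 20, 80, 120))],
)
```

Keeping removals and generations in separate lists mirrors the two operation sets $O^{rm}$ and $O^{gen}$, so each set can be dispatched to its own execution path.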
Object Removal via Attention Modulation: During diffusion denoising, self-attention is modulated to erase masked regions. For queries $Q$ and keys $K_s$ computed from the feature maps $X$:
<latex>\text{Attn}(Q, K_s)[j] = 0 \quad \text{if} \quad M_{rm}[j] = 1</latex>
This ensures unmasked regions seamlessly fill erased areas across all sampling steps.
Object Insertion: Objects are pre-generated with spatial attention enhancement from noise level $t = 1.0$ to $T_n = 0.6$, then blended with background using multi-sampler coordination up to $T_m = 0.8$.
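The two-stage insertion can be pictured as: denoise the object latent on its own down to noise level $T_n$, then composite it into the background latent inside the bounding-box mask while the noise level is still above $T_m$, after which the main sampler denoises the composite freely. A hedged sketch with toy tensors (the `blend_latents` helper and its interface are assumptions; only the thresholds follow the values above):

```python
import torch

def blend_latents(obj_latent, bg_latent, box_mask, t, T_m=0.8):
    # While the noise level t is still above T_m, paste the pre-generated
    # object latent into the background inside the box mask; below T_m,
    # return the background unchanged so the sampler harmonizes the seam.
    if t >= T_m:
        return box_mask * obj_latent + (1 - box_mask) * bg_latent
    return bg_latent

# Toy latents: object latent of ones, empty background, box in the center.
obj = torch.ones(1, 4, 8, 8)
bg = torch.zeros(1, 4, 8, 8)
mask = torch.zeros(1, 1, 8, 8)
mask[..., 2:6, 2:6] = 1.0

blended = blend_latents(obj, bg, mask, t=0.9)  # t = 0.9 >= T_m, so blend
```

Blending only at high noise levels lets the remaining denoising steps smooth the boundary between object and background instead of leaving a hard paste edge.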
The iterative denoising process follows:
<latex>x_{t-1} \sim p_\theta(x_{t-1} \mid x_t, c_{edit})</latex>
where $c_{edit}$ encodes the LLM-generated editing instructions.
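Schematically, the sampler threads $c_{edit}$ through every denoising step. A minimal sketch, assuming a generic `denoise_step(x, t, c)` interface (the toy denoiser below stands in for a real diffusion model):

```python
import torch

def sample_with_edits(denoise_step, shape, c_edit, n_steps=50):
    # Start from Gaussian noise x_T and iterate x_t -> x_{t-1},
    # conditioning every step on the LLM-generated edit encoding c_edit.
    x = torch.randn(shape)
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t, c_edit)
    return x

# Toy denoiser: each step pulls the sample toward the conditioning signal.
toy_step = lambda x, t, c: 0.9 * x + 0.1 * c
c_edit = torch.zeros(1, 4, 8, 8)
out = sample_with_edits(toy_step, (1, 4, 8, 8), c_edit, n_steps=10)
```

Because $c_{edit}$ is fixed for the whole trajectory, the same conditioning applies at every noise level; per-step hooks such as the attention modulation above can still vary with $t$.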
from dataclasses import dataclass

import torch


@dataclass
class EditOperation:
    op_type: str          # "remove", "insert", "modify"
    region: torch.Tensor  # mask or bounding box
    description: str


class ImageEditAgent:
    def __init__(self, llm, diffusion_model, vlm):
        self.llm = llm
        self.diffusion = diffusion_model
        self.vlm = vlm

    def parse_intent(self, dialogue_history: list[str],
                     current_image: torch.Tensor) -> list[EditOperation]:
        # Ground the dialogue in the current scene, then ask the LLM
        # for a sequential edit plan.
        scene_description = self.vlm.describe(current_image)
        plan = self.llm.generate(
            f"Dialogue: {dialogue_history}\n"
            f"Scene: {scene_description}\n"
            f"Plan sequential edit operations:"
        )
        return self.parse_operations(plan, current_image)

    def execute_removal(self, image: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
        # Denoise with a hook that zeroes attention into masked keys,
        # so unmasked context fills the erased region.
        for t in reversed(range(self.diffusion.n_steps)):
            attn_mod = lambda q, k: self.modulate_attention(q, k, mask)
            image = self.diffusion.denoise_step(
                image, t, attention_hook=attn_mod
            )
        return image

    def modulate_attention(self, queries, keys, mask):
        # Set masked logits to -inf before softmax so masked positions
        # receive exactly zero attention and the rest renormalize.
        logits = torch.matmul(queries, keys.T)
        logits[:, mask.flatten().bool()] = float("-inf")
        return torch.softmax(logits, dim=-1)

    def iterative_edit(self, image, instructions, max_rounds=3):
        for round_idx in range(max_rounds):
            ops = self.parse_intent(instructions, image)
            for op in ops:
                if op.op_type == "remove":
                    image = self.execute_removal(image, op.region)
                elif op.op_type == "insert":
                    image = self.execute_insertion(image, op)
            # Stop early once the VLM judges the latest instruction satisfied.
            quality = self.vlm.evaluate(image, instructions[-1])
            if quality > 0.85:
                break
        return image