====== Multimodal Agent Architectures ======
Multimodal agent architectures enable AI agents to process text, images, audio, and video within a unified perception-reasoning-action loop. Unlike simple "vision agents" that bolt image understanding onto text-only systems, true multimodal agents fuse information across modalities at the architectural level, enabling cross-modal reasoning, omni-modal planning, and tool use that spans sensory domains.
<code>
graph TD
  T[Text Input] --> ENC[Encoders]
  I[Image Input] --> ENC
  AU[Audio Input] --> ENC
  ENC --> UR[Unified Representation]
  UR --> R[Reasoning Module]
  R --> TS[Tool Selection]
  TS --> MO[Multimodal Output]
</code>
===== Beyond Vision Agents =====
First-generation multimodal agents (e.g., early GPT-4V integrations) treated vision as an add-on: capture a screenshot, describe it in text, then reason over the description. This "late extraction" approach loses critical information:
* **Temporal dynamics:** Video requires understanding motion, sequence, and change over time
* **Audio context:** Tone, music, environmental sounds carry semantic meaning text cannot capture
* **Cross-modal correlation:** The relationship //between// modalities (lip sync, gesture-speech alignment) is lost when modalities are processed independently
* **Real-time interaction:** Sequential processing of modalities introduces unacceptable latency
True multimodal agents process all modalities simultaneously in a unified loop, maintaining cross-modal attention throughout reasoning and planning.
===== Unified Agent Loop Architecture =====
A multimodal agent loop consists of three synchronized phases:
**1. Perception Phase:**
* Modality-specific encoders convert raw data into vector representations
* Vision Transformer (ViT) processes image patches and video frames
* Audio encoder (Whisper-like) processes spectrograms
* Text tokenizer handles language input
* All outputs are projected into a shared embedding space
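The perception phase above can be sketched with plain NumPy: each modality's encoder produces tokens of a different width, and a per-modality linear projection maps them all into one shared embedding space. The dimensions and the random "encoder outputs" are illustrative, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SHARED = 64  # shared embedding dimension (illustrative)

# Pretend encoder outputs: (num_tokens, encoder_dim) per modality.
encoded = {
    "text":  rng.normal(size=(8, 512)),   # tokenizer + text encoder
    "image": rng.normal(size=(16, 768)),  # ViT patch embeddings
    "audio": rng.normal(size=(12, 384)),  # spectrogram encoder frames
}

# One learned linear projection per modality maps into the shared space.
projections = {m: rng.normal(size=(h.shape[1], D_SHARED)) * 0.02
               for m, h in encoded.items()}

# Concatenate into a single unified token sequence for the backbone.
shared_tokens = np.concatenate(
    [encoded[m] @ projections[m] for m in encoded], axis=0)

print(shared_tokens.shape)  # (36, 64)
```

After projection, the backbone no longer needs to know which modality a token came from; position and modality-type embeddings (omitted here) usually carry that information.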
**2. Fusion and Reasoning Phase:**
* A transformer backbone performs cross-modal attention
* Tokens from all modalities attend to each other
* The model reasons holistically (e.g., linking a video scene, its audio, and a text query)
**3. Action Phase:**
* The agent generates text responses, tool calls, or multimodal outputs
* Actions can trigger new perceptions, forming the agent loop
* Planning operates across modalities
$$\mathbf{h}_{fused} = \text{CrossAttention}(\mathbf{h}_{text}, \mathbf{h}_{image}, \mathbf{h}_{audio}, \mathbf{h}_{video})$$
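A minimal single-head sketch of this fusion step: concatenate the modality token sequences and run scaled dot-product attention over the whole sequence, so every token can attend to tokens from every other modality. Real backbones use multi-head attention with learned query/key/value projections; this strips those out to show the cross-modal mixing itself.

```python
import numpy as np

def cross_modal_attention(*modality_tokens):
    """Single-head scaled dot-product attention over the concatenation
    of all modality token sequences: every token attends to every other
    token regardless of modality."""
    h = np.concatenate(modality_tokens, axis=0)    # (T_total, d)
    d = h.shape[1]
    scores = h @ h.T / np.sqrt(d)                  # (T_total, T_total)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ h                             # fused representation

rng = np.random.default_rng(1)
h_text, h_image, h_audio = (rng.normal(size=(n, 32)) for n in (4, 9, 6))
fused = cross_modal_attention(h_text, h_image, h_audio)
print(fused.shape)  # (19, 32)
```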
===== Early Fusion vs. Late Fusion =====
^ Aspect ^ Early Fusion ^ Late Fusion ^
| Mechanism | Modalities tokenized jointly before backbone | Separate encoders, outputs merged post-reasoning |
| Cross-modal depth | Deep (attention across all modalities) | Shallow (combination at decision layer) |
| Latency | Higher per-step (larger attention matrix) | Lower per-modality, higher for cross-modal |
| Best for | Real-time unified reasoning (GPT-4o, Gemini) | Modular systems, optional modalities |
| Example | GPT-4o native omni-modal processing | LangGraph with separate vision/audio tools |
Early fusion dominates native multimodal LLMs. Late fusion suits hybrid agent systems where modalities are optional or processed by specialized tools.
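For contrast with the early-fusion attention sketch, here is a minimal late-fusion pattern: independent modality tools run separately, and their outputs meet only at a final decision layer. The tool names and their outputs are invented for illustration; a real system would hand the merged evidence to an LLM as structured context.

```python
# Hypothetical per-modality tools; names and outputs are illustrative.
def vision_tool(image):
    return {"objects": ["chart", "axis labels"]}

def audio_tool(clip):
    return {"transcript": "revenue grew 12 percent"}

def late_fusion_answer(question, image=None, audio=None):
    """Late fusion: run independent modality tools, then combine their
    outputs only at the final decision step (no cross-modal attention)."""
    evidence = {}
    if image is not None:
        evidence["vision"] = vision_tool(image)
    if audio is not None:
        evidence["audio"] = audio_tool(audio)
    # Decision layer: here just a structured merge of the evidence.
    return {"question": question, "evidence": evidence}

result = late_fusion_answer("What does the chart say?",
                            image=b"...", audio=b"...")
```

Note the trade-off the table describes: each tool is simple and independently swappable, but nothing here can correlate the chart's visual content with the spoken numbers until the final merge.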
===== Key Model Architectures =====
**GPT-4o (OpenAI):** Native omni-modal architecture processing text, images, audio, and video through a single end-to-end transformer. All modalities are tokenized uniformly (visual patches become tokens, audio segments become tokens), enabling direct cross-modal reasoning without intermediate text descriptions. Sub-second latency for voice+vision tasks.
**Gemini 2.0/2.5 (Google DeepMind):** Built on Pathways-like infrastructure for scaled multi-modal training. Processes up to 6 images, 120 seconds of video, and audio simultaneously. Gemini 2.5 introduced native agentic capabilities with tool use across modalities. Supports 100+ languages across all modality combinations.
**Claude 3.5/4 (Anthropic):** Strong vision-language capabilities with image and document understanding. Tool use integrates visual analysis with code execution and web search. Audio processing available via tool-based pipelines.
<code python>
# Multimodal agent loop with cross-modal tool use
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModalInput:
    text: Optional[str] = None
    images: list = field(default_factory=list)
    audio: Optional[bytes] = None
    video: Optional[bytes] = None

@dataclass
class AgentAction:
    action_type: str  # "respond", "tool_call", "request_input"
    content: object = None
    tool_name: Optional[str] = None
    tool_args: dict = field(default_factory=dict)

class MultimodalAgent:
    def __init__(self, model_client, tools):
        self.model = model_client  # any client exposing create(messages, tools)
        self.tools = tools         # mapping of tool name -> tool object
        self.memory = []           # running conversation across loop steps

    def perceive(self, inputs: ModalInput) -> dict:
        """Pack all available modalities into a single multimodal message."""
        content = []
        if inputs.text:
            content.append({"type": "text", "text": inputs.text})
        for img in inputs.images:
            content.append({"type": "image", "data": img})
        if inputs.audio:
            content.append({"type": "audio", "data": inputs.audio})
        if inputs.video:
            content.append({"type": "video", "data": inputs.video})
        return {"role": "user", "content": content}

    def reason_and_act(self, perception: dict) -> AgentAction:
        """One reasoning step: the model sees the full memory and either
        answers directly or requests a tool call."""
        self.memory.append(perception)
        response = self.model.create(
            messages=self.memory,
            tools=list(self.tools.values()),
        )
        if response.tool_calls:
            call = response.tool_calls[0]
            return AgentAction("tool_call", tool_name=call.name,
                               tool_args=call.args)
        return AgentAction("respond", content=response.text)

    def run_loop(self, initial_input: ModalInput, max_steps: int = 10):
        """Perception -> reasoning -> action loop; tool results are fed
        back in as new perceptions until the agent responds."""
        perception = self.perceive(initial_input)
        for _ in range(max_steps):
            action = self.reason_and_act(perception)
            if action.action_type == "respond":
                return action.content
            if action.action_type == "tool_call":
                result = self.tools[action.tool_name].execute(action.tool_args)
                perception = self.perceive(
                    ModalInput(text=f"Tool result: {result}"))
        return "Max steps reached"
</code>
===== Cross-Modal Tool Use Patterns =====
Multimodal agents unlock tool use patterns impossible with text-only systems:
* **Visual analysis to code execution:** Analyze a chart image, extract data, run statistical tests
* **Audio-driven search:** Identify a sound or speech segment, search for related information
* **Video summarization pipeline:** Extract key frames, transcribe audio, generate structured summary
* **Document + voice interaction:** Read a PDF, answer voice questions with visual references
* **Environmental sensing:** Combine camera feed, microphone input, and sensor data for robotics
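The video summarization pattern above can be sketched as a chain of pipeline stages. All stage names are placeholders (a real pipeline would call vision and speech models), and for brevity the sketch treats each byte of the input as one frame.

```python
from dataclasses import dataclass, field

@dataclass
class VideoSummary:
    key_frames: list = field(default_factory=list)
    transcript: str = ""
    summary: str = ""

# Placeholder stages; in practice each calls a model or tool.
def extract_key_frames(video: bytes, every_n: int = 30) -> list:
    # Sketch convention: one byte == one frame; sample every Nth frame.
    return [f"frame_{i}" for i in range(0, len(video), every_n)]

def transcribe(video: bytes) -> str:
    return "speaker discusses quarterly results"  # stand-in transcript

def summarize(frames: list, transcript: str) -> str:
    return f"{len(frames)} key frames; audio: {transcript}"

def video_summarization_pipeline(video: bytes) -> VideoSummary:
    frames = extract_key_frames(video)
    transcript = transcribe(video)
    return VideoSummary(frames, transcript, summarize(frames, transcript))

s = video_summarization_pipeline(bytes(120))
print(s.summary)  # 4 key frames; audio: speaker discusses quarterly results
```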
===== Omni-Modal Planning =====
Planning in multimodal agents decomposes tasks across modality-specific subtasks while maintaining cross-modal coherence:
- **Task decomposition:** "Analyze this meeting recording" becomes: extract video key moments, transcribe speech, identify speakers, correlate slides with discussion points
- **Modality-aware routing:** Route visual subtasks to vision-specialized models, audio to speech models, with a coordinator maintaining unified context
- **Cross-modal verification:** Use one modality to verify another (e.g., check if transcribed speech matches on-screen text)
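Modality-aware routing can be sketched as a dispatch table from subtask modality to a specialist handler, with the coordinator accumulating a unified context. The handler names and the plan are illustrative.

```python
# Hypothetical specialist handlers; names are illustrative.
def vision_model(task):
    return f"vision:{task}"

def speech_model(task):
    return f"speech:{task}"

def text_model(task):
    return f"text:{task}"

HANDLERS = {"vision": vision_model, "audio": speech_model, "text": text_model}

def route_subtasks(subtasks):
    """Route each (modality, task) pair to its specialist while the
    coordinator keeps one unified context of all results."""
    context = []
    for modality, task in subtasks:
        context.append(HANDLERS[modality](task))
    return context

# Decomposed plan for "Analyze this meeting recording":
plan = [("vision", "extract key moments"),
        ("audio", "transcribe speech"),
        ("text", "correlate slides with discussion points")]
context = route_subtasks(plan)
```

The shared `context` list is what keeps cross-modal coherence: later subtasks (and the final answer) can condition on earlier results from other modalities.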
===== Production Considerations =====
* **Token cost:** Video and audio modalities consume significantly more tokens than text; apply aggressive caching and compression
* **Latency:** Early fusion increases per-step compute; balance with streaming and progressive rendering
* **Context windows:** Video can exhaust context quickly; use frame sampling and audio chunking strategies
* **Modality fallbacks:** Design graceful degradation when a modality is unavailable or low quality
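The frame-sampling strategy mentioned above can be sketched as uniform sampling under a token budget; the tokens-per-frame figure is an assumption for illustration, not a quote from any provider's pricing.

```python
def sample_frames(total_frames: int, tokens_per_frame: int,
                  token_budget: int) -> list:
    """Uniformly sample as many frames as fit in the token budget.
    Returns the frame indices to keep."""
    max_frames = max(1, token_budget // tokens_per_frame)
    if total_frames <= max_frames:
        return list(range(total_frames))  # everything fits
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

# e.g. 3,600 frames (2 min at 30 fps), an assumed 256 tokens per frame,
# and a 32k-token budget reserved for video:
idx = sample_frames(3600, 256, 32_000)
print(len(idx))  # 125 frames fit
```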
===== References =====
* [[https://arxiv.org/abs/2405.15071|GPT-4o System Card (OpenAI, 2024)]]
* [[https://arxiv.org/abs/2312.11805|Gemini: A Family of Highly Capable Multimodal Models (Google DeepMind, 2024)]]
* [[https://arxiv.org/abs/2601.12560|Agentic AI: Architectures, Taxonomies, and Evaluation of LLM Agents (2026)]]
* [[https://blog.bytebytego.com/p/multimodal-llms-basics-how-llms-process|How Multimodal LLMs Process Different Inputs (ByteByteGo, 2025)]]
* [[https://aws.amazon.com/blogs/machine-learning/build-an-agentic-multimodal-ai-assistant-with-amazon-nova-and-amazon-bedrock-data-automation/|Agentic Multimodal AI with Amazon Nova (AWS, 2025)]]
===== See Also =====
* [[small_language_model_agents]]
* [[agent_cost_optimization]]
* [[collective_agent_behavior]]