PaLM-E is a 562-billion-parameter embodied multimodal language model introduced by Driess et al. (2023) at Google, combining vision, language, and robotic control in a single end-to-end trained model. With 2,457 citations, it is among the most influential works bridging foundation models and robotics, demonstrating that a unified model can perform manipulation, navigation, visual question answering, and language tasks without task-specific fine-tuning.
PaLM-E injects continuous sensor observations directly into a pre-trained language model's embedding space. The largest variant, PaLM-E-562B, combines the 540B-parameter PaLM language model with the 22B-parameter ViT vision encoder.
Multimodal inputs are encoded as token-like embeddings that interleave with text tokens:
$$x = [w_1, \ldots, w_n, \phi(o_1), w_{n+1}, \ldots, \phi(o_k), w_m]$$
where $w_i$ are text token embeddings and $\phi(o_j)$ are projected continuous observations (images, robot states, scene features). The projection function $\phi$ maps observations into the language model's embedding dimension via learned linear layers.
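The interleaving above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the dimensions, the two `nn.Linear` projections standing in for $\phi$, and the random tensors standing in for real token embeddings and observations are all hypothetical.

```python
# Minimal sketch of multimodal sentence construction (illustrative dimensions)
import torch
import torch.nn as nn

embed_dim = 64                 # LLM embedding dimension (hypothetical)
vision_dim, state_dim = 128, 7 # ViT feature dim, robot state dim (hypothetical)

# Learned projections phi mapping each modality into the LLM embedding space
phi_vision = nn.Linear(vision_dim, embed_dim)
phi_state = nn.Linear(state_dim, embed_dim)

# Stand-ins for text token embeddings w_i and continuous observations o_j
text_embeds = torch.randn(5, embed_dim)   # 5 text tokens
image_feats = torch.randn(3, vision_dim)  # 3 visual features from a ViT
robot_state = torch.randn(1, state_dim)   # 1 robot state vector

# Interleave text and projected observations into one input sequence:
# [w_1, w_2, w_3, phi(o_1..o_3), w_4, phi(o_4), w_5]
x = torch.cat([
    text_embeds[:3],
    phi_vision(image_feats),
    text_embeds[3:4],
    phi_state(robot_state),
    text_embeds[4:],
], dim=0)
print(x.shape)  # torch.Size([9, 64]) -- 5 text + 3 visual + 1 state tokens
```

The language model then consumes `x` exactly as it would a sequence of ordinary text embeddings; only the projection layers are modality-specific.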
The model processes multimodal sentences with text and sensor data interleaved in arbitrary order, generating text outputs that serve as high-level plans for robotic control.
```python
# Simplified PaLM-E multimodal input construction
import torch
import torch.nn as nn

class PaLME(nn.Module):
    def __init__(self, palm_model, vit_encoder, state_dim):
        super().__init__()
        self.palm = palm_model
        self.vit = vit_encoder
        # MLP projecting the low-dimensional robot state into the LLM embedding space
        self.state_mlp = nn.Sequential(
            nn.Linear(state_dim, palm_model.embed_dim),
            nn.ReLU(),
            nn.Linear(palm_model.embed_dim, palm_model.embed_dim),
        )
        # Linear projection from ViT feature dim to LLM embedding dim
        self.projector = nn.Linear(vit_encoder.dim, palm_model.embed_dim)

    def encode_multimodal(self, text, images, robot_state):
        text_tokens = self.palm.tokenize(text)
        text_embeds = self.palm.embed(text_tokens)
        visual_tokens = self.projector(self.vit(images))
        state_tokens = self.state_mlp(robot_state).unsqueeze(1)
        return torch.cat([text_embeds, visual_tokens, state_tokens], dim=1)

    def generate_plan(self, instruction, image, robot_state):
        tokens = self.encode_multimodal(instruction, image, robot_state)
        plan_text = self.palm.generate(inputs_embeds=tokens)
        return parse_actions(plan_text)  # parse generated text into executable actions

    def robotic_control_loop(self, task, environment):
        # Replan from fresh observations after each executed step
        while not environment.task_complete():
            image = environment.get_camera_image()
            state = environment.get_robot_state()
            plan = self.generate_plan(task, image, state)
            environment.execute(plan[0])
```
| Model | LLM | Vision Encoder | Total Parameters |
|---|---|---|---|
| PaLM-E-12B | PaLM-8B | ViT-4B | 12B |
| PaLM-E-84B | PaLM-62B | ViT-22B | 84B |
| PaLM-E-562B | PaLM-540B | ViT-22B | 562B |