====== PaLM-E: An Embodied Multimodal Language Model ======
**PaLM-E** is a **562 billion parameter** embodied multimodal language model introduced by Driess et al. (2023) at **Google**, combining vision, language, and robotic control in a single end-to-end trained model.((https://arxiv.org/abs/2303.03378|Driess et al. "PaLM-E: An Embodied Multimodal Language Model" (2023))) With **2,457 citations**, it is one of the most impactful works bridging foundation models and robotics, demonstrating that a unified model can perform manipulation, navigation, visual question answering, and language tasks without task-specific fine-tuning.
[[https://arxiv.org/abs/2303.03378|arXiv:2303.03378]]
===== Architecture =====
PaLM-E injects continuous sensor observations directly into a pre-trained language model's embedding space. The largest variant combines:((https://arxiv.org/abs/2303.03378|Driess et al. "PaLM-E: An Embodied Multimodal Language Model" (2023)))
* **PaLM-540B**: A 540 billion parameter decoder-only language model((https://arxiv.org/abs/2204.02311|Chowdhery et al. "PaLM: Scaling Language Modeling with Pathways" (2022)))
* **ViT-22B**: A 22 billion parameter Vision Transformer encoder((https://arxiv.org/abs/2302.05442|Dehghani et al. "Scaling Vision Transformers to 22 Billion Parameters" (2023)))
Multimodal inputs are encoded as token-like embeddings that interleave with text tokens:
$$x = [w_1, \ldots, w_n, \phi(o_1), w_{n+1}, \ldots, \phi(o_k), w_m]$$
where $w_i$ are text token embeddings and $\phi(o_j)$ are projected continuous observations (images, robot states, scene features). The projection function $\phi$ maps observations into the language model's embedding dimension via learned linear layers.
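As a minimal sketch of this interleaving, assuming illustratively small dimensions (the real model's embedding width is far larger) and a single learned linear layer for $\phi$:

<code python>
import torch
import torch.nn as nn

# Illustrative sizes, not the real model's
d_obs, d_model = 64, 256

# Learned projection phi mapping a continuous observation
# into the language model's embedding space
phi = nn.Linear(d_obs, d_model)

text_embeds = torch.randn(5, d_model)   # w_1 .. w_5
obs = torch.randn(2, d_obs)             # observations o_1, o_2
obs_embeds = phi(obs)                   # phi(o_1), phi(o_2)

# Interleave: [w_1, w_2, phi(o_1), w_3, w_4, phi(o_2), w_5]
x = torch.cat([text_embeds[:2], obs_embeds[:1],
               text_embeds[2:4], obs_embeds[1:],
               text_embeds[4:]], dim=0)
print(x.shape)  # 7 tokens, each of width d_model
</code>

The language model then consumes $x$ exactly as it would a plain text embedding sequence; only $\phi$ (and optionally the encoders behind it) needs to learn the mapping.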
==== Input Encoders ====
* **ViT** (Vision Transformer): Encodes images into sequences of visual tokens
* **MLP**: Projects robot state vectors (pose, joint angles, gripper state)
* **Object Scene Representations**: Encodes structured 3D scene features
The model processes **multimodal sentences** with text and sensor data in arbitrary order, generating text outputs that serve as high-level plans for robotic control.((https://palm-e.github.io|PaLM-E Project Page))
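A sketch of such a multimodal sentence, using random tensors as stand-ins for the three encoders' outputs (token counts here are illustrative assumptions):

<code python>
import torch

d_model = 256  # illustrative embedding width

# Stand-ins for encoder outputs: in PaLM-E these would come from the
# ViT, the state MLP, and the object scene encoder respectively.
image_tokens = torch.randn(16, d_model)  # visual tokens for one image
state_token = torch.randn(1, d_model)    # one token for the robot state
text_a = torch.randn(4, d_model)         # e.g. "Given <img> and <state>"
text_b = torch.randn(6, d_model)         # e.g. "pick up the green block"

# Sensor tokens may appear anywhere in the sequence, in arbitrary order
sentence = torch.cat([text_a, image_tokens, state_token, text_b], dim=0)
print(sentence.shape)  # 27 tokens total
</code>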
===== System Architecture =====
<code>
graph TD
    A[Camera Images] --> B[ViT-22B Encoder]
    C[Robot State] --> D[MLP Encoder]
    E[Text Instruction] --> F[Token Embeddings]
    B --> G[Projection Layer]
    D --> G
    G --> H[Interleaved Multimodal Tokens]
    F --> H
    H --> I[PaLM-540B Decoder]
    I --> J[Text Output: Action Plan]
    J --> K[Low-Level Policy]
    K --> L[Robot Actuators]
    L --> M[Environment]
    M --> A
    M --> C
    I --> N[Visual QA Answers]
    I --> O[Image Captions]
</code>
===== Code Example =====
<code python>
# Simplified PaLM-E multimodal input construction
import torch
import torch.nn as nn

class PaLME(nn.Module):
    def __init__(self, palm_model, vit_encoder, state_dim):
        super().__init__()
        self.palm = palm_model
        self.vit = vit_encoder
        # MLP mapping the robot state vector (pose, joint angles,
        # gripper state) to a single token embedding
        self.state_mlp = nn.Sequential(
            nn.Linear(state_dim, palm_model.embed_dim),
            nn.ReLU(),
            nn.Linear(palm_model.embed_dim, palm_model.embed_dim),
        )
        # Linear projection from ViT features into the LM embedding space
        self.projector = nn.Linear(vit_encoder.dim, palm_model.embed_dim)

    def encode_multimodal(self, text, images, robot_state):
        text_tokens = self.palm.tokenize(text)
        text_embeds = self.palm.embed(text_tokens)
        visual_tokens = self.projector(self.vit(images))
        state_tokens = self.state_mlp(robot_state).unsqueeze(1)
        # Interleave text and sensor tokens into one multimodal sequence
        return torch.cat([text_embeds, visual_tokens, state_tokens], dim=1)

    def generate_plan(self, instruction, images, robot_state):
        tokens = self.encode_multimodal(instruction, images, robot_state)
        plan_text = self.palm.generate(inputs_embeds=tokens)
        # parse_actions (assumed helper) splits the generated text
        # into a list of executable action steps
        return parse_actions(plan_text)

    def robotic_control_loop(self, task, environment):
        # Execute one step at a time and replan from fresh observations,
        # so the model can react to changes in the environment
        while not environment.task_complete():
            images = environment.get_camera_image()
            state = environment.get_robot_state()
            plan = self.generate_plan(task, images, state)
            environment.execute(plan[0])
</code>
===== Key Results =====
* **State-of-the-art on OK-VQA** without task-specific fine-tuning((https://arxiv.org/abs/2303.03378|Driess et al. "PaLM-E: An Embodied Multimodal Language Model" (2023)))
* Successfully controls real robots for long-horizon manipulation tasks
* **Positive transfer**: Joint training on vision-language and robotics data improves both domains
* Generalizes to **unseen objects** in one-shot and zero-shot settings
* Larger models show emergent capabilities: multimodal chain-of-thought reasoning
===== Model Variants =====
^ Model ^ LLM ^ Vision Encoder ^ Total Parameters ^
| PaLM-E-12B | PaLM-8B | ViT-4B | 12B |
| PaLM-E-84B | PaLM-62B | ViT-22B | 84B |
| PaLM-E-562B | PaLM-540B | ViT-22B | 562B |
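As a quick sanity check on the table, each variant's total is the sum of its LLM and vision-encoder parameter counts; the projection layers contribute comparatively few additional parameters:

<code python>
# (LLM params, vision encoder params) in billions, from the table above
variants = {
    "PaLM-E-12B": (8, 4),
    "PaLM-E-84B": (62, 22),
    "PaLM-E-562B": (540, 22),
}

for name, (llm_b, vit_b) in variants.items():
    total = llm_b + vit_b
    print(f"{name}: {llm_b}B + {vit_b}B = {total}B")
</code>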
===== See Also =====
* [[webgpt|WebGPT: Browser-Assisted Question Answering]]
* [[chemcrow|ChemCrow: LLM Agent with Chemistry Tools]]
* [[reasoning_via_planning|RAP: Reasoning via Planning]]
===== References =====