PaLM-E is a 562-billion-parameter embodied multimodal language model introduced by Driess et al. (2023) at Google, combining vision, language, and robotic control in a single end-to-end trained model. With 2,457 citations, it is among the most influential works bridging foundation models and robotics, demonstrating that a unified model can perform manipulation, navigation, visual question answering, and language tasks without task-specific fine-tuning.
PaLM-E injects continuous sensor observations directly into a pre-trained language model's embedding space. The largest variant, PaLM-E-562B, combines the 540B-parameter PaLM language model with the 22B-parameter ViT vision encoder.
Multimodal inputs are encoded as token-like embeddings that interleave with text tokens:
$$x = [w_1, \ldots, w_n, \phi(o_1), w_{n+1}, \ldots, \phi(o_k), w_m]$$
where $w_i$ are text token embeddings and $\phi(o_j)$ are projected continuous observations (images, robot states, scene features). The projection function $\phi$ maps observations into the language model's embedding dimension via learned linear layers.
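The interleaving above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the dimensions, the two `nn.Linear` projections standing in for $\phi$, and the random tensors standing in for real token embeddings and observations are all hypothetical.

```python
# Minimal sketch of multimodal sentence construction (illustrative dimensions)
import torch
import torch.nn as nn

embed_dim = 64                 # LLM embedding dimension (hypothetical)
vision_dim, state_dim = 128, 7 # ViT feature dim, robot state dim (hypothetical)

# Learned projections phi mapping each modality into the LLM embedding space
phi_vision = nn.Linear(vision_dim, embed_dim)
phi_state = nn.Linear(state_dim, embed_dim)

# Stand-ins for text token embeddings w_i and continuous observations o_j
text_embeds = torch.randn(5, embed_dim)   # 5 text tokens
image_feats = torch.randn(3, vision_dim)  # 3 visual features from a ViT
robot_state = torch.randn(1, state_dim)   # 1 robot state vector

# Interleave text and projected observations into one input sequence:
# [w_1, w_2, w_3, phi(o_1..o_3), w_4, phi(o_4), w_5]
x = torch.cat([
    text_embeds[:3],
    phi_vision(image_feats),
    text_embeds[3:4],
    phi_state(robot_state),
    text_embeds[4:],
], dim=0)
print(x.shape)  # torch.Size([9, 64]) -- 5 text + 3 visual + 1 state tokens
```

The language model then consumes `x` exactly as it would a sequence of ordinary text embeddings; only the projection layers are modality-specific.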
The model processes multimodal sentences with text and sensor data interleaved in arbitrary order, generating text outputs that serve as high-level plans for robotic control.
```python
# Simplified PaLM-E multimodal input construction
import torch
import torch.nn as nn

class PaLME(nn.Module):
    def __init__(self, palm_model, vit_encoder, state_dim):
        super().__init__()
        self.palm = palm_model
        self.vit = vit_encoder
        # MLP projecting the low-dimensional robot state into the LLM embedding space
        self.state_mlp = nn.Sequential(
            nn.Linear(state_dim, palm_model.embed_dim),
            nn.ReLU(),
            nn.Linear(palm_model.embed_dim, palm_model.embed_dim),
        )
        # Linear projection from ViT feature dim to LLM embedding dim
        self.projector = nn.Linear(vit_encoder.dim, palm_model.embed_dim)

    def encode_multimodal(self, text, images, robot_state):
        text_tokens = self.palm.tokenize(text)
        text_embeds = self.palm.embed(text_tokens)
        visual_tokens = self.projector(self.vit(images))
        state_tokens = self.state_mlp(robot_state).unsqueeze(1)
        return torch.cat([text_embeds, visual_tokens, state_tokens], dim=1)

    def generate_plan(self, instruction, image, robot_state):
        tokens = self.encode_multimodal(instruction, image, robot_state)
        plan_text = self.palm.generate(inputs_embeds=tokens)
        return parse_actions(plan_text)  # parse generated text into executable actions

    def robotic_control_loop(self, task, environment):
        # Replan from fresh observations after each executed step
        while not environment.task_complete():
            image = environment.get_camera_image()
            state = environment.get_robot_state()
            plan = self.generate_plan(task, image, state)
            environment.execute(plan[0])
```
| Model | LLM | Vision Encoder | Total Parameters |
|---|---|---|---|
| PaLM-E-12B | PaLM-8B | ViT-4B | 12B |
| PaLM-E-84B | PaLM-62B | ViT-22B | 84B |
| PaLM-E-562B | PaLM-540B | ViT-22B | 562B |