PaLM-E: An Embodied Multimodal Language Model

PaLM-E is a 562 billion parameter embodied multimodal language model introduced by Driess et al. (2023) at Google, combining vision, language, and robotic control in a single end-to-end trained model.1) With 2,457 citations, it is one of the most impactful works bridging foundation models and robotics, demonstrating that a unified model can perform manipulation, navigation, visual question answering, and language tasks without task-specific fine-tuning.

arXiv:2303.03378

Architecture

PaLM-E injects continuous sensor observations directly into a pre-trained language model's embedding space. The largest variant, PaLM-E-562B, combines the PaLM-540B language model3) with the ViT-22B vision transformer4) for 562B total parameters.2)

Multimodal inputs are encoded as token-like embeddings that interleave with text tokens:

$$x = [w_1, \ldots, w_n, \phi(o_1), w_{n+1}, \ldots, \phi(o_k), w_m]$$

where $w_i$ are text token embeddings and $\phi(o_j)$ are projected continuous observations (images, robot states, scene features). The projection function $\phi$ maps observations into the language model's embedding dimension via learned linear layers.
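The interleaving formula above can be sketched directly in PyTorch. The dimensions below are toy values chosen for illustration, not the model's actual sizes:

```python
import torch
import torch.nn as nn

embed_dim = 256  # toy size; PaLM-E uses the LM's embedding width
obs_dim = 64     # toy size of raw observation features

# Learned projection phi mapping observations into the LM embedding space
phi = nn.Linear(obs_dim, embed_dim)

# Toy multimodal sequence mirroring
# x = [w_1..w_n, phi(o_1), w_{n+1}.., phi(o_k), w_m]
w = torch.randn(5, embed_dim)  # 5 text-token embeddings
o = torch.randn(2, obs_dim)    # 2 continuous observations

x = torch.cat([w[:3], phi(o[:1]), w[3:4], phi(o[1:]), w[4:]], dim=0)
print(x.shape)  # torch.Size([7, 256]) — 5 text + 2 observation tokens
```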

Input Encoders

The model processes multimodal sentences with text and sensor data in arbitrary order, generating text outputs that serve as high-level plans for robotic control.5)
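One way to picture a multimodal sentence with sensor data in arbitrary order is a placeholder scheme where observation slots are spliced between text segments. The `<img>`-style convention, dimensions, and helper below are illustrative, not the paper's actual tokenizer:

```python
import torch
import torch.nn as nn

embed_dim = 32
text_embed = nn.Embedding(1000, embed_dim)  # toy vocabulary
phi = nn.Linear(16, embed_dim)              # observation projection

def build_multimodal_sentence(segments, observations):
    """Interleave embedded text segments with projected observations.

    segments: list of LongTensors of token ids (len = len(observations) + 1)
    observations: list of (num_obs_tokens, 16) feature tensors
    """
    parts = [text_embed(segments[0])]
    for obs, seg in zip(observations, segments[1:]):
        parts.append(phi(obs))        # observation tokens at this slot
        parts.append(text_embed(seg)) # following text segment
    return torch.cat(parts, dim=0)

# e.g. "Q: What happened between <img> and <img>?" with two image slots
segs = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5])]
obs = [torch.randn(4, 16), torch.randn(4, 16)]
x = build_multimodal_sentence(segs, obs)
print(x.shape)  # torch.Size([13, 32]): 3 + 4 + 1 + 4 + 1 tokens
```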

System Architecture

graph TD
    A[Camera Images] --> B[ViT-22B Encoder]
    C[Robot State] --> D[MLP Encoder]
    E[Text Instruction] --> F[Token Embeddings]
    B --> G[Projection Layer]
    D --> G
    G --> H[Interleaved Multimodal Tokens]
    F --> H
    H --> I[PaLM-540B Decoder]
    I --> J[Text Output: Action Plan]
    J --> K[Low-Level Policy]
    K --> L[Robot Actuators]
    L --> M[Environment]
    M --> A
    M --> C
    I --> N[Visual QA Answers]
    I --> O[Image Captions]

Code Example

# Simplified PaLM-E multimodal input construction
import torch
import torch.nn as nn
 
class PaLME(nn.Module):
    def __init__(self, palm_model, vit_encoder, state_dim):
        super().__init__()
        self.palm = palm_model
        self.vit = vit_encoder
        self.state_mlp = nn.Sequential(
            nn.Linear(state_dim, palm_model.embed_dim),
            nn.ReLU(),
            nn.Linear(palm_model.embed_dim, palm_model.embed_dim),
        )
        self.projector = nn.Linear(vit_encoder.dim, palm_model.embed_dim)
 
    def encode_multimodal(self, text, images, robot_state):
        # Embed the text instruction with the pre-trained LM's embedding table
        text_tokens = self.palm.tokenize(text)
        text_embeds = self.palm.embed(text_tokens)
        # Project ViT patch features and the robot state into the LM space
        visual_tokens = self.projector(self.vit(images))
        state_tokens = self.state_mlp(robot_state).unsqueeze(1)
        # Simplified: modalities are appended; the full model interleaves
        # them at placeholder positions within the text
        return torch.cat([text_embeds, visual_tokens, state_tokens], dim=1)
 
    def generate_plan(self, instruction, image, robot_state):
        tokens = self.encode_multimodal(instruction, image, robot_state)
        plan_text = self.palm.generate(inputs_embeds=tokens)
        # Split the generated text plan into executable steps,
        # e.g. "go to the drawer; open the drawer"
        return [step.strip() for step in plan_text.split(";") if step.strip()]
 
    def robotic_control_loop(self, task, environment):
        # Closed-loop control: replan from fresh observations at every step
        while not environment.task_complete():
            image = environment.get_camera_image()
            state = environment.get_robot_state()
            plan = self.generate_plan(task, image, state)
            environment.execute(plan[0])  # execute only the first step, then replan

Key Results

PaLM-E-562B achieves state-of-the-art performance on the OK-VQA visual question answering benchmark without task-specific fine-tuning, while largely retaining PaLM's general language capabilities. The paper also reports positive transfer: training jointly across robotics, vision-language, and language domains improves performance compared to training on individual tasks.

Model Variants

Model         LLM         Vision Encoder   Total Parameters
PaLM-E-12B    PaLM-8B     ViT-4B           12B
PaLM-E-84B    PaLM-62B    ViT-22B          84B
PaLM-E-562B   PaLM-540B   ViT-22B          562B
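The parameter totals in the table are the sum of the LLM and vision-encoder sizes (the learned projection layers add a comparatively negligible count). A quick check:

```python
# (LLM, LLM params in billions, vision encoder, ViT params in billions)
variants = {
    "PaLM-E-12B":  ("PaLM-8B",   8,   "ViT-4B",  4),
    "PaLM-E-84B":  ("PaLM-62B",  62,  "ViT-22B", 22),
    "PaLM-E-562B": ("PaLM-540B", 540, "ViT-22B", 22),
}

for name, (llm, llm_b, vit, vit_b) in variants.items():
    total = llm_b + vit_b
    print(f"{name}: {llm} ({llm_b}B) + {vit} ({vit_b}B) = {total}B")
```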

References

1), 2)
Driess et al., “PaLM-E: An Embodied Multimodal Language Model” (2023). https://arxiv.org/abs/2303.03378
3)
Chowdhery et al., “PaLM: Scaling Language Modeling with Pathways” (2022). https://arxiv.org/abs/2204.02311
4)
Dehghani et al., “Scaling Vision Transformers to 22 Billion Parameters” (2023). https://arxiv.org/abs/2302.05442
5)
PaLM-E Project Page. https://palm-e.github.io