PaLM-E: An Embodied Multimodal Language Model

PaLM-E is a 562 billion parameter embodied multimodal language model introduced by Driess et al. (2023) at Google, combining vision, language, and robotic control in a single end-to-end trained model.1) With 2,457 citations, it is one of the most impactful works bridging foundation models and robotics, demonstrating that a unified model can perform manipulation, navigation, visual question answering, and language tasks without task-specific fine-tuning.

arXiv:2303.03378

Architecture

PaLM-E injects continuous sensor observations directly into a pre-trained language model's embedding space. The largest variant, PaLM-E-562B, combines the PaLM-540B language model3) with the ViT-22B vision transformer4) for 562B total parameters.2)

Multimodal inputs are encoded as token-like embeddings that interleave with text tokens:

$$x = [w_1, \ldots, w_n, \phi(o_1), w_{n+1}, \ldots, \phi(o_k), w_m]$$

where $w_i$ are text token embeddings and $\phi(o_j)$ are projected continuous observations (images, robot states, scene features). The projection function $\phi$ maps observations into the language model's embedding dimension via learned linear layers.
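The interleaving formula above can be sketched directly in PyTorch. The dimensions below are toy values chosen for illustration, not the model's actual sizes:

```python
import torch
import torch.nn as nn

embed_dim = 256  # toy size; PaLM-E uses the LM's embedding width
obs_dim = 64     # toy size of raw observation features

# Learned projection phi mapping observations into the LM embedding space
phi = nn.Linear(obs_dim, embed_dim)

# Toy multimodal sequence mirroring
# x = [w_1..w_n, phi(o_1), w_{n+1}.., phi(o_k), w_m]
w = torch.randn(5, embed_dim)  # 5 text-token embeddings
o = torch.randn(2, obs_dim)    # 2 continuous observations

x = torch.cat([w[:3], phi(o[:1]), w[3:4], phi(o[1:]), w[4:]], dim=0)
print(x.shape)  # torch.Size([7, 256]) — 5 text + 2 observation tokens
```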

Input Encoders

The model processes multimodal sentences with text and sensor data in arbitrary order, generating text outputs that serve as high-level plans for robotic control.5)
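One way to picture a multimodal sentence with sensor data in arbitrary order is a placeholder scheme where observation slots are spliced between text segments. The `<img>`-style convention, dimensions, and helper below are illustrative, not the paper's actual tokenizer:

```python
import torch
import torch.nn as nn

embed_dim = 32
text_embed = nn.Embedding(1000, embed_dim)  # toy vocabulary
phi = nn.Linear(16, embed_dim)              # observation projection

def build_multimodal_sentence(segments, observations):
    """Interleave embedded text segments with projected observations.

    segments: list of LongTensors of token ids (len = len(observations) + 1)
    observations: list of (num_obs_tokens, 16) feature tensors
    """
    parts = [text_embed(segments[0])]
    for obs, seg in zip(observations, segments[1:]):
        parts.append(phi(obs))        # observation tokens at this slot
        parts.append(text_embed(seg)) # following text segment
    return torch.cat(parts, dim=0)

# e.g. "Q: What happened between <img> and <img>?" with two image slots
segs = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5])]
obs = [torch.randn(4, 16), torch.randn(4, 16)]
x = build_multimodal_sentence(segs, obs)
print(x.shape)  # torch.Size([13, 32]): 3 + 4 + 1 + 4 + 1 tokens
```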

System Architecture

graph TD
    A[Camera Images] --> B[ViT-22B Encoder]
    C[Robot State] --> D[MLP Encoder]
    E[Text Instruction] --> F[Token Embeddings]
    B --> G[Projection Layer]
    D --> G
    G --> H[Interleaved Multimodal Tokens]
    F --> H
    H --> I[PaLM-540B Decoder]
    I --> J[Text Output: Action Plan]
    J --> K[Low-Level Policy]
    K --> L[Robot Actuators]
    L --> M[Environment]
    M --> A
    M --> C
    I --> N[Visual QA Answers]
    I --> O[Image Captions]

Code Example

# Simplified PaLM-E multimodal input construction
import torch
import torch.nn as nn
 
class PaLME(nn.Module):
    def __init__(self, palm_model, vit_encoder, state_dim):
        super().__init__()
        self.palm = palm_model
        self.vit = vit_encoder
        self.state_mlp = nn.Sequential(
            nn.Linear(state_dim, palm_model.embed_dim),
            nn.ReLU(),
            nn.Linear(palm_model.embed_dim, palm_model.embed_dim),
        )
        self.projector = nn.Linear(vit_encoder.dim, palm_model.embed_dim)
 
    def encode_multimodal(self, text, images, robot_state):
        # Embed the text instruction with the pre-trained LM's embedding table
        text_tokens = self.palm.tokenize(text)
        text_embeds = self.palm.embed(text_tokens)
        # Project ViT patch features and the robot state into the LM space
        visual_tokens = self.projector(self.vit(images))
        state_tokens = self.state_mlp(robot_state).unsqueeze(1)
        # Simplified: modalities are appended; the full model interleaves
        # them at placeholder positions within the text
        return torch.cat([text_embeds, visual_tokens, state_tokens], dim=1)
 
    def generate_plan(self, instruction, image, robot_state):
        tokens = self.encode_multimodal(instruction, image, robot_state)
        plan_text = self.palm.generate(inputs_embeds=tokens)
        # Split the generated text plan into executable steps,
        # e.g. "go to the drawer; open the drawer"
        return [step.strip() for step in plan_text.split(";") if step.strip()]
 
    def robotic_control_loop(self, task, environment):
        # Closed-loop control: replan from fresh observations at every step
        while not environment.task_complete():
            image = environment.get_camera_image()
            state = environment.get_robot_state()
            plan = self.generate_plan(task, image, state)
            environment.execute(plan[0])  # execute only the first step, then replan

Key Results

PaLM-E-562B achieves state-of-the-art performance on the OK-VQA visual question answering benchmark without task-specific fine-tuning, while largely retaining PaLM's general language capabilities. The paper also reports positive transfer: training jointly across robotics, vision-language, and language domains improves performance compared to training on individual tasks.

Model Variants

Model         LLM         Vision Encoder   Total Parameters
PaLM-E-12B    PaLM-8B     ViT-4B           12B
PaLM-E-84B    PaLM-62B    ViT-22B          84B
PaLM-E-562B   PaLM-540B   ViT-22B          562B
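The parameter totals in the table are the sum of the LLM and vision-encoder sizes (the learned projection layers add a comparatively negligible count). A quick check:

```python
# (LLM, LLM params in billions, vision encoder, ViT params in billions)
variants = {
    "PaLM-E-12B":  ("PaLM-8B",   8,   "ViT-4B",  4),
    "PaLM-E-84B":  ("PaLM-62B",  62,  "ViT-22B", 22),
    "PaLM-E-562B": ("PaLM-540B", 540, "ViT-22B", 22),
}

for name, (llm, llm_b, vit, vit_b) in variants.items():
    total = llm_b + vit_b
    print(f"{name}: {llm} ({llm_b}B) + {vit} ({vit_b}B) = {total}B")
```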

References

1), 2)
Driess et al., “PaLM-E: An Embodied Multimodal Language Model” (2023). https://arxiv.org/abs/2303.03378
3)
Chowdhery et al., “PaLM: Scaling Language Modeling with Pathways” (2022). https://arxiv.org/abs/2204.02311
4)
Dehghani et al., “Scaling Vision Transformers to 22 Billion Parameters” (2023). https://arxiv.org/abs/2302.05442
5)
PaLM-E Project Page. https://palm-e.github.io