====== PaLM-E: An Embodied Multimodal Language Model ======
**PaLM-E** is a **562 billion parameter** embodied multimodal language model introduced by Driess et al. (2023) at **Google**, combining vision, language, and robotic control in a single end-to-end trained model.((https://
[[https://
===== Architecture =====
PaLM-E injects continuous sensor observations directly into a pre-trained language model's embedding space, combining two pre-trained components:
  * **PaLM-540B**: the pre-trained language model backbone
  * **ViT-22B**: the Vision Transformer used as the visual encoder
Multimodal inputs are encoded as token-like embeddings that interleave with text tokens:
  * **Object Scene Representations**:
The model processes **multimodal sentences** with text and sensor data in arbitrary order, generating text outputs that serve as high-level plans for robotic control.((https://
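The interleaving described above can be sketched in a few lines. This is a toy illustration, not the released model: the embedding width, the tiny vocabulary, the ''<img>'' placeholder, and the random stand-ins for the LLM embedding table and the projected ViT features are all assumptions made for the example.

```python
# Toy sketch of a PaLM-E-style "multimodal sentence": projected sensor
# embeddings are interleaved with text-token embeddings in sentence order.
import numpy as np

D_MODEL = 8  # illustrative embedding width (PaLM-E uses the LLM's width)
VOCAB = {"<img>": 0, "Bring": 1, "me": 2, "the": 3, "green": 4, "block": 5, ".": 6}

rng = np.random.default_rng(0)
# Stand-in for the LLM's learned text-embedding table.
text_embedding = rng.normal(size=(len(VOCAB), D_MODEL))
# Stand-in for ViT image features already projected into the LLM's space.
vit_projection = rng.normal(size=(4, D_MODEL))

def encode_multimodal_sentence(tokens):
    """Map each token to an embedding row; '<img>' expands to the image patches."""
    rows = []
    for tok in tokens:
        if tok == "<img>":
            rows.extend(vit_projection)  # continuous observation -> token-like embeddings
        else:
            rows.append(text_embedding[VOCAB[tok]])
    return np.stack(rows)

seq = encode_multimodal_sentence(["Bring", "me", "the", "green", "block", "<img>", "."])
print(seq.shape)  # (10, 8): 6 text tokens + 4 image-patch embeddings
```

Because the image placeholder expands in place, text and sensor data can appear in any order within the sequence, which is what lets the frozen-architecture language model consume them as ordinary tokens.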
===== System Architecture =====
===== Key Results =====
  * **State-of-the-art on OK-VQA** without task-specific fine-tuning((https://
  * Successfully controls real robots for long-horizon manipulation tasks
  * **Positive transfer**: Joint training on vision-language and robotics data improves both domains
| PaLM-E-562B | PaLM-540B | ViT-22B | 562B |
===== See Also =====
  * [[chemcrow|ChemCrow:
  * [[reasoning_via_planning|RAP:
| + | |||
| + | ===== References ===== | ||