Inner Monologue: Embodied Reasoning with Language Models

Inner Monologue is a framework introduced by Huang et al. at Google Robotics in 2022 that enables LLM-based embodied agents to perform closed-loop planning by incorporating natural language feedback from the environment. By maintaining a continuous “inner monologue” of textual observations and feedback, the LLM can detect failures, adapt plans, and recover from errors without any additional training.

Overview

Prior work like SayCan demonstrated that LLMs can generate plausible action sequences for robots, but operated in an open-loop fashion – once a plan was generated, it was executed without adaptation. Inner Monologue closes this loop by feeding environment observations back to the LLM as natural language, enabling it to reason about what went wrong and how to adjust. This transforms the LLM from a static planner into a dynamic replanning agent.

Sources of Language Feedback

Inner Monologue investigates three complementary types of feedback, all expressed in natural language:

Success Detection: After each primitive action, a success detector provides binary feedback (“success” or “failure”) indicating whether the action achieved its intended effect. This is the minimal feedback needed for replanning.
Passive Scene Description: An object recognition or scene description module (e.g., a vision-language model) provides a textual summary of the current environment state. This gives the LLM grounding in what objects are present and their spatial relationships.
Active Scene Description: The LLM can query the environment by asking specific questions (e.g., “Is the red block on the table?”), receiving targeted responses. This enables hypothesis-driven exploration.

Additionally, human interaction serves as a fourth feedback channel where a human can provide corrections, clarifications, or new instructions mid-execution.

Closed-Loop Architecture

The system operates as a continuous loop:

$$\text{Instruction} \xrightarrow{\text{LLM}} \text{Action} \xrightarrow{\text{Execute}} \text{Environment} \xrightarrow{\text{Perceive}} \text{Feedback} \xrightarrow{\text{LLM}} \text{Next Action}$$

At each step $t$, the LLM receives the full history of actions and feedback as its “inner monologue”:

$$a_t = \text{LLM}(I, a_1, f_1, a_2, f_2, \ldots, a_{t-1}, f_{t-1})$$

where $I$ is the high-level instruction, $a_i$ are actions, and $f_i$ are feedback strings. This history provides the LLM with a running narrative of the task execution, enabling it to reason about the current state and decide the next action.

The key insight is that all components communicate through natural language, requiring no specialized interfaces or additional model training. The LLM's pre-trained reasoning capabilities are sufficient to interpret feedback and adjust plans.

Code Example

import openai
 
class InnerMonologueAgent:
    def __init__(self, client, skills, perceiver, success_detector):
        self.client = client
        self.skills = skills          # available robot primitives
        self.perceiver = perceiver    # scene description module
        self.detector = success_detector
 
    def execute_task(self, instruction, max_steps=20):
        monologue = [f"Task: {instruction}"]
        scene = self.perceiver.describe_scene()
        monologue.append(f"Scene: {scene}")
 
        for step in range(max_steps):
            # LLM plans next action based on full monologue
            prompt = (
                f"Available skills: {', '.join(self.skills)}\n"
                f"{'\n'.join(monologue)}\n"
                "What is the next action? Reply with a skill name and "
                "parameters, or 'done' if the task is complete."
            )
            action = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}]
            ).choices[0].message.content
 
            if "done" in action.lower():
                monologue.append("Task completed.")
                break
 
            monologue.append(f"Action: {action}")
 
            # Execute and get feedback
            success = self.detector.check(action)
            scene = self.perceiver.describe_scene()
 
            monologue.append(f"Success: {success}")
            monologue.append(f"Scene: {scene}")
 
            if not success:
                monologue.append("Replanning due to failure...")
 
        return monologue

Experimental Domains

Inner Monologue was evaluated across three domains of increasing complexity:

Simulated Tabletop Rearrangement: Pick-and-place tasks in simulation where the agent must arrange objects according to language instructions. Closed-loop feedback significantly improves task completion over open-loop baselines.
Real Robot Tabletop Tasks: Physical robot manipulation tasks with real objects, demonstrating transfer from simulation and robustness to real-world noise.
Long-Horizon Kitchen Manipulation: A mobile manipulator performing multi-step tasks in a real kitchen environment (e.g., “put the soda in the drawer”). These tasks involve navigation, grasping, and multi-object interaction over long horizons.

Across all domains, closed-loop language feedback significantly improves high-level instruction completion compared to open-loop planning.

Comparison with SayCan

Aspect	SayCan	Inner Monologue
Planning	Open-loop: plan once, execute	Closed-loop: continuous replanning
Feedback	Affordance scoring only	Rich language feedback (success, scene, queries)
Adaptation	No runtime adaptation	Detects failures and replans dynamically
Grounding	Value functions for affordance	Language-based perception modules
Training	Requires learned value functions	No additional training required

Inner Monologue can be viewed as complementary to SayCan: SayCan grounds action selection in physical affordances, while Inner Monologue adds the ability to recover from errors and adapt to unexpected situations.

AI Agent Knowledge Base

Sidebar

Table of Contents

Inner Monologue: Embodied Reasoning with Language Models

Overview

Sources of Language Feedback

Closed-Loop Architecture

Code Example

Experimental Domains

Comparison with SayCan

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Inner Monologue: Embodied Reasoning with Language Models

Overview

Sources of Language Feedback

Closed-Loop Architecture

Code Example

Experimental Domains

Comparison with SayCan

References

See Also

Page Tools