Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Inner Monologue is a framework introduced by Huang et al. at Google Robotics in 2022 that enables LLM-based embodied agents to perform closed-loop planning by incorporating natural language feedback from the environment. By maintaining a continuous “inner monologue” of textual observations and feedback, the LLM can detect failures, adapt plans, and recover from errors without any additional training.
Prior work like SayCan demonstrated that LLMs can generate plausible action sequences for robots, but operated in an open-loop fashion – once a plan was generated, it was executed without adaptation. Inner Monologue closes this loop by feeding environment observations back to the LLM as natural language, enabling it to reason about what went wrong and how to adjust. This transforms the LLM from a static planner into a dynamic replanning agent.
Inner Monologue investigates three complementary types of feedback, all expressed in natural language:

- **Success detection**: a binary signal indicating whether the most recent action succeeded or failed.
- **Passive scene description**: an unprompted description of the current scene, such as which objects are present and their states.
- **Active scene description**: answers to specific questions the LLM poses about the environment (e.g., "Is the drawer open?").
Additionally, human interaction serves as a fourth feedback channel where a human can provide corrections, clarifications, or new instructions mid-execution.
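These feedback channels can be sketched as simple functions that render observations into natural-language strings for the monologue. The function names and string formats below are illustrative assumptions, not interfaces from the paper:

```python
# Hedged sketch: rendering each feedback channel as a natural-language
# string to append to the monologue. Names and formats are hypothetical.

def success_feedback(ok):
    # Success detection: binary outcome of the last action, verbalized
    return "Success: yes" if ok else "Success: no"

def scene_feedback(objects):
    # Passive scene description: what the perception module reports unprompted
    return "Scene: " + ", ".join(objects)

def query_feedback(question, answer):
    # Active feedback: an answer to a question the LLM asked; the answer
    # may come from a perception module or from a human
    return f"Q: {question} A: {answer}"

# Example monologue fragment combining the channels
monologue = [
    "Task: put the apple in the drawer",
    scene_feedback(["apple", "drawer (closed)"]),
    "Action: open the drawer",
    success_feedback(True),
    query_feedback("Is the drawer open?", "yes"),
]
```

Because every channel reduces to plain text, new feedback sources can be added without changing the planner itself.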
The system operates as a continuous loop:
$$\text{Instruction} \xrightarrow{\text{LLM}} \text{Action} \xrightarrow{\text{Execute}} \text{Environment} \xrightarrow{\text{Perceive}} \text{Feedback} \xrightarrow{\text{LLM}} \text{Next Action}$$
At each step $t$, the LLM receives the full history of actions and feedback as its “inner monologue”:
$$a_t = \text{LLM}(I, a_1, f_1, a_2, f_2, \ldots, a_{t-1}, f_{t-1})$$
where $I$ is the high-level instruction, $a_i$ are actions, and $f_i$ are feedback strings. This history provides the LLM with a running narrative of the task execution, enabling it to reason about the current state and decide the next action.
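The context in the equation above can be flattened into a single prompt string by interleaving actions and feedback. The exact formatting below is an assumption for illustration:

```python
def build_context(instruction, history):
    """Flatten (I, a_1, f_1, ..., a_{t-1}, f_{t-1}) into one LLM prompt.

    history: list of (action, feedback) pairs from completed steps.
    """
    lines = [f"Instruction: {instruction}"]
    for action, feedback in history:
        lines.append(f"Robot action: {action}")
        lines.append(f"Feedback: {feedback}")
    lines.append("Next action:")
    return "\n".join(lines)

# A failed pick followed by a successful retry appears verbatim in the
# context, letting the LLM see and reason about the earlier failure.
ctx = build_context(
    "sort the blocks into matching bowls",
    [("pick up blue block", "Success: no"),
     ("pick up blue block", "Success: yes")],
)
```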
The key insight is that all components communicate through natural language, requiring no specialized interfaces or additional model training. The LLM's pre-trained reasoning capabilities are sufficient to interpret feedback and adjust plans.
```python
import openai


class InnerMonologueAgent:
    def __init__(self, client, skills, perceiver, success_detector):
        self.client = client
        self.skills = skills        # available robot primitives
        self.perceiver = perceiver  # scene description module
        self.detector = success_detector

    def execute_task(self, instruction, max_steps=20):
        monologue = [f"Task: {instruction}"]
        scene = self.perceiver.describe_scene()
        monologue.append(f"Scene: {scene}")

        for step in range(max_steps):
            # LLM plans the next action based on the full monologue
            history = "\n".join(monologue)
            prompt = (
                f"Available skills: {', '.join(self.skills)}\n"
                f"{history}\n"
                "What is the next action? Reply with a skill name and "
                "parameters, or 'done' if the task is complete."
            )
            action = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content

            if "done" in action.lower():
                monologue.append("Task completed.")
                break

            monologue.append(f"Action: {action}")

            # Execute the action, then gather language feedback
            success = self.detector.check(action)
            scene = self.perceiver.describe_scene()
            monologue.append(f"Success: {success}")
            monologue.append(f"Scene: {scene}")
            if not success:
                monologue.append("Replanning due to failure...")

        return monologue
```
Inner Monologue was evaluated across three domains of increasing complexity: simulated tabletop rearrangement, real-world tabletop rearrangement, and real-world mobile manipulation in a kitchen environment.
Across all domains, closed-loop language feedback significantly improves high-level instruction completion compared to open-loop planning.
| Aspect | SayCan | Inner Monologue |
|---|---|---|
| Planning | Open-loop: plan once, execute | Closed-loop: continuous replanning |
| Feedback | Affordance scoring only | Rich language feedback (success, scene, queries) |
| Adaptation | No runtime adaptation | Detects failures and replans dynamically |
| Grounding | Value functions for affordance | Language-based perception modules |
| Training | Requires learned value functions | No additional training required |
Inner Monologue can be viewed as complementary to SayCan: SayCan grounds action selection in physical affordances, while Inner Monologue adds the ability to recover from errors and adapt to unexpected situations.