LLM-driven closed-loop robotic control systems deploy multi-agent architectures where language models plan, generate executable code, and adapt via visual feedback to achieve zero-shot manipulation of novel objects in dynamic environments.
Traditional robotic manipulation relies on task-specific policies trained through extensive demonstration or reinforcement learning. LLM-based agents bypass this by leveraging pre-trained language understanding for zero-shot or few-shot task execution. Three systems exemplify this approach: ALRM uses multi-agent LLMs for zero-shot manipulation with planner-coder-supervisor roles, ManiAgent employs agentic skill selection with VLM-grounded replanning, and RoboClaw introduces multi-robot coordination through LLM-orchestrated dialogue.
All three systems share the same fundamental closed-loop pattern: observe, reason, act, and observe again:
<latex>\text{Loop}: s_t \xrightarrow{\text{VLM}} z_t \xrightarrow{\text{LLM}} a_t \xrightarrow{\text{Robot}} s_{t+1}</latex>
where $s_t$ is the sensory state (images, proprioception), $z_t$ is the language-grounded scene description, $a_t$ is the planned action, and $s_{t+1}$ is the resulting state. The loop continues until task completion or failure detection triggers replanning.
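The loop above can be sketched in a few lines of Python. This is a minimal sketch under assumed interfaces: `vlm.describe`, `llm.plan_action`, and `robot.step`/`robot.observe` are hypothetical placeholders, not APIs from any of the systems described.

```python
def closed_loop(vlm, llm, robot, goal, max_steps=50):
    """Observe -> reason -> act loop; terminates on task completion
    or when the step budget is exhausted."""
    s_t = robot.observe()                 # sensory state s_t (image, proprioception)
    for _ in range(max_steps):
        z_t = vlm.describe(s_t)           # language-grounded scene description z_t
        a_t = llm.plan_action(z_t, goal)  # planned action a_t
        if a_t == "done":                 # LLM signals task completion
            return True
        s_t = robot.step(a_t)             # execute a_t, obtain s_{t+1}
    return False                          # budget exhausted: treat as failure
```

In practice the failure branch would trigger replanning rather than a plain `False`, but the control flow is the same.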
ALRM (Agent-based LLM Robotic Manipulation) deploys a multi-agent architecture in which a planner decomposes the task, a coder generates executable robot code, and a supervisor monitors execution. Two properties stand out:
Zero-shot capability: No pre-trained manipulation policies are required. The LLM's in-context learning enables immediate deployment on unseen tasks.
Failure recovery: After each action, VLMs detect the state (e.g., bounding boxes, object positions); the supervisor integrates feedback (e.g., “grasp failed”) to prompt re-execution.
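The supervisor's failure check can be sketched by comparing VLM object detections before and after an action. The `detect_objects` interface and the `supervise` helper are hypothetical illustrations, not ALRM's actual API.

```python
def supervise(vlm, image_before, image_after, step):
    """Hypothetical supervisor check: flag a grasp failure when the
    target object's detected position is unchanged after the action."""
    before = {o["name"]: o["position"] for o in vlm.detect_objects(image_before)}
    after = {o["name"]: o["position"] for o in vlm.detect_objects(image_after)}
    target = step["object"]
    moved = before.get(target) != after.get(target)
    if step["skill"] == "grasp" and not moved:
        # feedback string is fed back into the planner prompt for re-execution
        return {"ok": False, "feedback": f"grasp of {target} failed"}
    return {"ok": True, "feedback": "step verified"}
```

A real supervisor would also check gripper force sensing and pose tolerances; position change alone is a coarse proxy.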
ManiAgent uses an LLM to dynamically select and parameterize manipulation primitives:
<latex>a_t = \text{LLM}(\text{skill\_library}, s_t, g) = (\text{skill}_k, \theta_k)</latex>
where $\text{skill}_k$ is selected from a library of primitives (grasp, place, push, rotate) and $\theta_k$ are the skill parameters (target pose, force, speed).
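A minimal sketch of validating an LLM-proposed $(\text{skill}_k, \theta_k)$ pair against such a library; the library contents and the `select_skill` helper are illustrative assumptions, not ManiAgent's actual interface.

```python
# Hypothetical skill library: each primitive declares its required parameters.
SKILL_LIBRARY = {
    "grasp": ["target_pose", "force"],
    "place": ["target_pose"],
    "push": ["target_pose", "speed"],
    "rotate": ["target_pose", "angle"],
}

def select_skill(llm_choice: dict) -> tuple[str, dict]:
    """Validate an LLM-proposed (skill_k, theta_k) pair before execution,
    rejecting unknown skills and incomplete parameter sets."""
    name, params = llm_choice["skill"], llm_choice["params"]
    if name not in SKILL_LIBRARY:
        raise ValueError(f"unknown skill: {name}")
    missing = [p for p in SKILL_LIBRARY[name] if p not in params]
    if missing:
        raise ValueError(f"missing parameters for {name}: {missing}")
    return name, params
```

Validating the LLM's structured output before handing it to the controller is what keeps hallucinated skills or parameters from reaching the robot.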
The VLM-grounded replanning loop adapts to unexpected outcomes:
<latex>s_{t+1} \neq s_{expected} \implies \text{replan}(s_{t+1}, g)</latex>
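In continuous state spaces the inequality is checked against a tolerance rather than exactly. A sketch, assuming states are position vectors and a hypothetical threshold `tol` in metres:

```python
import numpy as np

def needs_replan(s_next: np.ndarray, s_expected: np.ndarray, tol: float = 0.02) -> bool:
    """Trigger replanning when the observed state deviates from the
    expected state by more than `tol` (assumed threshold, in metres)."""
    return bool(np.linalg.norm(s_next - s_expected) > tol)
```

The threshold trades false replans (too tight) against missed failures (too loose) and would normally be tuned per skill.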
RoboClaw orchestrates multiple robots through LLM-mediated dialogue:
<latex>\text{Assignment}^* = \arg\min_{A} \sum_{i} C(r_i, \text{task}_{A(i)}) \quad \text{s.t.} \quad \text{capabilities}(r_i) \supseteq \text{req}(\text{task}_{A(i)})</latex>
where $C(r_i, \text{task})$ is the estimated cost of robot $r_i$ performing the task, and the assignment respects capability constraints.
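For small teams the constrained assignment can be solved by brute force over task permutations. This is a sketch of the optimization itself, not RoboClaw's LLM-mediated dialogue; all names (`assign_tasks`, the cost/capability dictionaries) are illustrative.

```python
from itertools import permutations

def assign_tasks(robots, tasks, cost, capabilities, requirements):
    """Exhaustive search for the minimum-cost robot->task assignment
    subject to capabilities(r_i) being a superset of req(task)."""
    best, best_cost = None, float("inf")
    for perm in permutations(tasks):
        # capability constraint: every robot must cover its task's requirements
        if not all(requirements[t] <= capabilities[r] for r, t in zip(robots, perm)):
            continue
        total = sum(cost[(r, t)] for r, t in zip(robots, perm))
        if total < best_cost:
            best, best_cost = dict(zip(robots, perm)), total
    return best, best_cost
```

Brute force is O(n!) and only viable for a handful of robots; larger fleets would use the Hungarian algorithm or, as in RoboClaw, negotiation through dialogue.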
A condensed implementation sketch of the closed-loop agent:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RobotState:
    joint_positions: np.ndarray
    gripper_open: bool
    ee_pose: np.ndarray  # end-effector position + orientation


@dataclass
class SceneObject:
    name: str
    bbox: np.ndarray
    position: np.ndarray
    graspable: bool


class RoboticManipAgent:
    def __init__(self, llm, vlm, robot_controller):
        self.llm = llm
        self.vlm = vlm
        self.robot = robot_controller

    def observe(self, camera_image: np.ndarray) -> list[SceneObject]:
        """Ground the camera image into a list of detected objects."""
        return self.vlm.detect_objects(camera_image)

    def plan(self, task: str, objects: list[SceneObject],
             robot_state: RobotState) -> list[dict]:
        """Prompt the LLM with the grounded scene to produce a step plan."""
        scene_desc = "\n".join(
            f"- {o.name} at {o.position.tolist()}, graspable={o.graspable}"
            for o in objects
        )
        plan = self.llm.generate(
            f"Task: {task}\n"
            f"Scene:\n{scene_desc}\n"
            f"Robot EE at: {robot_state.ee_pose.tolist()}\n"
            f"Plan sequential manipulation steps:"
        )
        return self.parse_plan(plan)

    def execute_with_feedback(self, task: str, max_retries: int = 3) -> bool:
        """Closed-loop execution: plan, act, verify each step via the VLM,
        and replan from fresh observations on failure."""
        for attempt in range(max_retries):
            image = self.robot.capture_image()
            objects = self.observe(image)
            plan = self.plan(task, objects, self.robot.get_state())
            for step in plan:
                code = self.generate_robot_code(step)
                self.robot.execute(code)
                new_image = self.robot.capture_image()
                new_objects = self.observe(new_image)
                if not self.verify_step(step, new_objects):
                    break  # step failed: re-observe and replan
            else:
                return True  # all steps verified
        return False

    def generate_robot_code(self, step: dict) -> str:
        return self.llm.generate(
            f"Generate ROS2 code for: {step}\n"
            f"Use MoveIt2 for motion planning."
        )
```
| System | Benchmark | Reported Result | Key Strength |
|---|---|---|---|
| ALRM | RLBench long-horizon | +20% success vs. CAP/VoxPoser | Zero-shot, no pre-trained policies |
| ALRM | Real-world Franka | Validated on unseen tasks | Sim-to-real transfer |
| ManiAgent | Manipulation benchmarks | High autonomy | Novel object generalization |
| RoboClaw | Multi-robot coordination | Emergent collaborative behaviors | Dynamic task reassignment |