====== Robotic Manipulation Agents ======

LLM-driven closed-loop robotic control systems deploy multi-agent architectures where language models plan, generate executable code, and adapt via visual feedback to achieve zero-shot manipulation of novel objects in dynamic environments.((https://arxiv.org/abs/2601.19510))

===== Overview =====

Traditional robotic manipulation relies on task-specific policies trained through extensive demonstration or reinforcement learning. LLM-based agents bypass this by leveraging pre-trained language understanding for zero-shot or few-shot task execution. Three systems exemplify this approach: ALRM uses multi-agent LLMs for zero-shot manipulation with planner-coder-supervisor roles,((https://arxiv.org/abs/2601.19510)) ManiAgent employs agentic skill selection with VLM-grounded replanning,((https://arxiv.org/abs/2510.11660)) and RoboClaw introduces multi-robot coordination through LLM-orchestrated dialogue.((https://arxiv.org/abs/2603.11558))

===== Closed-Loop Control Architecture =====

All three systems share the same fundamental closed-loop pattern: observe, reason, act, then observe again:

$$\text{Loop}: s_t \xrightarrow{\text{VLM}} z_t \xrightarrow{\text{LLM}} a_t \xrightarrow{\text{Robot}} s_{t+1}$$

where $s_t$ is the sensory state (images, proprioception), $z_t$ is the language-grounded scene description, $a_t$ is the planned action, and $s_{t+1}$ is the resulting state. The loop continues until task completion, or until failure detection triggers replanning.
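This loop can be made concrete as a small sketch. The ''observe'', ''decide'', and ''act'' callables below are hypothetical stand-ins for the VLM, LLM, and robot controller; they are not APIs from any of the cited systems:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    state: dict  # s_{t+1}: new sensory state after acting
    done: bool   # task-completion signal


def closed_loop(observe: Callable[[dict], str],     # VLM: s_t -> z_t
                decide: Callable[[str, str], str],  # LLM: (z_t, g) -> a_t
                act: Callable[[str], StepResult],   # robot: a_t -> s_{t+1}
                initial_state: dict,
                goal: str,
                max_steps: int = 10) -> bool:
    """Run observe -> reason -> act until done or the step budget is spent."""
    state = initial_state
    for _ in range(max_steps):
        z = observe(state)        # language-grounded scene description
        action = decide(z, goal)  # planned action
        result = act(action)      # execute on the robot
        if result.done:
            return True
        state = result.state      # feed s_{t+1} back into the loop
    return False
```

The step budget stands in for the failure-detection path: a real system would replan on a failed verification rather than simply give up.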
===== ALRM: Multi-Agent Zero-Shot Manipulation =====

ALRM (Agent-based LLM Robotic Manipulation) deploys a multi-agent architecture:((https://arxiv.org/abs/2601.19510))

  * **Planner Agent**: decomposes high-level tasks into sub-tasks via prompting (e.g., "stack blocks" becomes a sequence of pick, place, and verify operations)
  * **Coder Agent**: translates sub-tasks into robot-executable code (e.g., PyBullet or ROS commands)
  * **Supervisor Agent**: monitors execution, processes VLM feedback on state changes, and triggers replanning on failure

**Zero-shot capability**: no pre-trained manipulation policies are required; the LLM's in-context learning enables immediate deployment on unseen tasks.

**Failure recovery**: after each action, a VLM detects the resulting state (e.g., bounding boxes, object positions); the supervisor integrates this feedback (e.g., "grasp failed") to prompt re-execution.

===== ManiAgent: Agentic Skill Selection =====

ManiAgent uses an LLM to dynamically select and parameterize manipulation primitives:((https://arxiv.org/abs/2510.11660))

$$a_t = \text{LLM}(\text{skill\_library}, s_t, g) = (\text{skill}_k, \theta_k)$$

where $\text{skill}_k$ is selected from a library of primitives (grasp, place, push, rotate) and $\theta_k$ are the skill parameters (target pose, force, speed).
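A minimal sketch of this selection step, assuming a plain-Python skill registry and an ''llm_choose'' callable standing in for the LLM (both hypothetical, not ManiAgent's actual interfaces):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    name: str           # e.g. "grasp", "place", "push", "rotate"
    param_names: tuple  # accepted theta_k keys, e.g. ("target_pose", "force")


SKILL_LIBRARY = {
    "grasp":  Skill("grasp",  ("target_pose", "force")),
    "place":  Skill("place",  ("target_pose",)),
    "push":   Skill("push",   ("target_pose", "speed")),
    "rotate": Skill("rotate", ("target_pose", "angle")),
}


def select_skill(llm_choose: Callable[[str], dict],
                 scene: str, goal: str) -> tuple[Skill, dict]:
    """a_t = LLM(skill_library, s_t, g) -> (skill_k, theta_k)."""
    prompt = (
        f"Skills: {sorted(SKILL_LIBRARY)}\n"
        f"Scene: {scene}\nGoal: {goal}\n"
        "Reply with a skill name and its parameters."
    )
    choice = llm_choose(prompt)  # e.g. {"skill": "grasp", "params": {...}}
    skill = SKILL_LIBRARY[choice["skill"]]
    # Keep only parameters the chosen skill actually accepts (theta_k).
    theta = {k: v for k, v in choice["params"].items()
             if k in skill.param_names}
    return skill, theta
```

Filtering the returned parameters against the skill's declared signature is one simple way to guard against the LLM hallucinating parameters a primitive does not take.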
The VLM-grounded replanning loop adapts to unexpected outcomes:

$$s_{t+1} \neq s_{\text{expected}} \implies \text{replan}(s_{t+1}, g)$$

===== RoboClaw: Multi-Robot Coordination =====

RoboClaw orchestrates multiple robots through LLM-mediated dialogue:((https://arxiv.org/abs/2603.11558))

  * **Task Allocator**: LLM distributes sub-tasks to available robots based on capabilities
  * **Dialogue Protocol**: robots communicate shared observations via structured messages
  * **Collective Reasoning**: group deliberation on shared state enables dynamic task reassignment

$$\text{Assignment}^* = \arg\min_{A} \sum_{i} C(r_i, \text{task}_{A(i)}) \quad \text{s.t.} \quad \text{capabilities}(r_i) \supseteq \text{req}(\text{task}_{A(i)})$$

where $C(r_i, \text{task})$ is the estimated cost of robot $r_i$ performing the task, and the assignment respects capability constraints.

===== Code Example =====

<code python>
from dataclasses import dataclass

import numpy as np


@dataclass
class RobotState:
    joint_positions: np.ndarray
    gripper_open: bool
    ee_pose: np.ndarray  # end-effector position + orientation


@dataclass
class SceneObject:
    name: str
    bbox: np.ndarray
    position: np.ndarray
    graspable: bool


class RoboticManipAgent:
    def __init__(self, llm, vlm, robot_controller):
        self.llm = llm
        self.vlm = vlm
        self.robot = robot_controller

    def observe(self, camera_image: np.ndarray) -> list[SceneObject]:
        # Ground the raw image into named, localized objects via the VLM.
        return self.vlm.detect_objects(camera_image)

    def plan(self, task: str, objects: list[SceneObject],
             robot_state: RobotState) -> list[dict]:
        scene_desc = "\n".join(
            f"- {o.name} at {o.position.tolist()}, "
            f"graspable={o.graspable}"
            for o in objects
        )
        plan = self.llm.generate(
            f"Task: {task}\n"
            f"Scene:\n{scene_desc}\n"
            f"Robot EE at: {robot_state.ee_pose.tolist()}\n"
            f"Plan sequential manipulation steps:"
        )
        return self.parse_plan(plan)

    def parse_plan(self, plan_text: str) -> list[dict]:
        # Simplified: one step per non-empty line of the LLM's plan.
        return [{"instruction": line.strip()}
                for line in plan_text.splitlines() if line.strip()]

    def verify_step(self, step: dict, objects: list[SceneObject]) -> bool:
        # Simplified: ask the LLM whether the step's postcondition holds.
        answer = self.llm.generate(
            f"Step: {step}\nObjects: {[o.name for o in objects]}\n"
            f"Did the step succeed? Answer yes or no:"
        )
        return answer.strip().lower().startswith("yes")

    def execute_with_feedback(self, task: str, max_retries: int = 3) -> bool:
        for _attempt in range(max_retries):
            image = self.robot.capture_image()
            objects = self.observe(image)
            plan = self.plan(task, objects, self.robot.get_state())
            for step in plan:
                code = self.generate_robot_code(step)
                self.robot.execute(code)
                new_image = self.robot.capture_image()
                new_objects = self.observe(new_image)
                if not self.verify_step(step, new_objects):
                    break  # failed step: fall through to a fresh replan
            else:
                return True  # every step verified
        return False

    def generate_robot_code(self, step: dict) -> str:
        return self.llm.generate(
            f"Generate ROS2 code for: {step}\n"
            f"Use MoveIt2 for motion planning."
        )
</code>

===== Architecture =====

<code>
graph TD
    A[Task Instruction] --> B[Planner Agent]
    B --> C[Sub-Task Sequence]
    C --> D[Coder Agent]
    D --> E[Robot-Executable Code]
    E --> F[Robot Controller]
    F --> G[Action Execution]
    G --> H[Camera Observation]
    H --> I[VLM Scene Analysis]
    I --> J[Supervisor Agent]
    J --> K{Step Successful?}
    K -->|Yes| L{Task Complete?}
    K -->|No| M[Failure Analysis]
    M --> N{Replanning Needed?}
    N -->|Code Fix| D
    N -->|Plan Change| B
    L -->|Yes| O[Task Success]
    L -->|No| C

    subgraph "Multi-Robot - RoboClaw"
        P[Task Allocator] --> Q[Robot A]
        P --> R[Robot B]
        Q <--> S[Dialogue Protocol]
        R <--> S
        S --> T[Collective Reasoning]
        T --> P
    end
</code>

===== Key Results =====

^ System ^ Benchmark ^ vs. Baseline ^ Key Strength ^
| ALRM | RLBench long-horizon | +20% success vs. CAP/VoxPoser | Zero-shot, no pre-trained policies |
| ALRM | Real-world Franka | Validated on unseen tasks | Sim-to-real transfer |
| ManiAgent | Manipulation benchmarks | High autonomy | Novel object generalization |
| RoboClaw | Multi-robot coordination | Emergent collaborative behaviors | Dynamic task reassignment |

===== See Also =====

  * [[game_playing_agents|Game Playing Agents]]
  * [[image_editing_agents|Image Editing Agents]]
  * [[devops_incident_agents|DevOps Incident Agents]]

===== References =====