====== Robotic Manipulation Agents ======
LLM-driven closed-loop robotic control systems use multi-agent architectures in which language models plan, generate executable code, and adapt via visual feedback, enabling zero-shot manipulation of novel objects in dynamic environments.((https://arxiv.org/abs/2601.19510))
===== Overview =====
Traditional robotic manipulation relies on task-specific policies trained through extensive demonstration or reinforcement learning. LLM-based agents bypass this by leveraging pre-trained language understanding for zero-shot or few-shot task execution. Three systems exemplify this approach: ALRM uses multi-agent LLMs for zero-shot manipulation with planner-coder-supervisor roles,((https://arxiv.org/abs/2601.19510)) ManiAgent employs agentic skill selection with VLM-grounded replanning,((https://arxiv.org/abs/2510.11660)) and RoboClaw introduces multi-robot coordination through LLM-orchestrated dialogue.((https://arxiv.org/abs/2603.11558))
===== Closed-Loop Control Architecture =====
All three systems share the same fundamental closed-loop pattern: observe, reason, act, and repeat:
$$\text{Loop}: s_t \xrightarrow{\text{VLM}} z_t \xrightarrow{\text{LLM}} a_t \xrightarrow{\text{Robot}} s_{t+1}$$
where $s_t$ is the sensory state (images, proprioception), $z_t$ is the language-grounded scene description, $a_t$ is the planned action, and $s_{t+1}$ is the resulting state. The loop continues until task completion or failure detection triggers replanning.
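In code, the loop reduces to a small generic driver. The five callables below stand in for the VLM, LLM, and robot components and are purely illustrative placeholders, not an API from any of the cited systems:

<code python>
def closed_loop(observe, describe, plan, act, done, max_steps=10):
    """Generic observe-reason-act loop.

    observe()     -> s_t   (sensory state)
    describe(s_t) -> z_t   (language-grounded scene description)
    plan(z_t)     -> a_t   (next action)
    act(a_t)      -> None  (changes the world; next observe() yields s_{t+1})
    done(s_t)     -> bool  (task-completion check)
    """
    for step in range(max_steps):
        s_t = observe()
        if done(s_t):
            return True, step
        z_t = describe(s_t)
        a_t = plan(z_t)
        act(a_t)
    return False, max_steps


# Toy world: nudge a 1-D "gripper" toward position 3.
state = {"x": 0}
ok, steps = closed_loop(
    observe=lambda: state["x"],
    describe=lambda s: f"gripper at {s}",
    plan=lambda z: +1,                                    # always step right
    act=lambda a: state.__setitem__("x", state["x"] + a),
    done=lambda s: s == 3,
)
</code>

The loop terminates either when `done` fires (task completion) or when the step budget is exhausted, which is where the failure-detection and replanning machinery of the systems below takes over.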
===== ALRM: Multi-Agent Zero-Shot Manipulation =====
ALRM (Agent-based LLM Robotic Manipulation) deploys a multi-agent architecture:((https://arxiv.org/abs/2601.19510))
* **Planner Agent**: Decomposes high-level tasks into sub-tasks via prompting (e.g., "stack blocks" becomes a sequence of pick, place, and verify operations)
* **Coder Agent**: Translates sub-tasks into robot-executable code (e.g., PyBullet or ROS commands)
* **Supervisor Agent**: Monitors execution, processes VLM feedback on state changes, and triggers replanning on failure
**Zero-shot capability**: No pre-trained manipulation policies are required. The LLM's in-context learning enables immediate deployment on unseen tasks.
**Failure recovery**: After each action, VLMs detect the state (e.g., bounding boxes, object positions); the supervisor integrates feedback (e.g., "grasp failed") to prompt re-execution.
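The planner-coder-supervisor handoff can be sketched as plain functions. In ALRM each role is backed by an LLM prompt, so the hard-coded decomposition and recovery rules below are hypothetical stand-ins for LLM outputs:

<code python>
from dataclasses import dataclass


@dataclass
class SubTask:
    action: str
    target: str


def planner(task: str) -> list[SubTask]:
    # Hypothetical decomposition; the real planner prompts an LLM.
    if task == "stack red on blue":
        return [SubTask("pick", "red"),
                SubTask("place", "blue"),
                SubTask("verify", "stack")]
    return []


def coder(sub: SubTask) -> str:
    # Emits robot-executable code for one sub-task (placeholder template).
    return f"robot.{sub.action}('{sub.target}')"


def supervisor(sub: SubTask, vlm_feedback: str) -> str:
    # Maps VLM feedback to 'ok', 'retry' (re-execute code), or
    # 'replan' (hand the task back to the planner).
    if "failed" in vlm_feedback:
        return "retry" if sub.action == "pick" else "replan"
    return "ok"
</code>

The design choice mirrored here is the routing on failure: a failed grasp can simply be retried, while a failed placement may invalidate downstream sub-tasks and so goes back to the planner.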
===== ManiAgent: Agentic Skill Selection =====
ManiAgent uses an LLM to dynamically select and parameterize manipulation primitives:((https://arxiv.org/abs/2510.11660))
$$a_t = \text{LLM}(\text{skill\_library}, s_t, g) = (\text{skill}_k, \theta_k)$$
where $\text{skill}_k$ is selected from a library of primitives (grasp, place, push, rotate) and $\theta_k$ are the skill parameters (target pose, force, speed).
The VLM-grounded replanning loop adapts to unexpected outcomes:
$$s_{t+1} \neq s_{\text{expected}} \implies \text{replan}(s_{t+1}, g)$$
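A toy sketch of skill selection with replanning, assuming a 1-D state and a hand-written `select_skill` in place of the LLM; the skill library and disturbance model are illustrative only:

<code python>
SKILL_LIBRARY = {
    # name -> skill(state, params) -> next_state; toy 1-D primitives
    "push":  lambda s, p: s + p["distance"],
    "grasp": lambda s, p: s,
}


def select_skill(state: float, goal: float):
    """Stand-in for the LLM's (skill_k, theta_k) choice: push by the gap."""
    return "push", {"distance": goal - state}


def step_with_replanning(state: float, goal: float,
                         noise: float = 0.0, tol: float = 1e-6) -> float:
    """Execute the selected skill; replan once if the outcome deviates."""
    skill, theta = select_skill(state, goal)
    expected = SKILL_LIBRARY[skill](state, theta)
    actual = expected + noise           # simulated disturbance
    if abs(actual - expected) > tol:    # s_{t+1} != s_expected -> replan
        skill, theta = select_skill(actual, goal)
        actual = SKILL_LIBRARY[skill](actual, theta)
    return actual
</code>

With no disturbance the first skill call reaches the goal; with a disturbance, the replanning branch re-parameterizes the skill from the observed state and still converges.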
===== RoboClaw: Multi-Robot Coordination =====
RoboClaw orchestrates multiple robots through LLM-mediated dialogue:((https://arxiv.org/abs/2603.11558))
* **Task Allocator**: LLM distributes sub-tasks to available robots based on capabilities
* **Dialogue Protocol**: Robots communicate shared observations via structured messages
* **Collective Reasoning**: Group deliberation on shared state enables dynamic task reassignment
$$\text{Assignment}^* = \arg\min_{A} \sum_{i} C(r_i, \text{task}_{A(i)}) \quad \text{s.t.} \quad \text{capabilities}(r_i) \supseteq \text{req}(\text{task}_{A(i)})$$
where $C(r_i, \text{task})$ is the estimated cost of robot $r_i$ performing the task, and the assignment respects capability constraints.
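The constrained assignment can be illustrated with a brute-force search over robot-task pairings. The `assign` helper, robot names, and cost function below are hypothetical; exhaustive search is only a sketch, and RoboClaw delegates the choice to LLM-mediated dialogue rather than a solver:

<code python>
from itertools import permutations


def assign(robots, tasks, cost):
    """Brute-force the assignment argmin subject to capability constraints.

    robots: {name: set of capabilities}
    tasks:  {name: set of required capabilities}
    cost:   cost(robot_name, task_name) -> float
    Returns (best_assignment, best_total_cost), or (None, inf) if infeasible.
    """
    robot_names, task_names = list(robots), list(tasks)
    best, best_cost = None, float("inf")
    for perm in permutations(robot_names, len(task_names)):
        # capabilities(r) must be a superset of req(task)
        if any(not tasks[t] <= robots[r] for r, t in zip(perm, task_names)):
            continue
        total = sum(cost(r, t) for r, t in zip(perm, task_names))
        if total < best_cost:
            best, best_cost = dict(zip(task_names, perm)), total
    return best, best_cost


robots = {"A": {"grasp", "lift"}, "B": {"grasp"}}
tasks = {"move_box": {"grasp", "lift"}, "hold": {"grasp"}}
best, total = assign(robots, tasks, lambda r, t: {"A": 1, "B": 2}[r])
</code>

Here only robot A can lift, so the capability constraint forces `move_box` onto A even though cheaper pairings exist in the unconstrained problem.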
===== Code Example =====
<code python>
from dataclasses import dataclass

import numpy as np


@dataclass
class RobotState:
    joint_positions: np.ndarray
    gripper_open: bool
    ee_pose: np.ndarray  # end-effector position + orientation


@dataclass
class SceneObject:
    name: str
    bbox: np.ndarray
    position: np.ndarray
    graspable: bool


class RoboticManipAgent:
    """Closed-loop agent; parse_plan and verify_step are left abstract."""

    def __init__(self, llm, vlm, robot_controller):
        self.llm = llm
        self.vlm = vlm
        self.robot = robot_controller

    def observe(self, camera_image: np.ndarray) -> list[SceneObject]:
        # VLM grounds the camera image into a structured object list
        return self.vlm.detect_objects(camera_image)

    def plan(self, task: str, objects: list[SceneObject],
             robot_state: RobotState) -> list[dict]:
        scene_desc = "\n".join(
            f"- {o.name} at {o.position.tolist()}, "
            f"graspable={o.graspable}" for o in objects
        )
        plan = self.llm.generate(
            f"Task: {task}\n"
            f"Scene:\n{scene_desc}\n"
            f"Robot EE at: {robot_state.ee_pose.tolist()}\n"
            f"Plan sequential manipulation steps:"
        )
        return self.parse_plan(plan)

    def execute_with_feedback(self, task: str,
                              max_retries: int = 3) -> bool:
        for attempt in range(max_retries):
            image = self.robot.capture_image()
            objects = self.observe(image)
            plan = self.plan(task, objects, self.robot.get_state())
            for step in plan:
                code = self.generate_robot_code(step)
                self.robot.execute(code)
                new_image = self.robot.capture_image()
                new_objects = self.observe(new_image)
                if not self.verify_step(step, new_objects):
                    break  # failed step: re-observe and replan
            else:
                return True  # every step verified
        return False

    def generate_robot_code(self, step: dict) -> str:
        return self.llm.generate(
            f"Generate ROS2 code for: {step}\n"
            f"Use MoveIt2 for motion planning."
        )
</code>
===== Architecture =====
<code>
graph TD
    A[Task Instruction] --> B[Planner Agent]
    B --> C[Sub-Task Sequence]
    C --> D[Coder Agent]
    D --> E[Robot-Executable Code]
    E --> F[Robot Controller]
    F --> G[Action Execution]
    G --> H[Camera Observation]
    H --> I[VLM Scene Analysis]
    I --> J[Supervisor Agent]
    J --> K{Step Successful?}
    K -->|Yes| L{Task Complete?}
    K -->|No| M[Failure Analysis]
    M --> N{Replanning Needed?}
    N -->|Code Fix| D
    N -->|Plan Change| B
    L -->|Yes| O[Task Success]
    L -->|No| C

    subgraph RC["Multi-Robot Coordination (RoboClaw)"]
        P[Task Allocator] --> Q[Robot A]
        P --> R[Robot B]
        Q <--> S[Dialogue Protocol]
        R <--> S
        S --> T[Collective Reasoning]
        T --> P
    end
</code>
===== Key Results =====
^ System ^ Benchmark ^ Reported Result ^ Key Strength ^
| ALRM | RLBench long-horizon | +20% success vs. CAP/VoxPoser | Zero-shot, no pre-trained policies |
| ALRM | Real-world Franka | Validated on unseen tasks | Sim-to-real transfer |
| ManiAgent | Manipulation benchmarks | High autonomy | Novel object generalization |
| RoboClaw | Multi-robot coordination | Emergent collaborative behaviors | Dynamic task reassignment |
===== See Also =====
* [[game_playing_agents|Game Playing Agents]]
* [[image_editing_agents|Image Editing Agents]]
* [[devops_incident_agents|DevOps Incident Agents]]
===== References =====