LLM-driven closed-loop robotic control systems deploy multi-agent architectures where language models plan, generate executable code, and adapt via visual feedback to achieve zero-shot manipulation of novel objects in dynamic environments.
Traditional robotic manipulation relies on task-specific policies trained through extensive demonstration or reinforcement learning. LLM-based agents bypass this by leveraging pre-trained language understanding for zero-shot or few-shot task execution. Three systems exemplify this approach: ALRM uses multi-agent LLMs for zero-shot manipulation with planner-coder-supervisor roles, ManiAgent employs agentic skill selection with VLM-grounded replanning, and RoboClaw introduces multi-robot coordination through LLM-orchestrated dialogue.
All three systems share the same fundamental closed-loop pattern: observe, reason, act, and observe again:
<latex>\text{Loop}: s_t \xrightarrow{\text{VLM}} z_t \xrightarrow{\text{LLM}} a_t \xrightarrow{\text{Robot}} s_{t+1}</latex>
where $s_t$ is the sensory state (images, proprioception), $z_t$ is the language-grounded scene description, $a_t$ is the planned action, and $s_{t+1}$ is the resulting state. The loop continues until task completion or failure detection triggers replanning.
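The loop above can be sketched in a few lines of Python. This is a minimal sketch under assumed interfaces: `vlm.describe`, `llm.plan_action`, and `robot.step`/`robot.observe` are hypothetical placeholders, not APIs from any of the systems described.

```python
def closed_loop(vlm, llm, robot, goal, max_steps=50):
    """Observe -> reason -> act loop; terminates on task completion
    or when the step budget is exhausted."""
    s_t = robot.observe()                 # sensory state s_t (image, proprioception)
    for _ in range(max_steps):
        z_t = vlm.describe(s_t)           # language-grounded scene description z_t
        a_t = llm.plan_action(z_t, goal)  # planned action a_t
        if a_t == "done":                 # LLM signals task completion
            return True
        s_t = robot.step(a_t)             # execute a_t, obtain s_{t+1}
    return False                          # budget exhausted: treat as failure
```

In practice the failure branch would trigger replanning rather than a plain `False`, but the control flow is the same.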
ALRM (Agent-based LLM Robotic Manipulation) deploys a multi-agent architecture in which a planner decomposes the task, a coder generates executable robot code, and a supervisor monitors execution. Two properties stand out:
Zero-shot capability: No pre-trained manipulation policies are required. The LLM's in-context learning enables immediate deployment on unseen tasks.
Failure recovery: After each action, VLMs detect the state (e.g., bounding boxes, object positions); the supervisor integrates feedback (e.g., “grasp failed”) to prompt re-execution.
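The supervisor's failure check can be sketched by comparing VLM object detections before and after an action. The `detect_objects` interface and the `supervise` helper are hypothetical illustrations, not ALRM's actual API.

```python
def supervise(vlm, image_before, image_after, step):
    """Hypothetical supervisor check: flag a grasp failure when the
    target object's detected position is unchanged after the action."""
    before = {o["name"]: o["position"] for o in vlm.detect_objects(image_before)}
    after = {o["name"]: o["position"] for o in vlm.detect_objects(image_after)}
    target = step["object"]
    moved = before.get(target) != after.get(target)
    if step["skill"] == "grasp" and not moved:
        # feedback string is fed back into the planner prompt for re-execution
        return {"ok": False, "feedback": f"grasp of {target} failed"}
    return {"ok": True, "feedback": "step verified"}
```

A real supervisor would also check gripper force sensing and pose tolerances; position change alone is a coarse proxy.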
ManiAgent uses an LLM to dynamically select and parameterize manipulation primitives:
<latex>a_t = \text{LLM}(\text{skill\_library}, s_t, g) = (\text{skill}_k, \theta_k)</latex>
where $\text{skill}_k$ is selected from a library of primitives (grasp, place, push, rotate) and $\theta_k$ are the skill parameters (target pose, force, speed).
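A minimal sketch of validating an LLM-proposed $(\text{skill}_k, \theta_k)$ pair against such a library; the library contents and the `select_skill` helper are illustrative assumptions, not ManiAgent's actual interface.

```python
# Hypothetical skill library: each primitive declares its required parameters.
SKILL_LIBRARY = {
    "grasp": ["target_pose", "force"],
    "place": ["target_pose"],
    "push": ["target_pose", "speed"],
    "rotate": ["target_pose", "angle"],
}

def select_skill(llm_choice: dict) -> tuple[str, dict]:
    """Validate an LLM-proposed (skill_k, theta_k) pair before execution,
    rejecting unknown skills and incomplete parameter sets."""
    name, params = llm_choice["skill"], llm_choice["params"]
    if name not in SKILL_LIBRARY:
        raise ValueError(f"unknown skill: {name}")
    missing = [p for p in SKILL_LIBRARY[name] if p not in params]
    if missing:
        raise ValueError(f"missing parameters for {name}: {missing}")
    return name, params
```

Validating the LLM's structured output before handing it to the controller is what keeps hallucinated skills or parameters from reaching the robot.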
The VLM-grounded replanning loop adapts to unexpected outcomes:
<latex>s_{t+1} \neq s_{expected} \implies \text{replan}(s_{t+1}, g)</latex>
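In continuous state spaces the inequality is checked against a tolerance rather than exactly. A sketch, assuming states are position vectors and a hypothetical threshold `tol` in metres:

```python
import numpy as np

def needs_replan(s_next: np.ndarray, s_expected: np.ndarray, tol: float = 0.02) -> bool:
    """Trigger replanning when the observed state deviates from the
    expected state by more than `tol` (assumed threshold, in metres)."""
    return bool(np.linalg.norm(s_next - s_expected) > tol)
```

The threshold trades false replans (too tight) against missed failures (too loose) and would normally be tuned per skill.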
RoboClaw orchestrates multiple robots through LLM-mediated dialogue:
<latex>\text{Assignment}^* = \arg\min_{A} \sum_{i} C(r_i, \text{task}_{A(i)}) \quad \text{s.t.} \quad \text{capabilities}(r_i) \supseteq \text{req}(\text{task}_{A(i)})</latex>
where $C(r_i, \text{task})$ is the estimated cost of robot $r_i$ performing the task, and the assignment respects capability constraints.
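For small teams the constrained assignment can be solved by brute force over task permutations. This is a sketch of the optimization itself, not RoboClaw's LLM-mediated dialogue; all names (`assign_tasks`, the cost/capability dictionaries) are illustrative.

```python
from itertools import permutations

def assign_tasks(robots, tasks, cost, capabilities, requirements):
    """Exhaustive search for the minimum-cost robot->task assignment
    subject to capabilities(r_i) being a superset of req(task)."""
    best, best_cost = None, float("inf")
    for perm in permutations(tasks):
        # capability constraint: every robot must cover its task's requirements
        if not all(requirements[t] <= capabilities[r] for r, t in zip(robots, perm)):
            continue
        total = sum(cost[(r, t)] for r, t in zip(robots, perm))
        if total < best_cost:
            best, best_cost = dict(zip(robots, perm)), total
    return best, best_cost
```

Brute force is O(n!) and only viable for a handful of robots; larger fleets would use the Hungarian algorithm or, as in RoboClaw, negotiation through dialogue.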
A condensed implementation sketch of the closed-loop agent:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RobotState:
    joint_positions: np.ndarray
    gripper_open: bool
    ee_pose: np.ndarray  # end-effector position + orientation


@dataclass
class SceneObject:
    name: str
    bbox: np.ndarray
    position: np.ndarray
    graspable: bool


class RoboticManipAgent:
    def __init__(self, llm, vlm, robot_controller):
        self.llm = llm
        self.vlm = vlm
        self.robot = robot_controller

    def observe(self, camera_image: np.ndarray) -> list[SceneObject]:
        """Ground the camera image into a list of detected objects."""
        return self.vlm.detect_objects(camera_image)

    def plan(self, task: str, objects: list[SceneObject],
             robot_state: RobotState) -> list[dict]:
        """Prompt the LLM with the grounded scene to produce a step plan."""
        scene_desc = "\n".join(
            f"- {o.name} at {o.position.tolist()}, graspable={o.graspable}"
            for o in objects
        )
        plan = self.llm.generate(
            f"Task: {task}\n"
            f"Scene:\n{scene_desc}\n"
            f"Robot EE at: {robot_state.ee_pose.tolist()}\n"
            f"Plan sequential manipulation steps:"
        )
        return self.parse_plan(plan)

    def execute_with_feedback(self, task: str, max_retries: int = 3) -> bool:
        """Closed-loop execution: plan, act, verify each step via the VLM,
        and replan from fresh observations on failure."""
        for attempt in range(max_retries):
            image = self.robot.capture_image()
            objects = self.observe(image)
            plan = self.plan(task, objects, self.robot.get_state())
            for step in plan:
                code = self.generate_robot_code(step)
                self.robot.execute(code)
                new_image = self.robot.capture_image()
                new_objects = self.observe(new_image)
                if not self.verify_step(step, new_objects):
                    break  # step failed: re-observe and replan
            else:
                return True  # all steps verified
        return False

    def generate_robot_code(self, step: dict) -> str:
        return self.llm.generate(
            f"Generate ROS2 code for: {step}\n"
            f"Use MoveIt2 for motion planning."
        )
```
| System | Benchmark | Reported Result | Key Strength |
|---|---|---|---|
| ALRM | RLBench long-horizon | +20% success vs. CAP/VoxPoser | Zero-shot, no pre-trained policies |
| ALRM | Real-world Franka | Validated on unseen tasks | Sim-to-real transfer |
| ManiAgent | Manipulation benchmarks | High autonomy | Novel object generalization |
| RoboClaw | Multi-robot coordination | Emergent collaborative behaviors | Dynamic task reassignment |