====== Robotic Manipulation Agents ======
LLM-driven closed-loop robotic control systems use multi-agent architectures in which language models plan, generate executable code, and adapt via visual feedback, enabling zero-shot manipulation of novel objects in dynamic environments.((https://arxiv.org/abs/2601.19510))
===== Overview =====
Traditional robotic manipulation relies on task-specific policies trained through extensive demonstration or reinforcement learning. LLM-based agents bypass this by leveraging pre-trained language understanding for zero-shot or few-shot task execution. Three systems exemplify this approach: ALRM uses multi-agent LLMs for zero-shot manipulation with planner-coder-supervisor roles,((https://arxiv.org/abs/2601.19510)) ManiAgent employs agentic skill selection with VLM-grounded replanning,((https://arxiv.org/abs/2510.11660)) and RoboClaw introduces multi-robot coordination through LLM-orchestrated dialogue.((https://arxiv.org/abs/2603.11558))
===== Closed-Loop Control Architecture =====
All three systems share the same fundamental closed-loop pattern: observe, reason, act, and repeat:
$$\text{Loop}: s_t \xrightarrow{\text{VLM}} z_t \xrightarrow{\text{LLM}} a_t \xrightarrow{\text{Robot}} s_{t+1}$$
where $s_t$ is the sensory state (images, proprioception), $z_t$ is the language-grounded scene description, $a_t$ is the planned action, and $s_{t+1}$ is the resulting state. The loop continues until task completion or failure detection triggers replanning.
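In code, the loop reduces to a small generic driver. The five callables below stand in for the VLM, LLM, and robot components and are purely illustrative placeholders, not an API from any of the cited systems:

<code python>
def closed_loop(observe, describe, plan, act, done, max_steps=10):
    """Generic observe-reason-act loop.

    observe()     -> s_t   (sensory state)
    describe(s_t) -> z_t   (language-grounded scene description)
    plan(z_t)     -> a_t   (next action)
    act(a_t)      -> None  (changes the world; next observe() yields s_{t+1})
    done(s_t)     -> bool  (task-completion check)
    """
    for step in range(max_steps):
        s_t = observe()
        if done(s_t):
            return True, step
        z_t = describe(s_t)
        a_t = plan(z_t)
        act(a_t)
    return False, max_steps


# Toy world: nudge a 1-D "gripper" toward position 3.
state = {"x": 0}
ok, steps = closed_loop(
    observe=lambda: state["x"],
    describe=lambda s: f"gripper at {s}",
    plan=lambda z: +1,                                    # always step right
    act=lambda a: state.__setitem__("x", state["x"] + a),
    done=lambda s: s == 3,
)
</code>

The loop terminates either when `done` fires (task completion) or when the step budget is exhausted, which is where the failure-detection and replanning machinery of the systems below takes over.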
===== ALRM: Multi-Agent Zero-Shot Manipulation =====
ALRM (Agent-based LLM Robotic Manipulation) deploys a multi-agent architecture:((https://arxiv.org/abs/2601.19510))
* **Planner Agent**: Decomposes high-level tasks into sub-tasks via prompting (e.g., "stack blocks" becomes a sequence of pick, place, and verify operations)
* **Coder Agent**: Translates sub-tasks into robot-executable code (e.g., PyBullet or ROS commands)
* **Supervisor Agent**: Monitors execution, processes VLM feedback on state changes, and triggers replanning on failure
**Zero-shot capability**: No pre-trained manipulation policies are required. The LLM's in-context learning enables immediate deployment on unseen tasks.
**Failure recovery**: After each action, VLMs detect the state (e.g., bounding boxes, object positions); the supervisor integrates feedback (e.g., "grasp failed") to prompt re-execution.
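The planner-coder-supervisor handoff can be sketched as plain functions. In ALRM each role is backed by an LLM prompt, so the hard-coded decomposition and recovery rules below are hypothetical stand-ins for LLM outputs:

<code python>
from dataclasses import dataclass


@dataclass
class SubTask:
    action: str
    target: str


def planner(task: str) -> list[SubTask]:
    # Hypothetical decomposition; the real planner prompts an LLM.
    if task == "stack red on blue":
        return [SubTask("pick", "red"),
                SubTask("place", "blue"),
                SubTask("verify", "stack")]
    return []


def coder(sub: SubTask) -> str:
    # Emits robot-executable code for one sub-task (placeholder template).
    return f"robot.{sub.action}('{sub.target}')"


def supervisor(sub: SubTask, vlm_feedback: str) -> str:
    # Maps VLM feedback to 'ok', 'retry' (re-execute code), or
    # 'replan' (hand the task back to the planner).
    if "failed" in vlm_feedback:
        return "retry" if sub.action == "pick" else "replan"
    return "ok"
</code>

The design choice mirrored here is the routing on failure: a failed grasp can simply be retried, while a failed placement may invalidate downstream sub-tasks and so goes back to the planner.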
===== ManiAgent: Agentic Skill Selection =====
ManiAgent uses an LLM to dynamically select and parameterize manipulation primitives:((https://arxiv.org/abs/2510.11660))
$$a_t = \text{LLM}(\text{skill\_library}, s_t, g) = (\text{skill}_k, \theta_k)$$
where $\text{skill}_k$ is selected from a library of primitives (grasp, place, push, rotate) and $\theta_k$ are the skill parameters (target pose, force, speed).
The VLM-grounded replanning loop adapts to unexpected outcomes:
$$s_{t+1} \neq s_{\text{expected}} \implies \text{replan}(s_{t+1}, g)$$
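A toy sketch of skill selection with replanning, assuming a 1-D state and a hand-written `select_skill` in place of the LLM; the skill library and disturbance model are illustrative only:

<code python>
SKILL_LIBRARY = {
    # name -> skill(state, params) -> next_state; toy 1-D primitives
    "push":  lambda s, p: s + p["distance"],
    "grasp": lambda s, p: s,
}


def select_skill(state: float, goal: float):
    """Stand-in for the LLM's (skill_k, theta_k) choice: push by the gap."""
    return "push", {"distance": goal - state}


def step_with_replanning(state: float, goal: float,
                         noise: float = 0.0, tol: float = 1e-6) -> float:
    """Execute the selected skill; replan once if the outcome deviates."""
    skill, theta = select_skill(state, goal)
    expected = SKILL_LIBRARY[skill](state, theta)
    actual = expected + noise           # simulated disturbance
    if abs(actual - expected) > tol:    # s_{t+1} != s_expected -> replan
        skill, theta = select_skill(actual, goal)
        actual = SKILL_LIBRARY[skill](actual, theta)
    return actual
</code>

With no disturbance the first skill call reaches the goal; with a disturbance, the replanning branch re-parameterizes the skill from the observed state and still converges.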
===== RoboClaw: Multi-Robot Coordination =====
RoboClaw orchestrates multiple robots through LLM-mediated dialogue:((https://arxiv.org/abs/2603.11558))
* **Task Allocator**: LLM distributes sub-tasks to available robots based on capabilities
* **Dialogue Protocol**: Robots communicate shared observations via structured messages
* **Collective Reasoning**: Group deliberation on shared state enables dynamic task reassignment
$$\text{Assignment}^* = \arg\min_{A} \sum_{i} C(r_i, \text{task}_{A(i)}) \quad \text{s.t.} \quad \text{capabilities}(r_i) \supseteq \text{req}(\text{task}_{A(i)})$$
where $C(r_i, \text{task})$ is the estimated cost of robot $r_i$ performing the task, and the assignment respects capability constraints.
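The constrained assignment can be illustrated with a brute-force search over robot-task pairings. The `assign` helper, robot names, and cost function below are hypothetical; exhaustive search is only a sketch, and RoboClaw delegates the choice to LLM-mediated dialogue rather than a solver:

<code python>
from itertools import permutations


def assign(robots, tasks, cost):
    """Brute-force the assignment argmin subject to capability constraints.

    robots: {name: set of capabilities}
    tasks:  {name: set of required capabilities}
    cost:   cost(robot_name, task_name) -> float
    Returns (best_assignment, best_total_cost), or (None, inf) if infeasible.
    """
    robot_names, task_names = list(robots), list(tasks)
    best, best_cost = None, float("inf")
    for perm in permutations(robot_names, len(task_names)):
        # capabilities(r) must be a superset of req(task)
        if any(not tasks[t] <= robots[r] for r, t in zip(perm, task_names)):
            continue
        total = sum(cost(r, t) for r, t in zip(perm, task_names))
        if total < best_cost:
            best, best_cost = dict(zip(task_names, perm)), total
    return best, best_cost


robots = {"A": {"grasp", "lift"}, "B": {"grasp"}}
tasks = {"move_box": {"grasp", "lift"}, "hold": {"grasp"}}
best, total = assign(robots, tasks, lambda r, t: {"A": 1, "B": 2}[r])
</code>

Here only robot A can lift, so the capability constraint forces `move_box` onto A even though cheaper pairings exist in the unconstrained problem.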
===== Code Example =====
<code python>
from dataclasses import dataclass

import numpy as np


@dataclass
class RobotState:
    joint_positions: np.ndarray
    gripper_open: bool
    ee_pose: np.ndarray  # end-effector position + orientation


@dataclass
class SceneObject:
    name: str
    bbox: np.ndarray
    position: np.ndarray
    graspable: bool


class RoboticManipAgent:
    """Closed-loop agent; parse_plan and verify_step are left abstract."""

    def __init__(self, llm, vlm, robot_controller):
        self.llm = llm
        self.vlm = vlm
        self.robot = robot_controller

    def observe(self, camera_image: np.ndarray) -> list[SceneObject]:
        # VLM grounds the camera image into a structured object list
        return self.vlm.detect_objects(camera_image)

    def plan(self, task: str, objects: list[SceneObject],
             robot_state: RobotState) -> list[dict]:
        scene_desc = "\n".join(
            f"- {o.name} at {o.position.tolist()}, "
            f"graspable={o.graspable}" for o in objects
        )
        plan = self.llm.generate(
            f"Task: {task}\n"
            f"Scene:\n{scene_desc}\n"
            f"Robot EE at: {robot_state.ee_pose.tolist()}\n"
            f"Plan sequential manipulation steps:"
        )
        return self.parse_plan(plan)

    def execute_with_feedback(self, task: str,
                              max_retries: int = 3) -> bool:
        for attempt in range(max_retries):
            image = self.robot.capture_image()
            objects = self.observe(image)
            plan = self.plan(task, objects, self.robot.get_state())
            for step in plan:
                code = self.generate_robot_code(step)
                self.robot.execute(code)
                new_image = self.robot.capture_image()
                new_objects = self.observe(new_image)
                if not self.verify_step(step, new_objects):
                    break  # failed step: re-observe and replan
            else:
                return True  # every step verified
        return False

    def generate_robot_code(self, step: dict) -> str:
        return self.llm.generate(
            f"Generate ROS2 code for: {step}\n"
            f"Use MoveIt2 for motion planning."
        )
</code>
===== Architecture =====
<code>
graph TD
    A[Task Instruction] --> B[Planner Agent]
    B --> C[Sub-Task Sequence]
    C --> D[Coder Agent]
    D --> E[Robot-Executable Code]
    E --> F[Robot Controller]
    F --> G[Action Execution]
    G --> H[Camera Observation]
    H --> I[VLM Scene Analysis]
    I --> J[Supervisor Agent]
    J --> K{Step Successful?}
    K -->|Yes| L{Task Complete?}
    K -->|No| M[Failure Analysis]
    M --> N{Replanning Needed?}
    N -->|Code Fix| D
    N -->|Plan Change| B
    L -->|Yes| O[Task Success]
    L -->|No| C

    subgraph RC["Multi-Robot Coordination (RoboClaw)"]
        P[Task Allocator] --> Q[Robot A]
        P --> R[Robot B]
        Q <--> S[Dialogue Protocol]
        R <--> S
        S --> T[Collective Reasoning]
        T --> P
    end
</code>
===== Key Results =====
^ System ^ Benchmark ^ Reported Result ^ Key Strength ^
| ALRM | RLBench long-horizon | +20% success vs. CAP/VoxPoser | Zero-shot, no pre-trained policies |
| ALRM | Real-world Franka | Validated on unseen tasks | Sim-to-real transfer |
| ManiAgent | Manipulation benchmarks | High autonomy | Novel object generalization |
| RoboClaw | Multi-robot coordination | Emergent collaborative behaviors | Dynamic task reassignment |
===== See Also =====
* [[game_playing_agents|Game Playing Agents]]
* [[image_editing_agents|Image Editing Agents]]
* [[devops_incident_agents|DevOps Incident Agents]]
===== References =====