Research Project · Embodied AI

A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

Australian Institute for Machine Learning (AIML) · The University of Adelaide

Execution-grounded runtime monitoring for robust real-world grasping.

Overview

Execution-grounded decision-making for real-world manipulation

Instead of treating grasp execution as a one-shot black box, this system exposes runtime outcomes as explicit states. A lightweight Watchdog monitors execution, surfaces events like SUCCESS or EMPTY, and enables a bounded policy to finalize, retry, or ask for clarification.

01

Explicit execution-state monitoring

Transforms noisy physical feedback into discrete, decision-ready states for the agent loop.

02

Bounded recovery without retraining

Wraps the learned manipulation primitive instead of changing the underlying grasp model.

03

Robust under ambiguity and distractors

Maintains target consistency across clutter, visually similar targets, and induced empty-grasp scenarios.

Loop Teaser

Observe → Act → Evaluate → Decide

A compact summary of the physical agentic loop and its bounded recovery logic.

Physical agentic loop teaser

Method Overview

Agent-centric architecture

Structured goals, perception conditioning, outcome-aware execution, and a bounded decision policy are organized into a single physical agentic loop.

System architecture diagram

Core Loop

Observe → Act → Evaluate → Decide

01

Observe

Receive the structured task goal and the current RGB-D scene state.

02

Act

Execute the unmodified manipulation primitive on the selected target.

03

Evaluate

Infer discrete outcomes from gripper telemetry and execution traces.

04

Decide

Finalize, retry once, or escalate through clarification when uncertainty persists.
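The four stages above can be sketched as one bounded control loop. This is a minimal illustration, not the project's actual interface: `observe`, `act`, `evaluate`, `decide`, and `MAX_RETRIES` are all hypothetical names.

```python
# Sketch of the Observe → Act → Evaluate → Decide loop.
# All names here are illustrative stand-ins, not the project's real API.

MAX_RETRIES = 1  # bounded recovery: at most one retry before escalation


def agentic_loop(goal, observe, act, evaluate, decide):
    """Run the loop until the decision policy finalizes or escalates."""
    retries = 0
    while True:
        scene = observe(goal)        # structured goal + RGB-D scene state
        trace = act(goal, scene)     # unmodified manipulation primitive
        outcome = evaluate(trace)    # discrete outcome from telemetry
        decision = decide(outcome, retries, MAX_RETRIES)
        if decision == "RETRY":
            retries += 1
            continue
        return decision              # "FINALIZE" or "WAIT_CLARIFY"
```

Because the grasp primitive is called unmodified inside `act`, the loop wraps the learned model rather than changing it, matching the page's "bounded recovery without retraining" framing.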

Watchdog Runtime States

SUCCESS · EMPTY · WEAK · SLIP · STALL · TIMEOUT

Decision → FINALIZE / RETRY / WAIT_CLARIFY
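One plausible mapping from the six runtime states to the three decisions can be sketched as below. Only SUCCESS → FINALIZE and the single-retry budget follow from the page; treating every non-success outcome as retryable is an assumption for illustration.

```python
from enum import Enum, auto


class Outcome(Enum):
    """Watchdog runtime states surfaced during execution."""
    SUCCESS = auto()
    EMPTY = auto()
    WEAK = auto()
    SLIP = auto()
    STALL = auto()
    TIMEOUT = auto()


class Decision(Enum):
    FINALIZE = auto()
    RETRY = auto()
    WAIT_CLARIFY = auto()


def decision_policy(outcome: Outcome, retries_used: int,
                    retry_budget: int = 1) -> Decision:
    """Illustrative bounded policy: one retry, then escalate."""
    if outcome is Outcome.SUCCESS:
        return Decision.FINALIZE
    # Assumption: all non-success states consume the retry budget
    # before escalating to clarification.
    if retries_used < retry_budget:
        return Decision.RETRY
    return Decision.WAIT_CLARIFY
```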
Recovery Example

Outcome-driven recovery timeline

A recoverable empty grasp triggers a single bounded retry before escalation.
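That timeline can be traced in a few lines, assuming the single-retry budget described above; the `recover` helper and its control flow are illustrative, not the system's actual recovery code.

```python
# Illustrative trace of the bounded-recovery timeline: an empty grasp
# triggers exactly one retry; a second failure escalates to clarification.
# State names mirror the page; the control flow is an assumption.


def recover(outcomes, retry_budget=1):
    """Walk a sequence of grasp outcomes and return the decision log."""
    log, retries = [], 0
    for outcome in outcomes:
        if outcome == "SUCCESS":
            log.append((outcome, "FINALIZE"))
            break
        if retries < retry_budget:
            retries += 1
            log.append((outcome, "RETRY"))
        else:
            log.append((outcome, "WAIT_CLARIFY"))
            break
    return log
```

For example, `recover(["EMPTY", "SUCCESS"])` logs a RETRY followed by a FINALIZE, while `recover(["EMPTY", "EMPTY"])` retries once and then escalates to WAIT_CLARIFY.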

Auto retry recovery timeline

Representative Workflows

Real-world behavior traces

From distractor-heavy scenes to color and spatial ambiguity, the system keeps the target grounded while adapting to execution outcomes.

Representative workflow figure

Selecting the target-colored cup from two differently colored cups

Color-conditioned grounding for choosing the intended cup among visually distinct candidates.

Distractor-aware object selection with a nearby non-target object

Maintains the intended target despite a salient distractor placed next to the workspace object.

Spatial ambiguity across similar cups

Grounds the requested target among visually similar cups under spatial ambiguity, using the bounded decision policy.

Toy grasping under distractor presence

Selects the intended toy while ignoring the nearby cup and preserving semantic target consistency.

Citation

BibTeX

@article{wang2026physicalagenticloop,
  title   = {A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring},
  author  = {Wang, Wenze and Hosseinzadeh, Mehdi and Dayoub, Feras},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026},
  note    = {Preprint under review; update identifier after announcement}
}