To solve robot manipulation tasks in real-world environments, CLIPU²Net is first employed to segment the regions most relevant to the target specified by the referring language. Geometric constraints are then applied to the segmented region, generating context-relevant motions for uncalibrated image-based visual servoing (UIBVS) control.


Abstract

In this paper, we perform robot manipulation tasks in real-world environments under language contexts by integrating a compact referring image segmentation model into the robot's perception module. First, we propose CLIPU²Net, a lightweight referring image segmentation model designed for fine-grained boundary and structure segmentation from language expressions. Then, we deploy the model in an eye-in-hand visual servoing system to enact robot control in the real world. The key to our system is the representation of salient visual information as geometric constraints, linking the robot's visual perception to actionable commands. Experimental results on 46 real-world robot manipulation tasks demonstrate that our method outperforms traditional visual servoing methods that rely on labor-intensive feature annotations, excels in fine-grained referring image segmentation with a compact decoder size of 6.6 MB, and supports robot control across diverse contexts.



Referring Image Segmentation

Taking advantage of CLIP and U²Net, CLIPU²Net is capable of generating smoother saliency probability maps from referring language expressions.


In the encoder, CLIP embedding features are correlated through a series of Masked Multimodal Transformers, producing salient features governed by the language expression. In the decoder, a lightweight U²Net extracts multi-scale information, yielding finer structures and details in the prediction.
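For illustration, the sketch below outlines this encoder-decoder flow in PyTorch: CLIP-style visual and language tokens are fused by a cross-attention block standing in for the Masked Multimodal Transformers, and a small U-shaped decoder upsamples the fused features into a saliency probability map. The module names, feature sizes, and layer counts are assumptions made for the sketch, not the released CLIPU²Net implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultimodalBlock(nn.Module):
    """Cross-attends visual tokens to language tokens (assumed stand-in for
    the paper's Masked Multimodal Transformer)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, vis_tokens, lang_tokens, lang_pad_mask=None):
        attn_out, _ = self.cross_attn(vis_tokens, lang_tokens, lang_tokens,
                                      key_padding_mask=lang_pad_mask)
        x = self.norm1(vis_tokens + attn_out)
        return self.norm2(x + self.ffn(x))

class TinyUDecoder(nn.Module):
    """Compact U-shaped decoder that upsamples language-conditioned features
    into a single-channel saliency probability map (stand-in for U²Net)."""
    def __init__(self, dim=512):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(dim, 128, 3, padding=1),
            nn.Conv2d(128, 64, 3, padding=1),
            nn.Conv2d(64, 32, 3, padding=1),
        ])
        self.head = nn.Conv2d(32, 1, 1)

    def forward(self, feat):
        x = feat
        for conv in self.stages:
            x = F.interpolate(F.relu(conv(x)), scale_factor=2,
                              mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(x))   # saliency probabilities in [0, 1]

# Example: CLIP-like tokens for a 14x14 visual grid and a 20-token expression.
vis = torch.randn(1, 14 * 14, 512)   # visual tokens from a CLIP image encoder
lang = torch.randn(1, 20, 512)       # token embeddings from the CLIP text encoder
fused = MaskedMultimodalBlock()(vis, lang)
fused_map = fused.transpose(1, 2).reshape(1, 512, 14, 14)
mask = TinyUDecoder()(fused_map)     # (1, 1, 112, 112) saliency map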



Specifying Geometric Constraints

Geometric constraints are used in uncalibrated visual servoing to define high-level actions from robot vision. Combinations of different constraints can be applied to maneuver the robot end-effector, realizing actions directly from visual inputs.
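As a rough sketch of how such constraints can be composed, the Python snippet below builds point-to-point and point-to-line errors from image features and stacks them into a single error vector for an uncalibrated controller. The Broyden-style Jacobian update, the gains, and the example pixel coordinates are illustrative assumptions rather than the paper's exact control law.

import numpy as np

def point_to_point(p, p_star):
    """Error driving image point p onto target point p_star (2-D residual)."""
    return p - p_star

def point_to_line(p, l1, l2):
    """Signed distance of point p from the image line through l1 and l2."""
    d = l2 - l1
    n = np.array([-d[1], d[0]]) / np.linalg.norm(d)   # unit normal of the line
    return np.array([n @ (p - l1)])

def broyden_update(J, dq, de, lam=1.0):
    """Rank-1 update of the estimated image Jacobian (uncalibrated servoing)."""
    return J + lam * np.outer(de - J @ dq, dq) / (dq @ dq)

# Example: align the gripper point with a handle point and the handle's axis.
gripper = np.array([320.0, 240.0])                    # gripper tip in the image
handle = np.array([400.0, 210.0])                     # salient point on the handle
axis_a, axis_b = np.array([380.0, 200.0]), np.array([420.0, 220.0])

error = np.concatenate([point_to_point(gripper, handle),
                        point_to_line(gripper, axis_a, axis_b)])
J = 0.1 * np.random.randn(error.size, 6)              # initial Jacobian estimate
dq = -0.5 * np.linalg.pinv(J) @ error                 # proportional servoing step
J = broyden_update(J, dq, -0.5 * error)               # refine J from the observed change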



Tasks

By leveraging CLIPU²Net and geometric constraints, our system translates salient features of points and lines into actionable motions.
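A small example of this translation step, under the assumption that the salient point is taken as the mask centroid and the salient line as the region's principal axis (the paper's exact feature extraction may differ):

import numpy as np

def mask_to_point_and_line(mask, thresh=0.5):
    """Return (centroid, axis direction) of the salient region in pixel coords."""
    ys, xs = np.nonzero(mask > thresh)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)
    # Principal direction of the region approximates a line such as a handle axis.
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    return centroid, vt[0]

# Example with a synthetic elongated blob standing in for a segmented handle.
mask = np.zeros((240, 320))
mask[118:122, 100:200] = 1.0
point, axis = mask_to_point_and_line(mask)   # point ~ (149.5, 119.5), axis ~ (1, 0)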

"Grasp the door handle and open the closet door."

(Subtask 1: Align the gripper)

"Pick up the coke can on the table, and place it into a wooden basket."

(Subtask 1: Pick up the coke can)

"Grasp the handle of the spoon."

"Grasp the door handle and open the closet door."

(Subtask 2: Reach and grasp the door handle)

"Pick up the coke can on the table, and place it into a wooden basket."

(Subtask 2: Place the coke can into the basket)

"Reach and grasp the alcohol spray bottle."

"Pick up the eyeglasses on the table, then place it into a box."

(Subtask 1: Pick up eyeglasses)

"Pick up the teabag and put it into the region with an affordance to contain."

(Subtask 1: Pick up teabag)

"Reach and grasp the fruit on the table."

"Pick up the eyeglasses on the table, then place it into a box."

(Subtask 2: Place the eyeglasses into the box)

"Pick up the teabag and put it into the region with an affordance to contain."

(Subtask 2: Carry the teabag into the region with an affordance to contain)

"Grasp the knife's handle."







BibTeX

@article{jiang2024attnibvs,
  title={Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints},
  author={Chen Jiang and Allie Luo and Martin Jagersand},
  journal={arXiv preprint arXiv:2409.11518},
  year={2024}
}