TL;DR This work introduces Ego3D Position Encoding, which injects 3D information into the input observations of the visual-language-action model, and proposes Adaptive Action Grids, which represent spatial robot movement actions as adaptively discretized action grids. Together they facilitate learning generalizable and transferable spatial action knowledge for cross-robot control.
