TL;DR: A flow-matching architecture built on a pre-trained vision-language model (VLM) for zero-shot task execution, following language instructions, and acquiring new skills via fine-tuning.

Appeared in surveys