arXiv Weekly Digest — Week 08, 2026

Fetched: 2026-02-15 | Categories: cs.RO, cs.LG, cs.HC, cs.CV | Papers: 30


WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Authors: Zhennan Jiang, Shangqing Zhou, Yutong Jiang, et al. | Submitted: 2026-02-15 | arXiv: 2602.13977 | Categories: cs.RO, cs.AI

Research Background: Reinforcement learning promises to unlock capabilities beyond imitation learning for VLA models, but deploying RL directly on physical robots is impractical due to the massive interaction data requirement. Learned world models offer a simulation shortcut, but hallucination and long-horizon error accumulation corrupt the optimization signal.

Technical Approach: WoVR proposes a reliable world-model-based RL framework that explicitly regulates policy-world model interaction. It uses a controllable action-conditioned video world model for rollout stability, introduces Keyframe-Initialized Rollouts to reduce effective error depth, and performs World Model-Policy co-evolution to maintain alignment. Rather than assuming a faithful world model, WoVR manages its imperfections directly.
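
A toy sketch of why starting rollouts from keyframes shortens the effective error depth. It is not WoVR's implementation: the function names, the keyframe positions, and the assumption that world-model error compounds at a fixed rate per imagined step are all hypothetical and chosen only for illustration.

```python
def effective_depth(target_step, keyframes):
    """Imagined steps needed to reach target_step from the latest keyframe at or before it."""
    start = max(k for k in keyframes if k <= target_step)
    return target_step - start


def accumulated_error(depth, per_step_error=0.02):
    # Toy assumption: error compounds multiplicatively with every imagined step.
    return (1 + per_step_error) ** depth - 1


# Rolling out from the episode start versus from stored real keyframes.
from_start = effective_depth(120, keyframes=[0])
from_keyframe = effective_depth(120, keyframes=[0, 40, 80, 110])

print(from_start, from_keyframe)
print(accumulated_error(from_start), accumulated_error(from_keyframe))
```

Under this toy model, reaching step 120 takes 120 imagined steps from the start but only 10 from the nearest keyframe, so the compounded error is far smaller.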

Key Takeaway: WoVR improves average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points), showing that learned world models can serve as practical RL simulators when hallucination is explicitly controlled.

研究背景: 強化學習有望突破 VLA 模型僅靠模仿學習的上限,但直接在實體機器人上部署 RL 需要大量互動資料,難以實踐。以學習到的世界模型作為模擬器雖是捷徑,但幻覺與長視野誤差累積會污染優化訊號。

技術方法: WoVR 提出一個可靠的基於世界模型的 RL 框架,直接調控策略與世界模型間的互動。它採用可控動作條件化影片世界模型來穩定展開,引入關鍵幀初始化展開(Keyframe-Initialized Rollouts)縮短有效誤差深度,並透過世界模型-策略共同演化維持對齊。與其假設世界模型忠實,WoVR 直接管理其不完美。

核心發現: WoVR 將 LIBERO 平均成功率從 39.95% 提升至 69.2%(+29.3 分),真實機器人成功率從 61.7% 提升至 91.7%(+30.0 分),證明當幻覺受到明確控制時,世界模型可作為實用的 RL 模擬器。


HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models

Authors: Xin Yan, Zhenglin Wan, Feiyang Ye, et al. | Submitted: 2026-02-14 | arXiv: 2602.13710 | Categories: cs.LG

Research Background: VLA models enable instruction-following embodied control, but their large compute and memory footprints make deployment on resource-constrained edge robots impractical. Binarization to 1-bit precision can dramatically improve efficiency, but existing methods fail to account for the distribution gap between binarized and full-precision weights.

Technical Approach: HBVLA introduces a VLA-tailored binarization framework. It uses a policy-aware enhanced Hessian to identify weights critical for action generation, applies a sparse orthogonal transform to induce low-entropy states in non-salient weights, and performs group-wise 1-bit quantization in the Haar domain. This targets quantization error at the weights that most affect closed-loop action quality.
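
A minimal sketch of group-wise 1-bit quantization in a Haar-transformed domain. It covers only the generic idea: one Haar level, then sign plus one scale per group. The Hessian-based saliency and the sparse orthogonal transform are omitted, and the scale rule (mean absolute value per group) is an assumption borrowed from standard binarization practice, not taken from the paper.

```python
import math


def haar_step(w):
    """One level of the orthonormal Haar transform on an even-length list."""
    s = 1 / math.sqrt(2)
    avg = [(w[i] + w[i + 1]) * s for i in range(0, len(w), 2)]
    diff = [(w[i] - w[i + 1]) * s for i in range(0, len(w), 2)]
    return avg + diff


def inverse_haar_step(c):
    """Invert haar_step: rebuild each pair from its average and difference."""
    s = 1 / math.sqrt(2)
    half = len(c) // 2
    out = []
    for a, d in zip(c[:half], c[half:]):
        out += [(a + d) * s, (a - d) * s]
    return out


def binarize_groups(c, group_size):
    """Replace each group by sign(x) * mean(|x|): 1 bit per value plus one scale per group."""
    out = []
    for g in range(0, len(c), group_size):
        group = c[g:g + group_size]
        alpha = sum(abs(x) for x in group) / len(group)
        out += [alpha if x >= 0 else -alpha for x in group]
    return out


weights = [0.9, 1.1, -0.4, -0.6, 0.2, 0.3, -1.0, -0.8]
coeffs = haar_step(weights)
approx = inverse_haar_step(binarize_groups(coeffs, group_size=4))
print(approx)
```

Grouping after the transform keeps the averages and the differences in separate groups here, so each group gets a scale suited to its own magnitude range.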

Key Takeaway: HBVLA retains 92.2% of full-precision performance on LIBERO (quantized OpenVLA-OFT) and 93.6% on SimplerEnv (CogAct), significantly outperforming state-of-the-art binarization methods and enabling practical ultra-low-bit VLA deployment on hardware-limited platforms.

研究背景: VLA 模型的龐大運算與記憶體需求阻礙了在資源受限邊緣機器人上的部署。雖然 1-bit 二值化能大幅提升效率,但現有方法無法彌補二值化與全精度權重間的分佈差距。

技術方法: HBVLA 提出 VLA 專用二值化框架,使用策略感知的增強 Hessian 找出對動作生成關鍵的權重,對非關鍵權重施加稀疏正交變換以產生低熵中間態,再以 Haar 域逐群組 1-bit 量化。這讓量化誤差集中在最影響閉迴路動作品質的權重上。

核心發現: HBVLA 在 LIBERO 保留 92.2% 全精度效能,在 SimplerEnv 保留 93.6%,顯著優於現有二值化方法,使硬體受限平台上的超低位元 VLA 部署成為可能。


AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge

Authors: Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine | Submitted: 2026-02-13 | arXiv: 2602.13476 | Categories: cs.RO, cs.LG

Research Background: Robotic foundation models achieve strong generalization via internet-scale representations, but their massive computational cost creates prohibitive inference latency. In dynamic environments, this latency breaks the control loop and renders powerful models unsafe for real-time deployment.

Technical Approach: AsyncVLA decouples semantic reasoning from reactive execution through a hierarchical asynchronous design. A large foundation model runs on a remote workstation for high-level guidance, while a lightweight onboard Edge Adapter continuously refines actions at high frequency. An end-to-end finetuning protocol and trajectory re-weighting strategy that prioritizes dynamic interactions bridge the domain gap between these asynchronous streams.

Key Takeaway: AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines in real-world navigation tasks with communication delays up to 6 seconds, demonstrating that semantic intelligence and reactive execution can be effectively decoupled for edge robotics.

研究背景: 機器人基礎模型透過網路規模表示達到強大泛化能力,但龐大運算成本造成無法接受的推論延遲,在動態環境下使強大模型難以安全即時部署。

技術方法: AsyncVLA 透過階層式非同步設計將語意推理與反應性執行解耦。大型基礎模型在遠端工作站提供高層指引,輕量板載 Edge Adapter 持續以高頻精煉動作。端對端微調協定與優先處理動態互動的軌跡重新加權策略填補了兩者間的領域差距。

核心發現: AsyncVLA 在通訊延遲達 6 秒的真實導航任務中,比最先進基線高出 40% 的成功率,證明語意智慧與反應性執行可有效解耦以實現邊緣機器人應用。


Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Authors: Rui Cai, Jun Guo, Xinze He, et al. | Submitted: 2026-02-13 | arXiv: 2602.12684 | Categories: cs.RO, cs.LG

Research Background: VLA models have shown strong manipulation capabilities, but practical deployment requires real-time execution without sacrificing generalization. Most existing models struggle with inference latency during real-robot rollouts, particularly for dexterous bimanual tasks.

Technical Approach: Xiaomi-Robotics-0 combines large-scale cross-embodiment pretraining on robot trajectories and vision-language data with an asynchronous execution training recipe. Specific techniques address inference latency during deployment, and action chunk timestep alignment ensures continuous, seamless real-time rollouts. The model is optimized to run on a consumer-grade GPU while maintaining high throughput.
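
One plausible reading of action chunk timestep alignment, sketched under stated assumptions: a chunk computed from an observation at `obs_step` arrives a few control steps later, so the actions whose intended timestep has already passed are dropped. The function name and the discrete-step model are hypothetical and are not taken from the Xiaomi-Robotics-0 release.

```python
def align_chunk(chunk, obs_step, now_step):
    """Keep only the actions of a chunk that are still in the future at now_step."""
    elapsed = max(0, now_step - obs_step)
    return chunk[elapsed:]


# The chunk was planned for steps 10..13, but inference finished at step 12.
chunk = ["a10", "a11", "a12", "a13"]
print(align_chunk(chunk, obs_step=10, now_step=12))
```

Executing the chunk from the aligned index keeps the motion continuous with whatever the previous chunk was doing during inference.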

Key Takeaway: Xiaomi-Robotics-0 achieves state-of-the-art performance across simulation benchmarks and high success rates on challenging bimanual manipulation tasks in the real world, with all code and model checkpoints publicly released to advance the field.

研究背景: VLA 模型展現出強大的操作能力,但實際部署需要在不犧牲泛化的前提下實現即時執行,大多數現有模型在真實機器人展開時受推論延遲所困,特別是對於靈巧雙臂任務。

技術方法: Xiaomi-Robotics-0 結合大規模跨機器人軌跡與視覺語言資料的預訓練,以及非同步執行訓練配方。特定技術解決部署時的推論延遲問題,動作塊時間步對齊確保連續無縫的即時展開,且模型可在消費級 GPU 上高吞吐量運行。

核心發現: Xiaomi-Robotics-0 在模擬基準測試中達到最先進效能,並在真實世界困難雙臂操作任務上取得高成功率,所有程式碼與模型檢查點公開發布。


JEPA-VLA: Video Predictive Embedding is Needed for VLA Models

Authors: Shangchen Miao, Ningya Feng, Jialong Wu, et al. | Submitted: 2026-02-12 | arXiv: 2602.11832 | Categories: cs.CV, cs.RO

Research Background: VLA models built on pretrained VLMs still suffer from low sample efficiency and limited generalization. Current visual representations — whether from language-image contrastive learning or image-based self-supervised learning — fail to capture task-relevant dynamics and anticipatory policy priors critical for manipulation.

Technical Approach: JEPA-VLA argues that video predictive embeddings, specifically V-JEPA 2, are uniquely suited to VLAs because they discard unpredictable environment factors while encoding task-relevant temporal dynamics. The approach adaptively integrates these predictive embeddings into existing VLAs, providing embodied anticipatory knowledge about how the environment evolves under successful task execution.

Key Takeaway: JEPA-VLA yields substantial performance gains across LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks, demonstrating that the choice of visual representation backbone is a critical and often overlooked factor in VLA performance.

研究背景: 基於預訓練 VLM 的 VLA 模型仍存在樣本效率低與泛化能力有限的問題。無論是來自語言-圖像對比學習還是圖像自監督學習的視覺表示,都無法捕捉操作關鍵的任務相關動態與預期策略先驗。

技術方法: JEPA-VLA 指出影片預測嵌入(尤其是 V-JEPA 2)特別適合 VLA,因為它能丟棄不可預測的環境因素,同時編碼任務相關的時序動態。該方法將這些預測性嵌入自適應地整合到現有 VLA 中,提供機器人關於成功執行任務時環境如何演化的預期性知識。

核心發現: JEPA-VLA 在 LIBERO、LIBERO-plus、RoboTwin2.0 和真實機器人任務上取得顯著效能提升,說明視覺表示主幹的選擇是 VLA 效能中一個關鍵且常被忽視的因素。


VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, et al. | Submitted: 2026-02-10 | arXiv: 2602.10098 | Categories: cs.RO, cs.CV

Research Background: Pretraining VLA policies on internet-scale video is appealing, but existing latent-action objectives anchor too closely to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, camera motion, and information leakage between observation and prediction.

Technical Approach: VLA-JEPA introduces a JEPA-style pretraining framework where a target encoder produces latent representations from future frames while the student pathway sees only the current observation. By predicting in latent space rather than pixel space, the model learns dynamics abstractions robust to camera motion and irrelevant background changes. The two-stage recipe — JEPA pretraining then action-head finetuning — avoids the multi-stage complexity of prior latent-action pipelines.
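
A minimal sketch of the two ingredients of a generic JEPA-style objective: a loss computed in latent space against a target encoder's output, and a slowly updated target encoder. The exponential-moving-average update is an assumption drawn from common JEPA practice. The summary above does not say how VLA-JEPA updates its target encoder, and all names here are hypothetical.

```python
def latent_loss(predicted, target_latent):
    """Mean squared error in latent space; the target is treated as a constant."""
    return sum((p - z) ** 2 for p, z in zip(predicted, target_latent)) / len(predicted)


def ema_update(target_params, online_params, decay=0.99):
    """Move target-encoder parameters slowly toward the online encoder (assumed rule)."""
    return [decay * t + (1 - decay) * o for t, o in zip(target_params, online_params)]


# The student predicts the future latent from the current observation only.
predicted_future = [1.0, 2.0]
target_future = [1.0, 4.0]   # produced by the target encoder from future frames
print(latent_loss(predicted_future, target_future))
print(ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9))
```

Because the loss compares latents rather than pixels, appearance changes that the encoder discards do not contribute to the training signal.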

Key Takeaway: VLA-JEPA achieves consistent gains in generalization and robustness over existing methods on LIBERO, LIBERO-Plus, SimplerEnv, and real-world tasks, demonstrating that leakage-free latent prediction is a cleaner formulation for embodied world modeling.

研究背景: 在網路規模影片上預訓練 VLA 策略很有吸引力,但現有潛在動作目標過於依附像素變化而非動作相關的狀態轉換,使其容易受到外觀偏差、相機運動和觀測與預測間資訊洩漏的影響。

技術方法: VLA-JEPA 引入 JEPA 風格的預訓練框架,目標編碼器從未來幀產生潛在表示,而學生路徑僅看當前觀測。透過在潛在空間而非像素空間預測,模型學習對相機運動和無關背景變化具有魯棒性的動態抽象。兩階段配方——JEPA 預訓練後動作頭微調——避免了先前潛在動作管線的多階段複雜性。

核心發現: VLA-JEPA 在 LIBERO、LIBERO-Plus、SimplerEnv 和真實世界任務上取得一致的泛化和魯棒性提升,說明無洩漏潛在預測是具身世界建模的更簡潔表述。


Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

Authors: Liangzhi Shi, Shuaihang Chen, Feng Gao, et al. | Submitted: 2026-02-13 | arXiv: 2602.12628 | Categories: cs.RO

Research Background: Simulation offers a scalable way to enrich VLA training, but most sim-real co-training methods rely on supervised finetuning (SFT), treating simulation as a static demonstration source without exploiting closed-loop interaction. This limits real-world gains and generalization.

Technical Approach: The RL-Co framework uses interactive simulation while preserving real-world capabilities through a two-stage design: warm-start with SFT on a mixture of real and simulated demonstrations, then finetuning with RL in simulation combined with an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. The approach is evaluated across four real-world tabletop tasks with two VLA architectures (OpenVLA and π0.5).
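
The second stage can be pictured as a weighted sum of a simulation RL loss and a supervised loss on real demonstrations. This is a sketch under assumptions: the policy-gradient form, the negative log-likelihood anchor, and the `anchor_weight` value are illustrative choices, not details confirmed by the paper.

```python
def rl_loss(log_probs, advantages):
    """Policy-gradient surrogate on simulated rollouts (lower is better)."""
    return -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)


def sft_loss(demo_log_probs):
    """Negative log-likelihood of real-world demonstration actions."""
    return -sum(demo_log_probs) / len(demo_log_probs)


def stage_two_loss(sim_log_probs, sim_advantages, real_demo_log_probs, anchor_weight=0.5):
    # The supervised term anchors the policy to real data while RL runs in simulation.
    return rl_loss(sim_log_probs, sim_advantages) + anchor_weight * sft_loss(real_demo_log_probs)


print(stage_two_loss([-1.0, -2.0], [1.0, -1.0], [-0.5, -1.5]))
```

Raising `anchor_weight` trades simulated reward for fidelity to the real demonstrations, which is the lever against catastrophic forgetting described above.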

Key Takeaway: RL co-training yields +24% real-world success on OpenVLA and +20% on π0.5 compared to SFT-based co-training, while also improving generalization to unseen task variations and substantially reducing real-world data requirements.

研究背景: 模擬提供了豐富 VLA 訓練的可擴展方式,但大多數模擬-真實協同訓練方法依賴監督微調,將模擬視為靜態示範來源而未利用閉迴路互動,限制了真實世界的效能提升。

技術方法: RL-Co 框架透過兩階段設計在利用互動模擬的同時保留真實世界能力:先以真實與模擬示範混合的 SFT 熱啟動策略,再在模擬中以 RL 微調並加入真實資料的輔助監督損失以錨定策略、緩解災難性遺忘。在四個真實桌面操作任務上評估了兩種 VLA 架構(OpenVLA 與 π0.5)。

核心發現: RL 協同訓練相比 SFT 協同訓練在 OpenVLA 上提升 +24%、π0.5 上提升 +20% 的真實世界成功率,同時改善對未見任務變體的泛化能力,並大幅降低真實世界資料需求。


CRAFT: Adapting VLA Models to Contact-rich Manipulation via Force-aware Curriculum Fine-tuning

Authors: Yike Zhang, Yaonan Wang, Xinxin Sun, et al. | Submitted: 2026-02-13 | arXiv: 2602.12532 | Categories: cs.RO

Research Background: VLA models excel at general instruction following but struggle with contact-rich manipulation tasks — precise alignment, stable contact maintenance, and deformable object handling — where force feedback is critical but typically absent from VLA training.

Technical Approach: CRAFT introduces a force-aware curriculum finetuning framework that integrates a variational information bottleneck to regulate vision and language embeddings during early training, initially forcing the model to prioritize force signals before progressively restoring multimodal access. A homologous leader-follower teleoperation system collects synchronized vision, language, and force data across diverse contact-rich tasks to enable this training.
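
A hypothetical curriculum schedule for the bottleneck strength, to make the idea concrete: vision and language embeddings are strongly compressed at the start and progressively released. The linear shape and the endpoint values are assumptions, and the actual CRAFT schedule may differ.

```python
def bottleneck_weight(step, total_steps, beta_start=1.0, beta_end=0.0):
    """Linearly anneal the information-bottleneck weight over the curriculum."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + (beta_end - beta_start) * frac


for step in (0, 50, 100, 200):
    print(step, bottleneck_weight(step, total_steps=100))
```

A high weight early on restricts what vision and language can contribute, which pushes the policy to rely on force signals first.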

Key Takeaway: CRAFT consistently improves task success on contact-rich manipulation, generalizes to unseen objects and novel task variations, and adapts across diverse VLA architectures, establishing force awareness as a trainable and transferable capability.

研究背景: VLA 模型擅長通用指令跟隨,但在接觸豐富的操作任務(精確對齊、穩定接觸維持、可變形物體處理)上表現欠佳,這些任務中力回饋至關重要,但通常不在 VLA 訓練範疇內。

技術方法: CRAFT 引入力感知課程微調框架,整合變分資訊瓶頸以在早期訓練中調節視覺與語言嵌入,初始強迫模型優先關注力訊號,再逐步恢復多模態存取。同源主從式遙操作系統收集跨多樣接觸豐富任務的同步視覺、語言和力資料。

核心發現: CRAFT 持續改善接觸豐富操作的任務成功率,泛化到未見物體和新任務變體,並跨多種 VLA 架構自適應,確立了力感知作為可訓練且可遷移能力的地位。


LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation

Authors: Yue Hu, Avery Xi, Qixin Xiao, et al. | Submitted: 2026-02-12 | arXiv: 2602.12351 | Categories: cs.RO, cs.CV

Research Background: Long-horizon navigation with VLA models remains challenging because single-turn paradigms cannot reason about causal effects of historical interactions, and behavioral rigidity from human demonstrations limits adaptation to diverse environments.

Technical Approach: LongNav-R1 reformulates navigation as a continuous multi-turn conversation between the VLA policy and the environment, enabling online learning from diverse trajectory generation. Horizon-Adaptive Policy Optimization explicitly accounts for varying horizon lengths during advantage estimation, enabling accurate temporal credit assignment over extended sequences and preventing collapse during long-horizon tasks.

Key Takeaway: With only 4,000 rollout trajectories, LongNav-R1 boosts the Qwen3-VL-2B success rate from 64.3% to 73.0% on object navigation benchmarks, with validated zero-shot real-world performance in long-horizon navigation settings.

研究背景: VLA 模型的長視野導航仍具挑戰性,因為單輪範式無法推理歷史互動的因果效應,而來自人類示範的行為剛性限制了對多樣化環境的適應。

技術方法: LongNav-R1 將導航重新表述為 VLA 策略與環境之間的連續多輪對話,透過多樣化軌跡生成實現線上學習。視野自適應策略優化在優勢估計中明確考慮不同視野長度,對延伸序列實現準確的時序功勞分配,防止長視野任務中的崩潰。

核心發現: 僅使用 4,000 條展開軌跡,LongNav-R1 在物體導航基準上將 Qwen3-VL-2B 成功率從 64.3% 提升至 73.0%,並在長視野真實世界導航中驗證了零樣本效能。


Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, et al. | Submitted: 2026-02-12 | arXiv: 2602.12281 | Categories: cs.RO, cs.AI, eess.SY

Research Background: VLA models still generate actions that misalign with given instructions — the “intention-action gap” — and simply scaling policy pretraining has diminishing returns for this alignment problem.

Technical Approach: The paper investigates test-time verification as an alternative to pretraining scaling. CoVer, a contrastive verifier for vision-language-action alignment, scales gracefully with additional compute and data. CoVer-VLA, a hierarchical test-time verification pipeline, precomputes rephrased instructions from a VLM, generates multiple action candidates, and selects the optimal combination. Joint scaling of rephrased instructions and action candidates is shown to increase test-time sample diversity more efficiently than scaling each dimension independently.
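
The selection step of a test-time verification pipeline can be sketched as a search over (rephrased instruction, action candidate) pairs scored by a verifier. The verifier and the proposal function below are toy stand-ins with hypothetical names. They are not CoVer or a real VLA.

```python
def select_best(instructions, propose_actions, verifier_score):
    """Return the (score, instruction, action) triple with the highest verifier score."""
    best = None
    for instruction in instructions:
        for action in propose_actions(instruction):
            score = verifier_score(instruction, action)
            if best is None or score > best[0]:
                best = (score, instruction, action)
    return best


def toy_propose(instruction):
    # Stand-in for sampling several action candidates from a policy.
    return [0.1, 0.5, 0.9]


def toy_score(instruction, action):
    # Stand-in verifier: prefers actions near 0.5 and the "grasp" phrasing.
    bonus = 0.1 if "grasp" in instruction else 0.0
    return bonus - abs(action - 0.5)


best = select_best(["pick up the cup", "grasp the cup"], toy_propose, toy_score)
print(best)
```

Scaling either axis (more rephrasings or more action candidates) enlarges the set of pairs the verifier can choose from.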

Key Takeaway: CoVer-VLA yields 22% gains in-distribution and 13% out-of-distribution on SIMPLER, with 45% improvement in real-world experiments — demonstrating that at deployment time, scaling verification rather than training data can be a more compute-efficient path to VLA alignment.

研究背景: VLA 模型仍然生成與給定指令不符的動作——「意圖-動作差距」——而單純擴大策略預訓練對此對齊問題的邊際收益遞減。

技術方法: 本文研究測試時驗證作為預訓練擴展的替代方案。CoVer 是一個視覺語言動作對齊的對比驗證器,可隨額外運算和資料優雅擴展。CoVer-VLA 是一個階層式測試時驗證管線,預計算來自 VLM 的改寫指令,生成多個動作候選,並選擇最佳組合。

核心發現: CoVer-VLA 在 SIMPLER 上實現分佈內 22%、分佈外 13% 的提升,在真實世界實驗中提升 45%,證明在部署時擴展驗證而非訓練資料可能是更計算高效的 VLA 對齊路徑。


Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

Authors: William Chen, Jagdeep Singh Bhatia, Catherine Glossop, et al. | Submitted: 2026-02-13 | arXiv: 2602.13193 | Categories: cs.RO

Research Background: Hierarchical robot control typically uses VLMs to reason over language instructions passed to VLA models, but natural language as the interface between VLM and VLA fundamentally limits how much high-level reasoning can steer low-level behavior.

Technical Approach: Steerable Policies are VLAs trained on rich synthetic commands at multiple levels of abstraction — subtasks, motions, and grounded pixel coordinates — rather than just natural language. This richer controllability enables pretrained VLMs to effectively steer low-level actions via in-context reasoning. Both a learned high-level embodied reasoner and off-the-shelf VLMs are demonstrated as controllers.

Key Takeaway: Steerable Policies outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines across extensive real-world manipulation experiments, including on challenging generalization and long-horizon tasks, by expanding the interface between high-level reasoning and low-level control.

研究背景: 階層式機器人控制通常使用 VLM 對傳遞給 VLA 的語言指令進行推理,但自然語言作為 VLM 與 VLA 之間的介面根本限制了高層推理能影響低層行為的程度。

技術方法: Steerable Policies 是在多個抽象層級豐富合成命令上訓練的 VLA——子任務、運動和像素座標——而不僅僅是自然語言。這種更豐富的可控性使預訓練 VLM 能夠通過上下文推理有效引導低層動作。

核心發現: Steerable Policies 在廣泛的真實世界操作實驗中超越了先前的具身推理 VLA 和基於 VLM 的階層基線,包括在具有挑戰性的泛化和長視野任務上,通過擴展高層推理與低層控制之間的介面實現。


ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

Authors: Zhuoyang Zhang, Shang Yang, Qinghao Hu, et al. | Submitted: 2026-02-12 | arXiv: 2602.12322 | Categories: cs.RO, cs.AI

Research Background: VLA models converting language instructions to actions in open-world environments must handle both high-level semantic reasoning and low-level visuo-motor inference simultaneously, which creates competing objectives that limit accuracy and generalization.

Technical Approach: ForeAct introduces Visual Foresight Planning, a plug-in planner for VLAs that generates imagined future observations and subtask descriptions. A highly efficient foresight image generation module predicts a 640×480 future observation in 0.33s on an H100, while a VLM reasons over tasks and generates subtask descriptions for both the generator and VLA. State-of-the-art VLAs integrate this by augmenting their visual inputs — no architectural modification required. The generator is pretrained on over 1 million multi-task cross-embodiment episodes.

Key Takeaway: ForeAct achieves an average success rate of 87.4% on 11 diverse multi-step real-world tasks, an absolute improvement of 40.9 percentage points over the π0 baseline (46.5%), demonstrating that separating visual foresight from action execution dramatically improves open-world manipulation.

研究背景: VLA 模型在開放世界環境中將語言指令轉換為動作,必須同時處理高層語意推理與低層視覺運動推理,這兩個相互競爭的目標限制了準確性和泛化能力。

技術方法: ForeAct 引入視覺預見規劃,一個 VLA 的即插即用規劃器,生成想像的未來觀測和子任務描述。高效預見圖像生成模組在 H100 上 0.33 秒內預測 640×480 未來觀測,VLM 推理任務並生成子任務描述。現有 VLA 僅需增強視覺輸入即可整合,無需架構修改。生成器在超過 100 萬多任務跨機器人情節上預訓練。

核心發現: ForeAct 在 11 個多樣化多步驟真實世界任務上達到 87.4% 的平均成功率,比 π0 基線(46.5%)絕對提升 +40.9%,說明將視覺預見與動作執行分離可大幅改善開放世界操作。


Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

Authors: Ye Wang, Sipeng Zheng, Hao Luo, et al. | Submitted: 2026-02-10 | arXiv: 2602.09722 | Categories: cs.RO

Research Background: While scaling data is a dominant recipe for VLA improvement, it remains unclear whether this translates to robotics where training data is inherently heterogeneous across embodiments, sensors, and action spaces — raising the question of whether naive scaling can actually hurt performance.

Technical Approach: A systematic controlled study of VLA scaling using a VLM backbone with flow-matching ablates three dimensions: (1) physical alignment via a unified end-effector (EEF)-relative action representation; (2) embodiment mixture — testing whether pooling heterogeneous robot datasets helps or hurts; (3) training regularization strategies including sensory dropout and multi-stage finetuning. A Grouped Blind Ensemble evaluation protocol reduces experimenter bias.
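
To illustrate what an EEF-relative action representation buys, here is a planar sketch that expresses the commanded displacement in the end-effector's own frame. The 2D simplification and the function name are assumptions. The study itself works with full robot action spaces.

```python
import math


def eef_relative_action(current_xy, current_yaw, target_xy):
    """Express the displacement to target_xy in the end-effector frame (planar case)."""
    dx = target_xy[0] - current_xy[0]
    dy = target_xy[1] - current_xy[1]
    c, s = math.cos(current_yaw), math.sin(current_yaw)
    # Rotate the world-frame displacement by the inverse of the EEF orientation.
    return (c * dx + s * dy, -s * dx + c * dy)


# Two robots with different world poses performing the same "move 1 m forward".
robot_a = eef_relative_action((0.0, 0.0), 0.0, (1.0, 0.0))
robot_b = eef_relative_action((5.0, 5.0), math.pi / 2, (5.0, 6.0))
print(robot_a, robot_b)
```

Both robots produce (approximately) the same action vector, which is the property that lets heterogeneous datasets share one action space.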

Key Takeaway: EEF-relative action representation is critical for cross-embodiment transfer; naively pooling heterogeneous datasets often induces negative transfer; and common regularization strategies do not consistently improve performance — challenging several common assumptions about scaling embodied AI.

研究背景: 雖然擴展資料是 VLA 改進的主流方案,但目前仍不清楚這是否適用於機器人領域——因為訓練資料在機器人、感測器和動作空間方面本質上是異質的,引發了樸素擴展是否反而有害的疑問。

技術方法: 使用帶流匹配的 VLM 主幹進行系統性受控 VLA 擴展研究,消融三個維度:(1) 透過統一末端執行器(EEF)相對動作表示的物理對齊;(2) 機器人混合——測試是否匯集異質機器人資料集有益或有害;(3) 訓練正則化策略,包括感測器 Dropout 和多階段微調。分組盲評估協定減少實驗者偏差。

核心發現: EEF 相對動作表示對跨機器人遷移至關重要;樸素匯集異質資料集常導致負遷移;常見正則化策略不能一致改善效能——挑戰了幾個關於具身 AI 擴展的常見假設。


BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation

Authors: Yucheng Hu, Jianke Zhang, Yuanfei Luo, et al. | Submitted: 2026-02-10 | arXiv: 2602.09849 | Categories: cs.RO

Research Background: Long-horizon manipulation requires reasoning about tasks, foreseeing physical outcomes, and generating precise actions — capabilities that existing VLA models typically handle in isolation (either linguistic planning or visual forecasting, but rarely both simultaneously guiding action generation).

Technical Approach: BagelVLA integrates linguistic planning, visual forecasting, and action generation within a single framework initialized from a pretrained unified understanding and generative model. It is trained to interleave textual reasoning and visual prediction directly into the action execution loop. Residual Flow Guidance (RFG) efficiently couples these modalities by initializing from the current observation and leveraging single-step denoising to extract predictive visual features with minimal latency.

Key Takeaway: BagelVLA outperforms existing baselines by a significant margin on multiple simulated and real-world benchmarks, particularly in tasks requiring multi-stage reasoning, demonstrating the value of unified interleaved prediction-action generation.

研究背景: 長視野操作需要推理任務、預見物理結果並生成精確動作——現有 VLA 模型通常孤立地處理這些能力(語言規劃或視覺預測,但鮮少同時引導動作生成)。

技術方法: BagelVLA 在一個統一框架中整合語言規劃、視覺預測和動作生成,從預訓練的統一理解與生成模型初始化。訓練使其將文字推理和視覺預測直接交錯進入動作執行迴圈。殘差流引導(RFG)透過從當前觀測初始化並利用單步去噪提取預測性視覺特徵,以最低延遲高效耦合這些模態。

核心發現: BagelVLA 在多個模擬和真實世界基準上顯著超越現有基線,特別是在需要多階段推理的任務中,展示了統一交錯預測-動作生成的價值。


GRAIL: Goal Recognition Alignment through Imitation Learning

Authors: Osher Elhadad, Felipe Meneguzzi, Reuth Mirsky | Submitted: 2026-02-15 | arXiv: 2602.14252 | Categories: cs.AI, cs.LG, cs.RO

Research Background: Understanding an agent’s goals from its behavior is fundamental to AI alignment and human-robot teaming. Existing goal recognition methods rely on optimal goal-oriented policies, which diverge from real agents’ suboptimal, biased behavior, degrading recognition accuracy.

Technical Approach: GRAIL uses imitation learning and inverse reinforcement learning to learn one goal-directed policy per candidate goal directly from potentially suboptimal demonstration trajectories. Each learned policy captures behavioral biases and suboptimality, and a partial trajectory is scored against each in a single forward pass — retaining the one-shot inference efficiency of classical methods while using learned policies that model real behavior.
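
Scoring a partial trajectory against one learned policy per goal can be sketched as a log-likelihood comparison. The policies below are hand-written toy distributions with hypothetical goal names. GRAIL learns its policies from demonstrations, and its exact scoring rule may differ.

```python
import math


def recognize_goal(partial_trajectory, goal_policies):
    """Pick the goal whose policy assigns the highest log-likelihood to the observed actions."""
    scores = {}
    for goal, policy in goal_policies.items():
        scores[goal] = sum(
            math.log(policy(state).get(action, 1e-9))
            for state, action in partial_trajectory
        )
    return max(scores, key=scores.get), scores


# Toy per-goal policies: each maps a state to a distribution over actions.
goal_policies = {
    "kitchen": lambda state: {"left": 0.8, "right": 0.2},
    "door": lambda state: {"left": 0.3, "right": 0.7},
}
observed = [(0, "left"), (1, "left"), (2, "right")]
goal, scores = recognize_goal(observed, goal_policies)
print(goal, scores)
```

Each policy is evaluated once per observed step, so inference cost grows with the number of candidate goals rather than requiring any planning at recognition time.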

Key Takeaway: GRAIL increases the F1-score by more than 0.5 under systematically biased optimal behavior and achieves gains of 0.1-0.3 under suboptimal behavior, enabling scalable and robust goal recognition that accounts for real human behavior patterns.

研究背景: 從行為中理解智能體的目標是 AI 對齊和人機協作的基礎。現有目標識別方法依賴最優目標導向策略,這與真實智能體的次優、有偏行為不符,降低了識別準確性。

技術方法: GRAIL 使用模仿學習和逆強化學習,直接從可能次優的示範軌跡中為每個候選目標學習一個目標導向策略。每個學習到的策略捕捉行為偏差和次優性,在單次前向傳播中對每個策略評分——保留了經典方法的一次性推理效率,同時使用建模真實行為的學習策略。

核心發現: GRAIL 在系統性偏差最優行為下 F1 分數提升超過 0.5,在次優行為下提升 0.1-0.3,實現了考慮真實人類行為模式的可擴展且強健的目標識別。


TactAlign: Human-to-Robot Policy Transfer via Tactile Alignment

Authors: Youngsun Wi, Jessica Yin, Elvis Xiang, et al. | Submitted: 2026-02-14 | arXiv: 2602.13579 | Categories: cs.RO

Research Background: Human demonstrations via wearable tactile devices provide fast, dexterous supervision for policy learning, but transferring human tactile signals to robots with different embodiments and sensing modalities remains a fundamental challenge that existing approaches address only for identical sensor setups.

Technical Approach: TactAlign uses a rectified flow to transform human and robot tactile observations into a shared latent representation without paired datasets, manual labels, or privileged information. The method enables cross-embodiment latent transport guided by hand-object interaction-derived pseudo-pairs, effectively bridging the sensing gap between human gloves and robot tactile sensors.
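
For readers unfamiliar with rectified flow, this sketch shows its standard training pair (a point on the straight line between a source and a target sample, with the constant velocity as regression target) and Euler transport at inference. The pseudo-pair construction from hand-object interaction is not modeled, and the vectors are made-up toy latents.

```python
def rectified_flow_pair(x0, x1, t):
    """Training pair: interpolated point x_t and the straight-line velocity target."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_transport(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


human_latent = [0.0, 0.0]   # toy source sample
robot_latent = [1.0, 2.0]   # toy target sample
x_half, v_target = rectified_flow_pair(human_latent, robot_latent, t=0.5)
transported = euler_transport(human_latent, lambda x, t: v_target)
print(x_half, v_target, transported)
```

With a perfectly learned straight-line velocity field, the source sample lands on the target after integration, which is what makes few-step transport practical.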

Key Takeaway: TactAlign improves human-to-robot policy transfer across multiple contact-rich tasks (pivoting, insertion, lid closing), generalizes to unseen objects with under 5 minutes of human data, and achieves zero-shot transfer on a highly dexterous task (light bulb screwing).

研究背景: 透過可穿戴觸覺設備的人類示範為策略學習提供快速靈巧的監督,但將人類觸覺訊號遷移到具有不同機器人形體和感測模態的機器人仍是基本挑戰,現有方法僅解決相同感測器設定的情況。

技術方法: TactAlign 使用修正流將人類和機器人觸覺觀測轉換為共享潛在表示,無需配對資料集、手動標籤或特權資訊。該方法透過手-物互動衍生的偽對配對,實現跨機器人形體潛在傳輸,有效填補人類手套與機器人觸覺感測器之間的感知差距。

核心發現: TactAlign 改善了多個接觸豐富任務(旋轉、插入、蓋蓋子)的人機策略遷移,使用不到 5 分鐘的人類資料泛化到未見物體,並在高度靈巧任務(螺旋燈泡)上實現零樣本遷移。


MOTIF: Learning Action Motifs for Few-shot Cross-Embodiment Transfer

Authors: Heng Zhi, Wentao Tan, Lei Zhu, et al. | Submitted: 2026-02-14 | arXiv: 2602.13764 | Categories: cs.RO, cs.AI

Research Background: Cross-embodiment transfer for VLA models remains challenging due to kinematic heterogeneity and the high cost of real-world demonstration collection. Existing shared-private architectures have limited capacity in private parameters and lack explicit adaptation mechanisms.

Technical Approach: MOTIF decouples embodiment-agnostic spatiotemporal patterns (“action motifs”) from heterogeneous action data using vector quantization with progress-aware alignment and embodiment adversarial constraints to ensure temporal and cross-embodiment consistency. A lightweight predictor infers these motifs from real-time inputs to guide a flow-matching policy, fusing them with robot-specific states for action generation on new embodiments.
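
The vector-quantization step at the core of such a design is a nearest-neighbor lookup in a codebook. This sketch shows only that lookup with a toy codebook. The progress-aware alignment and adversarial constraints are not represented, and all values are invented for illustration.

```python
def quantize(feature, codebook):
    """Return the index and entry of the codebook vector closest to feature."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: squared_distance(feature, codebook[i]))
    return index, codebook[index]


# Toy codebook of three "motifs" in a 2D feature space.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
index, motif = quantize([0.9, 1.2], codebook)
print(index, motif)
```

Because every embodiment's action features are snapped to the same discrete set, the downstream policy can condition on a motif index that means the same thing across robots.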

Key Takeaway: MOTIF significantly outperforms strong baselines in few-shot transfer scenarios by 6.5% in simulation and 43.7% in real-world settings, demonstrating that learning reusable motion primitives is a practical path to efficient cross-embodiment VLA transfer.

研究背景: VLA 模型的跨機器人形體遷移因運動學異質性和真實世界示範收集的高成本而仍具挑戰性。現有共享-私有架構在私有參數容量上有限,缺乏明確的自適應機制。

技術方法: MOTIF 使用帶進度感知對齊和機器人形體對抗約束的向量量化,從異質動作資料中解耦機器人形體無關的時空模式(「動作主題」),以確保時序和跨機器人形體一致性。輕量預測器從即時輸入推斷這些主題以引導流匹配策略,融合機器人特定狀態以在新機器人形體上生成動作。

核心發現: MOTIF 在少樣本遷移場景中模擬環境超越強基線 6.5%,真實世界超越 43.7%,說明學習可重用運動原語是實現高效跨機器人形體 VLA 遷移的實用路徑。


EasyMimic: A Low-Cost Framework for Robot Imitation Learning from Human Videos

Authors: Tao Zhang, Song Xia, Ye Wang, Qin Jin | Submitted: 2026-02-12 | arXiv: 2602.11464 | Categories: cs.RO

Research Background: Robot imitation learning is hindered by the high cost of large-scale real-world data collection, especially for low-cost home robots that must be both user-friendly and affordable. Human videos offer an abundant alternative but require bridging the human-to-robot domain gap.

Technical Approach: EasyMimic extracts 3D hand trajectories from standard RGB camera videos, maps them to gripper control space via an action alignment module, and bridges the visual domain gap with a simple hand visual augmentation strategy. Co-training on processed human data plus a small amount of robot data enables rapid adaptation to new tasks on low-cost platforms (LeRobot).
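
One simple way to map a 3D hand pose to a parallel-gripper command, sketched as an assumption rather than as EasyMimic's actual alignment module: place the gripper at the midpoint of the thumb and index fingertips and derive the opening from the pinch distance. The function name and the 8 cm maximum opening are hypothetical.

```python
import math


def hand_to_gripper(thumb_tip, index_tip, max_opening_m=0.08):
    """Map fingertip positions (meters) to a gripper center and a normalized opening in [0, 1]."""
    pinch = math.dist(thumb_tip, index_tip)
    center = tuple((a + b) / 2 for a, b in zip(thumb_tip, index_tip))
    opening = min(pinch, max_opening_m) / max_opening_m
    return center, opening


center, opening = hand_to_gripper((0.0, 0.0, 0.0), (0.04, 0.0, 0.0))
print(center, opening)
```

Applying this per video frame turns a tracked hand trajectory into a sequence of gripper targets that a low-cost arm could follow.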

Key Takeaway: EasyMimic achieves high performance across various manipulation tasks on a low-cost platform, significantly reducing reliance on expensive robot data collection and offering a practical path for bringing intelligent robots into homes.

研究背景: 機器人模仿學習受大規模真實世界資料收集高成本的阻礙,對必須兼具使用者友好性和低成本的家用機器人尤為如此。人類影片提供豐富的替代方案,但需要填補人機領域差距。

技術方法: EasyMimic 從標準 RGB 相機影片提取 3D 手部軌跡,透過動作對齊模組映射到夾爪控制空間,並以簡單的手部視覺增強策略填補視覺領域差距。在處理後的人類資料加少量機器人資料上協同訓練,實現在低成本平台(LeRobot)上快速適應新任務。

核心發現: EasyMimic 在低成本平台的多種操作任務上達到高效能,大幅降低對昂貴機器人資料收集的依賴,為將智慧機器人帶入家庭提供實用路徑。


Scaling Single Human Demonstrations for Imitation Learning using Generative Foundational Models

Authors: Nick Heppert, Minh Quang Nguyen, Abhinav Valada | Submitted: 2026-02-13 | arXiv: 2602.12734 | Categories: cs.RO

Research Background: Imitation learning requires large numbers of robot demonstrations, which are expensive to collect. Single human demonstrations are far easier to obtain but transfer to robots is non-trivial due to embodiment differences.

Technical Approach: Real2Gen extracts key information from a single human demonstration and transfers it to a simulation environment, where a programmable expert agent demonstrates the task arbitrarily many times to generate unlimited training data for a flow-matching policy. The purely simulation-trained policy is then deployed zero-shot in the real world, leveraging generative foundation models for scalable data augmentation.

Key Takeaway: Real2Gen shows an average 26.6% increase in success rate compared to a recent baseline, with better policy generalization due to training data abundance and diversity, and demonstrates zero-shot real-world deployment from simulation-only training.

研究背景: 模仿學習需要大量機器人示範,收集成本高昂。單一人類示範雖更易獲得,但因機器人形體差異而難以遷移到機器人。

技術方法: Real2Gen 從單一人類示範提取關鍵資訊,將其遷移到模擬環境,可程式化的專家智能體在此任意多次示範任務,生成無限量的流匹配策略訓練資料。純模擬訓練的策略隨後在真實世界零樣本部署,利用生成式基礎模型進行可擴展資料增強。

核心發現: Real2Gen 與近期基線相比平均成功率提升 26.6%,由於訓練資料豐富多樣而具有更好的策略泛化能力,並展示了從純模擬訓練的零樣本真實世界部署。


Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning

Authors: Runze Tang, Penny Sweetser | Submitted: 2026-02-11 | arXiv: 2602.10594 | Categories: cs.RO, cs.LG

Research Background: Few-shot imitation learning can leverage human videos to reduce robot demonstration requirements, but prior flow-based methods focus on object flow or specific hand/robot points rather than interaction motion, and struggle to generalize to scenarios seen only in human videos.

Technical Approach: SFCrP combines a Scene Flow prediction model for Cross-embodiment learning (SFCr) that learns from both robot and human videos and predicts any-point trajectories, with a Flow and Cropped point cloud conditioned Policy (FCrP) that follows the general flow motion and adjusts actions based on observations for precision. The approach uses dense scene-level optical flow as the cross-embodiment representation.

Key Takeaway: SFCrP outperforms state-of-the-art baselines across various real-world task settings and exhibits strong spatial and instance generalization to scenarios seen only in human videos, demonstrating that scene-level flow is a superior cross-embodiment representation.

研究背景: 少樣本模仿學習可利用人類影片減少機器人示範需求,但先前的流方法聚焦於物體流或特定手/機器人點而非互動運動,難以泛化到僅在人類影片中見過的場景。

技術方法: SFCrP 結合跨機器人形體學習的場景流預測模型(SFCr)——同時從機器人和人類影片學習並預測任意點軌跡——以及流和裁剪點雲條件化策略(FCrP),跟隨一般流動運動並根據觀測調整動作以提高精度。

核心發現: SFCrP 在各種真實世界任務設定中超越最先進基線,對僅在人類影片中見過的場景表現出強大的空間和實例泛化能力,說明場景級流是更優越的跨機器人形體表示。


EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

Authors: Modi Shi, Shijia Peng, Jin Chen, et al. | Submitted: 2026-02-10 | arXiv: 2602.10106 | Categories: cs.RO

Research Background: Human demonstration data scales naturally and offers rich environmental diversity, but its potential for humanoid loco-manipulation — far more data-hungry than arm manipulation — remains largely unexplored due to the substantial embodiment gap between humans and humanoid robots.

Technical Approach: EgoHumanoid co-trains a VLA policy using abundant egocentric human demonstrations alongside limited robot data. A systematic alignment pipeline spans hardware design to data processing: a portable system enables scalable human data collection, view alignment reduces visual domain discrepancies from camera height and perspective variation, and action alignment maps human motions into a unified kinematically feasible action space.

Key Takeaway: Incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51%, particularly in unseen environments, establishing the first practical framework for scaling humanoid loco-manipulation with human video data.

研究背景: 人類示範資料天然可擴展且提供豐富的環境多樣性,但其在人形機器人移動操作上的潛力——比手臂操作需要更多資料——因人類與人形機器人之間巨大的機器人形體差距而基本上未被探索。

技術方法: EgoHumanoid 使用豐富的自中心人類示範與有限機器人資料共同訓練 VLA 策略。系統性對齊管線從硬體設計延伸到資料處理:便攜式系統實現可擴展的人類資料收集,視角對齊減少相機高度和視角差異帶來的視覺領域差距,動作對齊將人類運動映射到統一的運動學可行動作空間。

核心發現: 納入無機器人自中心資料比純機器人基線顯著提升 51%,尤其在未見環境中,建立了第一個使用人類影片資料擴展人形機器人移動操作的實用框架。


DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos

Authors: Juncheng Mu, Sizhe Yang, Yiming Bao, et al. | Submitted: 2026-02-10 | arXiv: 2602.10105 Categories: cs.RO

Research Background: Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation — real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos offer scale but the embodiment gap between human and robotic dexterous hands makes direct pretraining extremely challenging.

Technical Approach: DexImit automatically converts monocular human manipulation videos into physically plausible robot data through a four-stage pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) decomposing the task into subtasks and scheduling the two hands; (3) synthesizing robot trajectories consistent with the demonstrated interactions; and (4) applying comprehensive data augmentation for zero-shot real-world deployment. The pipeline accepts both internet videos and the outputs of video generation models.

Key Takeaway: DexImit handles diverse manipulation tasks including tool use (apple cutting), long-horizon tasks (making a beverage), and fine-grained manipulations (stacking cups), providing a scalable pipeline for generating large-scale dexterous robot training data from human videos.

研究背景: 資料匱乏根本限制了雙臂靈巧操作的泛化——靈巧手的真實世界資料收集昂貴且耗時。人類操作影片提供規模,但人類與機器人靈巧手之間的機器人形體差距使直接預訓練極具挑戰性。

技術方法: DexImit 透過四階段管線自動將單目人類操作影片轉換為物理上合理的機器人資料:(1) 從任意視角重建具有近似公制尺度的手-物互動;(2) 子任務分解和雙臂調度;(3) 合成與示範互動一致的機器人軌跡;(4) 全面的資料增強用於零樣本真實部署。

核心發現: DexImit 處理多樣化操作任務,包括工具使用(切蘋果)、長視野任務(調製飲料)和細粒度操作(疊杯子),為從人類影片生成大規模靈巧機器人訓練資料提供可擴展管線。


How Do We Research Human-Robot Interaction in the Age of Large Language Models? A Systematic Review

Authors: Yufeng Wang, Yuan Xu, Anastasia Nikolova, et al. | Submitted: 2026-02-13 | arXiv: 2602.15063 Categories: cs.RO, cs.HC

Research Background: LLMs are reshaping HRI. Their technical potential has been highlighted, but their human-centered impact has not been systematically examined: how they affect human-oriented understanding, user modeling, and autonomy levels in deployed HRI systems.

Technical Approach: A systematic literature review following PRISMA guidelines identified 86 articles meeting inclusion criteria. The analysis reveals two key findings: (1) LLMs are transforming HRI fundamentals by reshaping robot sensing of context, socially grounded interaction generation, and continuous alignment with human needs in embodied settings; (2) current research is largely exploratory, with wide-ranging choices of experimental setups, evaluation metrics, and study methods.

Key Takeaway: The field lacks standardized evaluation frameworks for LLM-driven HRI systems, and future research must prioritize human-centered metrics — user modeling, trust, autonomy calibration — alongside technical performance benchmarks.

研究背景: LLM 正在重塑 HRI,但儘管技術潛力已獲強調,目前尚無對其以人為中心影響的系統性研究——它們如何影響以人為導向的理解、用戶建模,以及部署 HRI 系統中的自主層級。

技術方法: 遵循 PRISMA 指南的系統性文獻回顧識別了 86 篇符合納入標準的文章。分析揭示兩個關鍵發現:(1) LLM 正在透過重塑機器人的情境感知、社會基礎互動生成,以及在具身環境中與人類需求的持續對齊,改變 HRI 基礎;(2) 目前研究很大程度上仍在探索階段,實驗設置、評估指標和研究方法選擇各異。

核心發現: 領域缺乏 LLM 驅動 HRI 系統的標準化評估框架,未來研究必須在技術效能基準之外優先考慮以人為中心的指標——用戶建模、信任度和自主性校準。


Ontological Grounding for Sound and Natural Robot Explanations via Large Language Models

Authors: Alberto Olivares-Alarcos, Muhammad Ahsan, Satrio Sanjaya, et al. | Submitted: 2026-02-14 | arXiv: 2602.13800 Categories: cs.RO, cs.HC

Research Background: Effective HRI requires robots to produce explanations that are both logically sound and communicated naturally. LLMs generate fluent language but lack semantic grounding; ontology-based systems ensure consistency but produce rigid, unnatural output.

Technical Approach: A hybrid framework blends ontology-based reasoning with LLMs: ontologies ensure logical consistency and domain grounding by enabling robots to reason about typical versus atypical events, while LLMs provide fluent, context-aware, adaptive language generation. A state-of-the-art algorithm retrieves and constructs static contrastive ontology-based narratives that an LLM agent converts into concise, interactive explanations.

Key Takeaway: The system significantly improves clarity and brevity of ontology-based narratives while preserving semantic accuracy, and adapts explanations to user feedback — advancing explainable robot agency for transparent human-robot collaboration.

研究背景: 有效的 HRI 要求機器人產生既邏輯合理又自然傳達的解釋。LLM 生成流暢語言但缺乏語意基礎;基於本體論的系統確保一致性但產生僵硬、不自然的輸出。

技術方法: 混合框架將基於本體論的推理與 LLM 融合:本體論透過使機器人能夠推理典型與非典型事件確保邏輯一致性和領域基礎,而 LLM 提供流暢、情境感知、自適應的語言生成。最先進演算法檢索並構建靜態對比性本體論敘述,LLM 智能體將其轉換為簡潔的互動式解釋。

核心發現: 系統在保留語意準確性的同時顯著改善基於本體論敘述的清晰度和簡潔性,並能根據用戶反饋調整解釋——推進了透明人機協作的可解釋機器人能動性。


Disambiguating Anthropomorphism and Anthropomimesis in Human-Robot Interaction

Authors: Minja Axelsson, Henry Shevlin | Submitted: 2026-02-10 | arXiv: 2602.09287 Categories: cs.RO, cs.HC

Research Background: The concepts of anthropomorphism and anthropomimesis are often conflated in HRI and social robotics literature, creating theoretical confusion about which party — user or designer — is responsible for human-like qualities in robots.

Technical Approach: This theoretical paper proposes clear definitions: anthropomorphism refers to users perceiving human-like qualities in robots (a cognitive process in the perceiver), while anthropomimesis refers to robot developers intentionally designing human-like features into robots (a design decision by the creator). The disambiguation provides a conceptual framework for future scholarship.

Key Takeaway: Distinguishing the perceiver-driven and designer-driven aspects of human-likeness in robots enables more precise experimental design, evaluation, and ethical analysis in HRI research.

研究背景: 擬人主義和擬人模仿的概念在 HRI 和社交機器人文獻中常被混淆,關於使用者還是設計師應對機器人的類人特質負責產生理論混亂。

技術方法: 這篇理論論文提出清晰定義:擬人主義指用戶感知機器人的類人特質(感知者的認知過程),擬人模仿指機器人開發者有意將類人特徵設計進機器人(創作者的設計決策)。這種消歧提供了未來研究的概念框架。

核心發現: 區分機器人類人特性中的感知者驅動和設計者驅動層面,使 HRI 研究中的實驗設計、評估和倫理分析更加精確。


A Latency-Aware Framework for Visuomotor Policy Learning on Industrial Robots

Authors: Daniel Ruan, Salma Mozaffari, Sigrid Adriaenssens, Arash Adel | Submitted: 2026-02-15 | arXiv: 2602.14255 Categories: cs.RO

Research Background: Industrial robots face much larger observation-execution gaps than research robots due to high-level interfaces and slower closed-loop dynamics, making latency a critical system-level issue for deploying learned visuomotor policies in contact-rich manufacturing tasks.

Technical Approach: The framework integrates calibrated multimodal sensing, temporally consistent synchronization, a unified communication pipeline, and a teleoperation interface for demonstration collection. A latency-aware execution strategy schedules the finite-horizon action sequences predicted by the policy according to their temporal feasibility, enabling asynchronous inference and execution. Because neither the policy architecture nor its training is modified, the strategy can be applied to any existing policy.
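Illustrative Sketch: One simple way to read "scheduling by temporal feasibility" is to drop the actions in a predicted chunk whose intended execution time has already passed by the time inference finishes. The snippet below is a minimal sketch under that assumption. The function `feasible_actions`, the fixed action spacing `dt_ms`, and the integer-millisecond timestamps are hypothetical and are not taken from the paper.

```python
def feasible_actions(chunk, t_obs_ms, dt_ms, t_now_ms):
    """Return (execute_time_ms, action) pairs that can still be executed.

    Assumption: chunk[i] is intended for time t_obs_ms + i * dt_ms, where
    t_obs_ms is when the observation behind this chunk was captured.
    """
    elapsed_ms = max(0, t_now_ms - t_obs_ms)
    first = -(-elapsed_ms // dt_ms)  # ceiling division: first action not in the past
    return [(t_obs_ms + i * dt_ms, chunk[i]) for i in range(first, len(chunk))]


if __name__ == "__main__":
    chunk = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
    # Observation at t=0 ms, actions spaced 100 ms apart, inference done at 250 ms:
    # a0..a2 are stale, so execution starts from a3 at t=300 ms.
    for when, action in feasible_actions(chunk, 0, 100, 250):
        print(when, action)
```

A filter of this kind sits between the policy and the controller, which is consistent with the claim that the policy itself needs no changes.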

Key Takeaway: Latency-aware execution maintains smooth motion, compliant contact behavior, and consistent task progression across a wide range of latencies while reducing idle time, providing a practical deployment framework for visuomotor policies on industrial robots.

研究背景: 工業機器人因高層介面和較慢的閉迴路動態,面臨比研究機器人大得多的觀測-執行差距,使延遲成為在接觸豐富製造任務中部署學習型視覺運動策略的關鍵系統層級問題。

技術方法: 框架整合校準的多模態感知、時序一致同步、統一通訊管線和示範收集的遙操作介面。延遲感知執行策略根據時序可行性調度有限視野策略預測的動作序列,無需修改策略架構或訓練即可實現非同步推理和執行。

核心發現: 延遲感知執行在廣泛延遲範圍內保持平滑運動、順應接觸行為和一致任務進展,同時減少空閒時間,為工業機器人上的視覺運動策略提供實用部署框架。


HybridFlow: A Two-Step Generative Policy for Robotic Manipulation

Authors: Zhenchen Dong, Jinna Fu, Jiaming Wu, et al. | Submitted: 2026-02-14 | arXiv: 2602.13718 Categories: cs.RO, cs.AI

Research Background: Inference latency limits real-time interaction capability for robot manipulation policies. While flow matching is replacing diffusion methods for speed, robotics demands even faster generation without sacrificing action precision.

Technical Approach: HybridFlow proposes a three-stage method that needs only 2 function evaluations (NFE = 2): Global Jump, which uses MeanFlow one-step generation; ReNoise, for distribution alignment; and Local Refine, which uses ReFlow for precision. The hybrid exploits MeanFlow’s one-step speed while preserving action quality with minimal steps, cutting inference time from 152 ms to 19 ms (an 8x speedup, about 52 Hz) relative to a 16-step Diffusion Policy.
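Illustrative Sketch: The three stages can be pictured as a sampler that calls a network twice. The snippet below is a toy sketch under stated assumptions, not the authors' implementation. It assumes a time convention of t=0 for noise and t=1 for data, assumes ReNoise is a network-free blend with fresh noise, and uses hand-written stand-ins (`mean_velocity`, `velocity`) in place of trained MeanFlow and ReFlow networks. The parameter `t_mid` is hypothetical.

```python
import numpy as np


def hybrid_sample(mean_velocity, velocity, noise, fresh_noise, t_mid=0.8):
    """Two-evaluation sampler: global jump, re-noise, local refine."""
    # Stage 1, Global Jump (1 evaluation): one-step move from noise toward data.
    coarse = noise + mean_velocity(noise)
    # Stage 2, ReNoise (0 evaluations): blend the coarse sample back to time
    # t_mid on a straight noise-to-data path.
    x_mid = (1.0 - t_mid) * fresh_noise + t_mid * coarse
    # Stage 3, Local Refine (1 evaluation): a single Euler step from t_mid to t=1.
    return x_mid + (1.0 - t_mid) * velocity(x_mid, t_mid)


if __name__ == "__main__":
    target = np.array([0.5, -0.2])  # toy "correct action"
    rng = np.random.default_rng(0)

    def mean_velocity(x):
        # Deliberately imperfect one-step model: covers only 90% of the distance.
        return 0.9 * (target - x)

    def velocity(x, t):
        # Exact velocity of a straight path that ends at `target` at t=1.
        return (target - x) / (1.0 - t)

    action = hybrid_sample(mean_velocity, velocity,
                           rng.normal(size=2), rng.normal(size=2))
    print(action)
```

In this toy setting the refine step corrects the error left by the imperfect jump. The reported latency figures are also consistent with each other: 152 / 19 = 8, and 1000 ms / 19 ms is about 52.6 evaluations per second.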

Key Takeaway: HybridFlow outperforms 16-step Diffusion Policy by 15-25% in success rate at 8x faster inference (~52Hz), and achieves 70.0% success on unseen-color OOD grasping and 66.3% on deformable object folding, establishing it as a practical low-latency policy for interactive manipulation.

研究背景: 推論延遲限制了機器人操作策略的即時互動能力。雖然流匹配正以速度取代擴散方法,但機器人領域需要在不犧牲動作精度的前提下實現更快的生成。

技術方法: HybridFlow 提出具有 2 個神經函數評估(NFE)的三階段方法:使用 MeanFlow 單步生成的全局跳躍、用於分佈對齊的 ReNoise,以及使用 ReFlow 進行精度的局部精煉。這種混合方法利用 MeanFlow 的單步速度優勢,同時以最少步驟確保動作品質——相比 16 步擴散策略將推論時間從 152ms 降至 19ms(8 倍加速,約 52Hz)。

核心發現: HybridFlow 在 8 倍更快推論速度(約 52Hz)下,成功率比 16 步擴散策略高 15-25%,在未見顏色 OOD 抓取上達到 70.0% 成功率,可變形物體折疊達到 66.3%,確立了作為互動操作實用低延遲策略的地位。


Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

Authors: Kevin Yuchen Ma, Heng Zhang, Weisi Lin, et al. | Submitted: 2026-02-14 | arXiv: 2602.13833 Categories: cs.RO

Research Background: Tool manipulation requires both semantic planning and precise physical contact control, but VLA models lack high-fidelity physical grounding while contact-aware policies are typically instance-specific and fail to generalize across diverse tool geometries.

Technical Approach: Semantic-Contact Fields (SCFields) fuse visual semantics with dense contact estimates in a unified 3D representation. A two-stage sim-to-real pipeline first pretrains on large simulation data to learn general contact physics, then finetunes on a small real dataset pseudo-labeled via geometric heuristics and force optimization. SCFields serve as dense observation inputs to a diffusion policy for contact-rich tool tasks.
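Illustrative Sketch: A unified 3D representation that carries both semantics and contact can be pictured as a point cloud with extra per-point channels. The snippet below is a minimal sketch under assumptions: the channel layout (xyz, then semantic features, then one contact probability) and the function name `build_contact_field` are hypothetical, and the paper's actual representation may differ.

```python
import numpy as np


def build_contact_field(xyz, semantic_feat, contact_prob):
    """Stack per-point geometry, semantics, and contact into one (N, 3+D+1) array."""
    if not (len(xyz) == len(semantic_feat) == len(contact_prob)):
        raise ValueError("all inputs must describe the same N points")
    # contact_prob is (N,), so add a trailing axis before concatenating.
    return np.concatenate([xyz, semantic_feat, contact_prob[:, None]], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_points, feat_dim = 5, 4
    field = build_contact_field(
        rng.normal(size=(n_points, 3)),         # point positions
        rng.normal(size=(n_points, feat_dim)),  # visual-semantic features
        rng.uniform(size=n_points),             # estimated contact probability
    )
    print(field.shape)
```

A per-point array like this can be fed to a point-cloud encoder as the observation input of a downstream policy.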

Key Takeaway: SCFields achieve robust category-level generalization on scraping, crayon drawing, and peeling tasks, significantly outperforming vision-only and raw-tactile baselines. The results indicate that semantically grounded contact representations can bridge the gap between VLA generalization and contact-rich precision.

研究背景: 工具操作同時需要語意規劃和精確物理接觸控制,但 VLA 模型缺乏高保真物理基礎,而接觸感知策略通常特定於單個實例且無法在不同工具幾何形狀間泛化。

技術方法: 語意接觸場(SCFields)在統一 3D 表示中融合視覺語意與密集接觸估計。兩階段模擬到真實管線先在大型模擬資料集上預訓練學習一般接觸物理,再在透過幾何啟發式和力優化偽標記的少量真實資料上微調。SCFields 作為擴散策略的密集觀測輸入用於接觸豐富的工具任務。

核心發現: SCFields 在刮削、蠟筆繪畫和剝皮任務上實現強健的類別級泛化,顯著優於僅視覺和原始觸覺基線,說明語意基礎的接觸表示可以填補 VLA 泛化與接觸豐富精度之間的差距。


CLOT: Closed-Loop Global Motion Tracking for Whole-Body Humanoid Teleoperation

Authors: Tengjie Zhu, Guanyu Cai, Yang Zhaohui, et al. | Submitted: 2026-02-13 | arXiv: 2602.15060 Categories: cs.RO, cs.AI

Research Background: Long-horizon whole-body humanoid teleoperation accumulates global pose drift because learning-based tracking methods operate in the robot’s local frame and neglect global pose feedback, causing instability over extended execution.

Technical Approach: CLOT achieves drift-free human-to-humanoid mimicry via high-frequency localization feedback that synchronizes operator and robot poses in a closed loop. Rewarding global tracking directly in RL tends to produce aggressive corrections. To avoid this, CLOT uses a data-driven randomization strategy that decouples observation trajectories from reward evaluation, and adds an adversarial motion prior as a regularizer that suppresses unnatural behaviors. A transformer-based policy, trained for more than 1,300 GPU hours on 20 hours of curated human motion data, is deployed on a 31-DoF full-sized humanoid.
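Illustrative Sketch: The motivation for closing the loop on global pose can be seen in a one-dimensional toy problem. The snippet below is a generic control illustration, not CLOT's learned policy: a tracker that only follows local deltas accumulates a small per-step bias without bound, while adding feedback on the global position error keeps the error bounded. The function `track` and the `bias` and `gain` values are hypothetical.

```python
def track(reference, bias, gain):
    """Follow `reference` positions with a constant per-step actuation bias.

    gain = 0 tracks local deltas only (open loop in the global frame);
    gain > 0 adds a correction proportional to the global position error.
    Returns the absolute global error after each step.
    """
    pos = reference[0]
    errors = []
    for prev, target in zip(reference, reference[1:]):
        step = (target - prev) + gain * (prev - pos)  # local delta + global feedback
        pos += step + bias                            # bias models unmodeled drift
        errors.append(abs(target - pos))
    return errors


if __name__ == "__main__":
    reference = [0.1 * i for i in range(101)]  # 100 steps of steady forward motion
    open_loop = track(reference, bias=0.01, gain=0.0)
    closed_loop = track(reference, bias=0.01, gain=0.5)
    print("open-loop final error:  ", open_loop[-1])
    print("closed-loop final error:", closed_loop[-1])
```

In this toy model the error follows e_next = (1 - gain) * e + bias, so it grows linearly when gain is 0 and settles near bias / gain otherwise.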

Key Takeaway: CLOT demonstrates high-dynamic motion, high-precision tracking, and strong robustness in sim-to-real humanoid teleoperation, addressing the global pose drift problem that limits sustained real-world whole-body teleoperation.

研究背景: 長視野全身人形機器人遙操作會積累全局姿態漂移,因為基於學習的追蹤方法在機器人局部坐標系中運作並忽略全局姿態反饋,導致長時間執行後的不穩定性。

技術方法: CLOT 透過高頻定位反饋在閉迴路中同步操作員和機器人姿態,實現無漂移的人機模仿。為防止直接全局追蹤獎勵在 RL 中造成的激進校正,CLOT 使用解耦觀測軌跡與獎勵評估的資料驅動隨機化策略,並以對抗性運動先驗正則化以抑制不自然行為。基於 Transformer 的策略在 20 小時精心策劃的人類運動資料上訓練超過 1300 GPU 小時,部署在 31 自由度全尺寸人形機器人上。

核心發現: CLOT 在模擬到真實人形機器人遙操作中展示了高動態運動、高精度追蹤和強健魯棒性,解決了持續真實世界全身遙操作的全局姿態漂移問題。


MOSAIC: Bridging the Sim-to-Real Gap in Generalist Humanoid Motion Tracking and Teleoperation with Rapid Residual Adaptation

Authors: Zhenguo Sun, Bo-Sheng Huang, Yibo Peng, et al. | Submitted: 2026-02-09 | arXiv: 2602.08594 Categories: cs.RO

Research Background: Generalist humanoid motion trackers trained in simulation remain brittle on hardware during sustained teleoperation due to interface- and dynamics-induced errors that standard sim-to-real transfer methods do not adequately address.

Technical Approach: MOSAIC learns a teleoperation-oriented general motion tracker via RL on a multi-source motion bank with adaptive resampling and world-frame motion consistency rewards. To bridge the sim-to-real interface gap without sacrificing generality, it performs rapid residual adaptation: an interface-specific policy trained on minimal data is distilled into the general tracker through an additive residual module — outperforming naive finetuning or continual learning approaches.
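Illustrative Sketch: The additive residual idea can be pictured with linear stand-ins for the policies. The snippet below is a toy sketch under assumptions, not MOSAIC's training procedure: the general tracker is frozen, an interface-specific "teacher" differs from it slightly, and a residual term is fit by least squares so that general plus residual matches the teacher. All weights, names, and the use of `numpy.linalg.lstsq` in place of neural-network distillation are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.normal(size=(256, 4))                        # toy observations
W_general = rng.normal(size=(4, 2))                    # frozen general tracker
W_teacher = W_general + 0.1 * rng.normal(size=(4, 2))  # interface-specific teacher


def general_policy(o):
    """Frozen general tracker (linear stand-in)."""
    return o @ W_general


# Distillation target for the residual: what the teacher does beyond the
# general tracker on the same observations.
residual_target = obs @ W_teacher - general_policy(obs)
W_residual, *_ = np.linalg.lstsq(obs, residual_target, rcond=None)


def adapted_policy(o):
    """General tracker plus the additive, interface-specific residual."""
    return general_policy(o) + o @ W_residual


if __name__ == "__main__":
    gap_before = np.abs(general_policy(obs) - obs @ W_teacher).max()
    gap_after = np.abs(adapted_policy(obs) - obs @ W_teacher).max()
    print("max gap before adaptation:", gap_before)
    print("max gap after adaptation: ", gap_after)
```

Because the general weights are never updated, setting the residual to zero recovers the original tracker, which is one way to see how an additive module can adapt to an interface without sacrificing generality.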

Key Takeaway: MOSAIC achieves robust offline motion replay and online long-horizon teleoperation under realistic latency and noise conditions, with the additive residual distillation approach providing a principled and efficient path to interface-specific sim-to-real adaptation.

研究背景: 在模擬中訓練的通用人形機器人運動追蹤器在持續遙操作期間,因介面和動態引起的誤差而在硬體上仍然脆弱,標準的模擬到真實遷移方法無法充分解決這些問題。

技術方法: MOSAIC 透過在多源運動庫上進行 RL,以自適應重採樣和世界坐標系運動一致性獎勵,學習以遙操作為導向的通用運動追蹤器。為不犧牲通用性地填補模擬到真實介面差距,它執行快速殘差自適應:在最少資料上訓練的介面特定策略通過加法殘差模組蒸餾到通用追蹤器中——優於樸素微調或持續學習方法。

核心發現: MOSAIC 在現實延遲和噪聲條件下實現了強健的離線運動重放和在線長視野遙操作,加法殘差蒸餾方法為介面特定的模擬到真實自適應提供了有原則且高效的路徑。