arXiv Weekly Digest — Week 07, 2026

Fetched: 2026-02-08 | Categories: cs.RO, cs.LG, cs.HC, cs.CV | Papers: 15

Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Authors: Yalcin Tur, Jalal Naghiyev, Haoquan Fang et al. | Submitted: 2026-02-08 | arXiv: 2602.07845 | Categories: cs.RO

Research Background: Current VLA models allocate the same computation to trivial adjustments and to complex multi-step manipulation, an inefficiency that limits performance on hard tasks. Test-time scaling via Chain-of-Thought token generation is poorly suited to continuous action spaces, and its memory use grows linearly with the number of generated tokens.

Technical Approach: RD-VLA introduces a recurrent, weight-tied action head that performs latent iterative refinement at inference time. The model is trained with truncated backpropagation through time (TBPTT) and uses an adaptive stopping criterion based on latent convergence, so it spends more compute only when the task demands it, without increasing memory.
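The stopping rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tolerance `tol`, the iteration cap `max_iters`, and the toy `make_step` update (a simple contraction standing in for a learned recurrent block) are all assumptions made for illustration.

```python
import math

def refine_latent(z, step, tol=1e-3, max_iters=16):
    """Apply the same (weight-tied) update `step` repeatedly until the latent
    moves by less than `tol`, or until `max_iters` is reached.
    Only the current latent is kept, so memory does not grow with depth."""
    for i in range(1, max_iters + 1):
        z_next = step(z)
        # Euclidean distance between consecutive latents is the convergence signal.
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_next, z)))
        z = z_next
        if delta < tol:
            break
    return z, i

def make_step(target, rate):
    # Hypothetical toy update: contract toward `target` at speed `rate`.
    def step(z):
        return [zi + rate * (ti - zi) for zi, ti in zip(z, target)]
    return step

# An "easy" input starts near the fixed point and converges quickly;
# a "hard" input starts far away and converges slowly.
easy = refine_latent([0.9, 0.9], make_step([1.0, 1.0], rate=0.9))
hard = refine_latent([-5.0, 4.0], make_step([1.0, 1.0], rate=0.5))
print("easy iterations:", easy[1], "hard iterations:", hard[1])
```

In this toy setting the harder input uses more refinement iterations than the easy one, which is the behavior an adaptive stopping criterion is meant to produce.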

Key Takeaway: Tasks that fail completely (0% success) with single-iteration inference exceed 90% success with four iterations, and RD-VLA runs up to 80x faster than prior reasoning-based VLA models.

Differentiate-and-Inject: Enhancing VLAs via Functional Differentiation Induced by In-Parameter Structural Reasoning

Authors: Jingyi Hou, Leyu Zhou, Chenchen Jing et al. | Submitted: 2026-02-07 | arXiv: 2602.07541 | Categories: cs.RO

Research Background: VLA models struggle with task-level structural reasoning: they either rely on unstable prompt-based decomposition or require large-scale end-to-end training that entangles planning and control. A principled way to embed task structure into model parameters remains an open problem.

Technical Approach: iSTAR embeds implicit dynamic scene-graph knowledge (object relations, subtask semantics, and task-level dependencies) directly into model parameters. This enables functional differentiation between task-level inference and low-level control without external planners or handcrafted prompts.

Key Takeaway: Parameter-space structural reasoning yields more reliable task decompositions and higher success rates than both in-context and end-to-end VLA baselines across diverse manipulation benchmarks.

LIBERO-X: Robustness Litmus for Vision-Language-Action Models

Authors: Guodong Wang, Chenkai Zhang, Qingjie Liu et al. | Submitted: 2026-02-06 | arXiv: 2602.06556 | Categories: cs.CV, cs.AI, cs.RO

Research Background: Existing VLA benchmarks provide limited or misleading assessments because their evaluation protocols fail to capture real-world distribution shifts in spatial layout, object appearance, and instruction variation. The result is an overly optimistic picture of VLA progress.

Technical Approach: LIBERO-X introduces a hierarchical evaluation protocol with progressive difficulty across three core capabilities: spatial generalization, object recognition, and task instruction understanding. It also provides a high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives.

Key Takeaway: Representative VLA models show significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding.

Action Hallucination in Generative Visual-Language-Action Models

Authors: Harold Soh, Eugene Lim | Submitted: 2026-02-06 | arXiv: 2602.06339 | Categories: cs.RO, cs.AI

Research Background: Generative VLA models are rapidly replacing hand-designed robot planners, but it remains unclear whether they truly resolve fundamental robotics challenges, especially the generation of physically infeasible actions.

Technical Approach: The authors analyze action hallucinations through the lens of latent-variable generative policies, identifying three structural barriers (topological, precision, and horizon) that cause unavoidable tradeoffs between expressiveness and physical feasibility. Each barrier is analyzed mechanistically to explain empirical failure patterns.

Key Takeaway: Action hallucinations are not implementation bugs but structural consequences of current model architectures, suggesting that principled fixes are needed beyond scaling data alone.

Authors: Yuxuan Hu, Xiangyu Chen, Chuhao Zhou et al. | Submitted: 2026-02-07 | arXiv: 2602.07388 | Categories: cs.RO

Research Background: In long-horizon manipulation tasks, visually identical observations can appear at different execution stages that require different actions, a problem called multi-modal action ambiguity (MA2). Policies conditioned only on current observations cannot disambiguate these situations.

Technical Approach: Trace-Focused Diffusion Policy (TF-DP) conditions action generation on an explicit execution trace: the robot’s motion history projected into the visual observation space. A trace-focused attention field highlights task-relevant regions associated with historical motion, improving both disambiguation and robustness to background distractors.
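One plausible way to project a motion history into image space is a standard pinhole camera model, sketched below. The digest does not say how TF-DP performs the projection, so the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the binary-mask rendering are assumptions chosen only to make the idea concrete.

```python
def project_trace(points_cam, fx, fy, cx, cy, width, height):
    """Project 3D points (camera frame, metres) to integer pixel coordinates
    with a pinhole model, dropping points behind the camera or off-image."""
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:
            continue  # behind the camera plane, not visible
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels

def rasterize(pixels, width, height):
    """Draw the trace as a binary mask with the same size as the image."""
    mask = [[0] * width for _ in range(height)]
    for u, v in pixels:
        mask[v][u] = 1
    return mask

# Hypothetical end-effector history expressed in the camera frame.
history = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.1, 0.1, 2.0), (0.0, 0.0, -1.0)]
trace = project_trace(history, fx=100.0, fy=100.0, cx=32.0, cy=24.0,
                      width=64, height=48)
mask = rasterize(trace, 64, 48)
print(trace)
```

A mask like this could be stacked onto the RGB observation as an extra channel, so that two visually identical frames with different histories become distinguishable inputs.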

Key Takeaway: TF-DP outperforms vanilla diffusion policy by 80.56% on multi-modal action ambiguity tasks and by 86.11% under visual disturbances, with only 6.4% runtime overhead at inference.

Force Generative Imitation Learning: Bridging Position Trajectory and Force Commands through Control Technique

Authors: Hiroshi Sato, Sho Sakaino, Toshiaki Tsuji | Submitted: 2026-02-06 | arXiv: 2602.06620 | Categories: cs.RO, eess.SY

Research Background: Contact-rich robot tasks require precise force control, but force commands are hardware-specific and difficult to obtain from demonstrations. Position trajectories are far easier to collect, leaving a gap between available data and force control needs.

Technical Approach: A force generative model estimates force commands from given position trajectories. To handle unseen trajectories, the authors incorporate a feedback control mechanism and find that models without memory enable stable convergence, while models with memory cause feedback instability.

Key Takeaway: A memoryless force generative model combined with feedback control effectively generalizes force command generation to unseen position trajectories for real-world writing tasks.

Beyond the Majority: Long-tail Imitation Learning for Robotic Manipulation

Authors: Junhong Zhu, Ji Zhang, Jingkuan Song et al. | Submitted: 2026-02-06 | arXiv: 2602.06512 | Categories: cs.RO

Research Background: Generalist robot policies learned from demonstration data suffer from long-tail distribution imbalance: a few data-rich head tasks dominate training, leaving data-scarce tail tasks underperforming. Standard long-tail learning remedies (resampling, reweighting) fail to address this in robotics.

Technical Approach: The authors diagnose that data scarcity on tail tasks specifically impairs spatial reasoning capability. They introduce Approaching-Phase Augmentation (APA), which transfers manipulation knowledge from head tasks to tail tasks by augmenting the approaching phase of demonstrations, without requiring additional data.

Key Takeaway: APA significantly improves tail-task performance in both simulation and real-world experiments without degrading head-task accuracy.

RoboPaint: From Human Demonstration to Any Robot and Any View

Authors: Jiacheng Fan, Zhiyue Zhao, Yiqian Zhang et al. | Submitted: 2026-02-05 | arXiv: 2602.05325 | Categories: cs.RO

Research Background: Scaling VLA models for dexterous manipulation requires large robot demonstration datasets, but direct teleoperation is expensive and hard to scale. Human demonstrations are cheaper to collect but require bridging the morphological gap between human hands and robot end-effectors.

Technical Approach: RoboPaint proposes a Real-Sim-Real pipeline in three stages. First, standardized rooms capture synchronized multimodal human demonstrations (RGB-D, glove joint angles, tactile). Second, tactile-aware retargeting maps human hand states to robot dex-hand states via geometry- and force-guided optimization. Third, the retargeted motions are rendered photorealistically in Isaac Sim.

Key Takeaway: VLA policies (Pi0.5) trained solely on RoboPaint-generated data achieve 80% average success on three representative tasks, and the retargeting pipeline achieves 84% success across ten diverse object manipulation tasks.

Feasibility-Guided Planning over Multi-Specialized Locomotion Policies

Authors: Ying-Sheng Luo, Lu-Ching Wang, Hanjaya Mandala et al. | Submitted: 2026-02-08 | arXiv: 2602.07932 | Categories: cs.RO

Research Background: Legged robots benefit from specialized locomotion policies trained for specific terrain types, but composing multiple expert policies for planning over unstructured terrain is an open challenge. Traditional planners cannot handle skill-specific dynamics, while hierarchical RL loses interpretability.

Technical Approach: Each terrain-specific policy is paired with a Feasibility-Net that predicts a feasibility tensor from local elevation maps and task vectors. Classical planning algorithms use these tensors to derive optimal paths, enabling modular composition of specialized policies without retraining when new ones are added.
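To make the planning step concrete, here is a minimal sketch of a classical planner (Dijkstra) consuming per-policy feasibility maps. It rests on several assumptions not stated in the digest: feasibility is treated as a per-cell success probability on a 2D grid, entering a cell costs the negative log of the best policy's feasibility, and the policy names `walk` and `climb` and the threshold `min_feasible` are hypothetical.

```python
import heapq
import math

def plan(feasibility, start, goal, min_feasible=0.05):
    """Dijkstra over a 4-connected grid. feasibility[name][r][c] is the assumed
    chance that policy `name` can traverse cell (r, c). Entering a cell costs
    -log(best feasibility), so the cheapest path maximises the product of
    per-cell success estimates."""
    names = list(feasibility)
    rows = len(feasibility[names[0]])
    cols = len(feasibility[names[0]][0])

    def best(r, c):
        # Pick the policy with the highest predicted feasibility for this cell.
        name = max(names, key=lambda n: feasibility[n][r][c])
        return name, feasibility[name][r][c]

    dist, prev = {start: 0.0}, {}
    queue = [(0.0, start)]
    while queue:
        d, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            _, p = best(nr, nc)
            if p < min_feasible:
                continue  # no policy is trusted on this cell
            nd = d - math.log(p)
            if nd < dist.get((nr, nc), math.inf):
                dist[(nr, nc)] = nd
                prev[(nr, nc)] = cell
                heapq.heappush(queue, (nd, (nr, nc)))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return [(cell, best(*cell)[0]) for cell in path]

feasibility = {
    "walk":  [[0.9, 0.9, 0.9], [0.9, 0.0, 0.9], [0.9, 0.0, 0.9]],
    "climb": [[0.1, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1]],
}
route = plan(feasibility, start=(0, 1), goal=(2, 1))
print(route)
```

Adding a new specialized policy in this sketch only means adding another entry to `feasibility`; the planner itself is unchanged, which mirrors the modularity described above.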

Key Takeaway: The framework efficiently generates reliable plans across diverse challenging terrains in both simulation and real-world tests, while remaining interpretable and modular.

Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations

Authors: Ruiqian Nai, Boyuan Zheng, Junming Zhao et al. | Submitted: 2026-02-06 | arXiv: 2602.06643 | Categories: cs.RO, cs.AI, cs.LG

Research Background: Learning whole-body manipulation for humanoids currently relies on teleoperation (hardware-dependent) or visual sim-to-real RL (complex reward engineering). Both approaches limit the diversity and scalability of demonstrated skills, particularly in uncontrolled environments.

Technical Approach: HuMI uses portable hardware to capture rich whole-body human motion (robot-free), then feeds it into a hierarchical learning pipeline that translates human motions into feasible humanoid skills. The pipeline supports five task types: kneeling, squatting, tossing, walking, and bimanual manipulation.

Key Takeaway: HuMI achieves 3x data collection efficiency compared to teleoperation and 70% success in unseen environments, demonstrating scalable robot-free learning for diverse whole-body tasks.

Scalable and General Whole-Body Control for Cross-Humanoid Locomotion

Authors: Yufei Xue, YunFeng Lin, Wentao Dong et al. | Submitted: 2026-02-05 | arXiv: 2602.05791 | Categories: cs.RO

Research Background: Learning-based whole-body controllers for humanoids typically require robot-specific training, making them expensive to redeploy across different platforms. Cross-embodiment generalization in humanoid control remains largely unsolved.

Technical Approach: XHugWBC enables cross-embodiment humanoid control through three innovations: physics-consistent morphological randomization during training; semantically aligned observation and action spaces across diverse robots; and policy architectures that explicitly model morphological and dynamical properties to learn motion priors from a broad distribution of embodiments.
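As a rough illustration of what "physics-consistent" randomization can mean, the sketch below scales a link's geometry and updates its mass and inertia to match, rather than sampling them independently. The digest does not describe XHugWBC's actual scheme; the constant-density assumption (mass scales with the cube and rotational inertia with the fifth power of a uniform length scale), the `randomize_link` name, and the scale range are all illustrative choices.

```python
import random

def randomize_link(link, rng, scale_range=(0.8, 1.2)):
    """Sample one uniform length scale s and derive the other properties from it.
    Under constant density, mass ~ s**3 and rotational inertia ~ s**5, so the
    sampled link remains a physically plausible rigid body."""
    s = rng.uniform(*scale_range)
    return {
        "length": link["length"] * s,
        "mass": link["mass"] * s ** 3,
        "inertia": link["inertia"] * s ** 5,
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
base = {"length": 0.4, "mass": 3.0, "inertia": 0.04}  # hypothetical shin link
sampled = randomize_link(base, rng)
print(sampled)
```

Sampling mass and inertia independently of length would instead produce bodies that no real robot could have, which is the inconsistency this kind of coupling avoids.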

Key Takeaway: A single XHugWBC policy generalizes zero-shot to twelve simulated humanoids and seven real-world robots, demonstrating that cross-embodiment training can replace robot-specific controllers.

“Meet My Sidekick!”: Effects of Separate Identities and Control of a Single Robot in HRI

Authors: Drake Moore, Arushi Aggarwal, Emily Taylor et al. | Submitted: 2026-02-07 | arXiv: 2602.07598 | Categories: cs.RO

Research Background: A robot’s presented identity and capability directly shape human trust and collaboration. Unlike humans, a single physical robot can simultaneously embody multiple identities that control different subsystems, a largely unexplored design space in HRI.

Technical Approach: A mixed-design study exposed participants to one of three presentations of the same robot: single unified identity, co-embodiment (two agents with shared full control), or split-embodiment (two agents, each controlling a distinct domain: head or gripper). Three tasks probed motivational support, isolated failures, and collaborative failures.

Key Takeaway: Participants associated robot failures with specific identities and perceived the robot as residing in distinct control domains, suggesting that split-embodiment configurations can distribute trust and accountability across a single robot body.

Scalable Dexterous Robot Learning with AR-based Remote Human-Robot Interactions

Authors: Yicheng Yang, Ruijiao Li, Lifeng Wang et al. | Submitted: 2026-02-07 | arXiv: 2602.07341 | Categories: cs.LG, cs.RO

Research Background: Collecting expert demonstration data for dexterous robot arm-hand systems at scale is a bottleneck for robot learning. Traditional teleoperation setups require physical proximity and expensive hardware, limiting scalability.

Technical Approach: The framework uses AR-based remote human-robot interaction for scalable demonstration collection. A two-phase approach first pre-trains a behavior cloning (BC) policy from the collected data, then applies contrastive learning-empowered RL with an event-driven augmented reward to produce a more robust final policy.

Key Takeaway: The contrastive learning-augmented RL phase overcomes BC policy collapse and significantly improves manipulation success rates compared to standard PPO and SAC baselines.

Signal or ‘Noise’: Human Reactions to Robot Errors in the Wild

Authors: Maia Stiber, Sameer Khan, Russell Taylor et al. | Submitted: 2026-02-04 | arXiv: 2602.05010 | Categories: cs.RO, cs.HC

Research Background: Social signals have been shown to be useful for robot error management in controlled lab settings. Whether humans reliably produce actionable social signals when interacting with robots in uncontrolled, real-world environments, especially in group contexts with repeated errors, is unknown.

Technical Approach: The researchers built a coffee robot and conducted a public field deployment with 49 participants. They analyzed varied social signals (verbal, gestural, proxemic) produced in response to robot errors and other stimuli across individual and group interactions.

Key Takeaway: Real-world social signals are rich and informative but inherently “noisy,” with participants volunteering unsolicited information. This highlights both the promise and the challenge of leveraging social signals for real-world HRI error management.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Authors: Shenyuan Gao, William Liang, Kaiyuan Zheng et al. | Submitted: 2026-02-06 | arXiv: 2602.06949 | Categories: cs.RO, cs.AI, cs.CV, cs.LG

Research Background: Generalist robot world models capable of simulating action outcomes across diverse environments are critical for scalable robot learning, but training them is hindered by limited robot data coverage and scarce action labels in egocentric video datasets.

Technical Approach: DreamDojo pre-trains on 44k hours of egocentric human videos, the largest video dataset for world model pre-training to date. To handle the absence of action labels, continuous latent actions serve as unified proxy actions. A distillation pipeline accelerates inference to 10.81 FPS and improves temporal consistency. Post-training on small-scale robot data enables precise action controllability.

Key Takeaway: DreamDojo enables live teleoperation, policy evaluation, and model-based planning, and demonstrates strong generalization on out-of-distribution benchmarks for contact-rich tasks.
