arXiv Weekly Digest — Week 15, 2026

Fetched: 2026-04-06 | Categories: cs.RO, cs.LG, cs.HC, cs.CV | Papers: 53

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

Authors: Yuanchang Liang, Xiaobo Wang, Kai Wang et al. | Submitted: 2026-04-05 | arXiv: 2604.04161 | Categories: cs.RO

Research Background: Action chunking in VLA models trades responsiveness for smoothness — too large a chunk causes stale plans, too small causes jerky behavior. Selecting the right chunk size dynamically remains an unsolved problem.
研究背景： VLA 模型的動作分塊（action chunking）面臨兩難：分塊太大導致對新資訊反應遲鈍，太小則造成不連貫的跳動行為。如何動態選擇最佳分塊大小仍是未解問題。

Technical Approach: The proposed Adaptive Action Chunking (AAC) strategy uses action entropy as a real-time signal to determine chunk size during inference. When entropy is low (the model is confident), larger chunks are used; when entropy is high, smaller chunks trigger more frequent replanning. This avoids committing to a fixed empirical chunk size across all tasks.
技術方法： 提出的自適應動作分塊（AAC）策略以動作熵作為即時信號，在推理時動態決定分塊大小。熵低（模型置信度高）時使用較大分塊，熵高時使用較小分塊以更頻繁地重新規劃。此方法避免了跨任務固定分塊長度的做法。
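
The entropy-to-chunk-size rule can be illustrated with a minimal sketch. It assumes a discrete action distribution and a linear mapping from normalized entropy to chunk length; the function names, the `min_chunk`/`max_chunk` bounds, and the mapping itself are hypothetical and are not taken from the paper.

```python
import math

def action_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def chunk_size_from_entropy(probs, min_chunk=1, max_chunk=16):
    """Map normalized entropy to a chunk size: low entropy gives a long chunk.

    The linear mapping is an assumption made for illustration only.
    """
    if len(probs) < 2:
        return max_chunk
    # Divide by the maximum possible entropy so the value lies in [0, 1].
    normalized = action_entropy(probs) / math.log(len(probs))
    size = max_chunk - normalized * (max_chunk - min_chunk)
    return max(min_chunk, min(max_chunk, round(size)))

confident = chunk_size_from_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = chunk_size_from_entropy([0.25, 0.25, 0.25, 0.25])
```

With these toy inputs the confident distribution yields a long chunk and the uniform one collapses to the minimum, which forces replanning at every step.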

Key Takeaway: Using action entropy as a proxy for uncertainty enables VLA models to adaptively balance reactivity and consistency, substantially improving performance over fixed-chunk baselines.
核心發現： 以動作熵作為不確定性的代理指標，能讓 VLA 模型在反應性與一致性之間自適應取得平衡，顯著優於固定分塊的基準方法。

Learning Dexterous Grasping from Sparse Taxonomy Guidance

Authors: Juhan Park, Taerim Yoon, Seungmin Kim et al. | Submitted: 2026-04-05 | arXiv: 2604.04138 | Categories: cs.RO, cs.AI

Research Background: Dexterous robotic grasping requires planning grip configurations that match both object geometry and task intent. Dense pose or contact-point specifications for every object-task pair are impractical, and pure RL lacks user-controllable intervention.
研究背景： 靈巧機器人抓取需要規劃同時匹配物體幾何與任務意圖的抓握構型。為每個物體-任務組合指定密集姿態或接觸點目標不切實際，而純強化學習又缺乏使用者可介入的控制性。

Technical Approach: GRIT is a two-stage framework that first predicts a taxonomy-based grasp specification (sparse high-level category) from scene and task context, then trains a policy conditioned on this specification to generate continuous multi-finger motions. The taxonomy captures relationships between grasp types and object geometries, enabling generalization.
技術方法： GRIT 是兩階段框架，第一階段從場景與任務上下文預測基於分類學的抓握規格（稀疏高層類別），第二階段訓練以此規格為條件的策略來生成連續多指動作。分類學捕捉抓握類型與物體幾何之間的關係，從而實現泛化。

Key Takeaway: Sparse taxonomy guidance achieves 87.9% success rate across novel objects and enables user-controllable grasp strategy selection via high-level category choice.
核心發現： 稀疏分類學引導在新物體上達到 87.9% 的成功率，並允許使用者透過高層類別選擇來控制抓握策略。

VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models

Authors: Ravi Ranjan, Agoritsa Polyzou | Submitted: 2026-04-05 | arXiv: 2604.03956 | Categories: cs.CV, cs.AI

Research Background: Deploying VLA models raises safety concerns around removing unsafe, privacy-sensitive, or spurious behaviors. Unlike standalone vision or language models, undesirable knowledge in VLAs can be distributed across perception, alignment, and action-generation layers simultaneously.
研究背景： 部署 VLA 模型面臨安全挑戰，需要移除不安全、隱私敏感或虛假的行為。與獨立的視覺或語言模型不同，VLA 中的不良知識可能同時分佈於感知、對齊和動作生成層中。

Technical Approach: VLA-Forget combines ratio-aware selective editing for perception and cross-modal components with layer-selective unlearning for the reasoning/action transformer blocks. It jointly optimizes three objectives — targeted forgetting, perceptual preservation, and reasoning retention — via staged updates across the visual encoder, projector, and upper transformer layers.
技術方法： VLA-Forget 將針對感知與跨模態組件的比例感知選擇性編輯，與針對推理/動作 Transformer 塊的層選擇性遺忘學習相結合。透過視覺編碼器、投影器和上層 Transformer 的分階段更新，聯合優化三個目標：定向遺忘、感知保留和推理保留。

Key Takeaway: VLA-Forget improves forgetting efficacy by 10%, improves preservation of perceptual specificity by 22%, and reduces post-quantization behavior recovery by 55% relative to strong unlearning baselines.
核心發現： 相較於強基準方法，VLA-Forget 提升遺忘效力 10%、保留感知特異性 22%，並將量化後行為恢復降低 55%。

Build on Priors: Vision–Language–Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

Authors: Pierrick Lorang, Johannes Huemer, Timothy Duggan et al. | Submitted: 2026-04-04 | arXiv: 2604.03759 | Categories: cs.RO, cs.AI

Research Background: Long-horizon robot manipulation from few demonstrations remains challenging because existing neuro-symbolic approaches rely on hand-crafted symbolic abstractions or large annotated datasets, limiting scalability.
研究背景： 從少量示範中學習長時域機器人操作仍具挑戰性，因為現有的神經符號方法依賴手工設計的符號抽象或大量標注資料集，限制了可擴展性。

Technical Approach: The framework uses a VLM to automatically segment demonstrations into skills, classify skill transitions, and construct a state-transition graph without manual annotation. An Answer Set Programming solver synthesizes a PDDL planning domain from this graph, and policies are learned at the control-reference level with VLM-driven data augmentation projecting single demos onto new objects.
技術方法： 該框架使用 VLM 自動將示範分割成技能、分類技能轉換並構建狀態轉換圖，無需手工標注。答案集程式設計求解器從此圖合成 PDDL 規劃域，策略在控制參考層面學習，並利用 VLM 驅動的資料增強將單個示範投影到新物體上。

Key Takeaway: The unified pipeline achieves scalable, data-efficient, expert-free neuro-symbolic robot learning validated on a real industrial forklift and a Kinova Gen3 arm from as few as one demonstration.
核心發現： 統一流程實現了可擴展、資料高效、無需專家的神經符號機器人學習，在真實工業叉車和 Kinova Gen3 機械臂上以最少一個示範完成驗證。

A Multi-View 3D Telepresence System for XR Robot Teleoperation

Authors: Enes Ulas Dincer, Manuel Zaremski, Alexandra Nick et al. | Submitted: 2026-04-04 | arXiv: 2604.03730 | Categories: cs.RO

Research Background: Effective robot teleoperation requires accurate 3D spatial perception, but conventional 2D video interfaces fail to provide the depth cues needed for contact-rich manipulation tasks.
研究背景： 有效的機器人遠端操作需要準確的三維空間感知，但傳統二維視訊介面無法提供接觸豐富操作任務所需的深度線索。

Technical Approach: The system fuses geometry from three cameras to produce GPU-accelerated point-cloud rendering on standalone Meta Quest 3 VR hardware (~75k points in real time), augmented with a wrist-mounted RGB stream for high-resolution local detail. A 31-participant within-subject study compared this against RGB streams, stereo projection, and point cloud without RGB across three manipulation tasks.
技術方法： 系統融合三個相機的幾何資訊，在獨立式 Meta Quest 3 VR 硬體上產生 GPU 加速的點雲渲染（實時約 75k 個點），並輔以腕部 RGB 串流提供高解析度局部細節。31 名參與者的組內研究跨三個操作任務比較了此系統與 RGB 串流、立體投影及無 RGB 點雲的效果。

Key Takeaway: Combining global 3D point-cloud structure with localized high-resolution RGB detail achieves the best overall teleoperation performance across task success, completion time, workload, and usability.
核心發現： 將全域三維點雲結構與局部高解析度 RGB 細節相結合，在任務成功率、完成時間、工作負荷和可用性方面實現了最佳的整體遠端操作性能。

Human-Robot Copilot for Data-Efficient Imitation Learning

Authors: Rui Yan, Zaitian Gongye, Lars Paulsen et al. | Submitted: 2026-04-04 | arXiv: 2604.03613 | Categories: cs.RO

Research Background: Limited demonstrations cause imitation learning policies to enter out-of-distribution states due to compounding errors. HG-DAgger-style methods address this via human intervention, but existing approaches sacrifice either dexterity or generality across robot types.
研究背景： 示範數量不足導致模仿學習策略因複合誤差而進入分佈外狀態。HG-DAgger 式方法透過人工干預解決此問題，但現有方案在靈活性與跨機器人類型泛化性之間存在取捨。

Technical Approach: The Human-Robot Copilot framework introduces a scaling factor for dexterous teleoperation that maintains compatibility with a wide range of industrial and research manipulators. Human corrections are applied intermittently during policy execution, and the framework augments the demonstration dataset with targeted corrective interventions rather than full re-demonstrations.
技術方法： 人機副駕駛框架引入了靈巧遠端操作的縮放因子，同時保持與多種工業和研究型機械臂的相容性。在策略執行過程中間歇性地應用人工修正，框架以有針對性的糾正干預而非完整重新示範來擴充示範資料集。

Key Takeaway: The copilot approach achieves higher manipulation success with the same number of demonstrations while reducing overall data collection time through intermittent rather than continuous human intervention.
核心發現： 副駕駛方式在相同示範數量下實現更高的操作成功率，同時透過間歇性而非持續性的人工干預減少整體資料收集時間。

Belief Dynamics for Detecting Behavioral Shifts in Safe Collaborative Manipulation

Authors: Devashri Naik, Divake Kumar, Nastaran Darabi et al. | Submitted: 2026-04-04 | arXiv: 2604.04967 | Categories: cs.RO, cs.LG

Research Background: In shared workspaces, a collaborating agent may switch behavioral strategies mid-task, and detecting this switch is critical to prevent collisions. Reliable regime-change detection under realistic timing tolerances remains an open challenge.
研究背景： 在共享工作空間中，協作代理可能在任務中途切換行為策略，偵測此切換對於防止碰撞至關重要。在現實時間容差下可靠的制度切換偵測仍是未解挑戰。

Technical Approach: UA-TOM is a lightweight belief-tracking module that augments frozen VLA backbones using selective state-space dynamics, causal attention, and prediction-error signals. It tracks the internal hidden-state update magnitude as a proxy for behavioral regime change, adding only 7.4 ms inference overhead. Evaluation spans 10 detection methods, 5 seeds, and 1200 episodes.
技術方法： UA-TOM 是一個輕量級信念追蹤模組，使用選擇性狀態空間動態、因果注意力和預測誤差信號來增強凍結的 VLA 骨幹網路。它追蹤內部隱藏狀態更新幅度作為行為制度切換的代理指標，僅增加 7.4 毫秒的推理開銷。
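
The idea of using hidden-state update magnitude as a regime-change signal can be sketched as follows. This is a simplified illustration, not UA-TOM itself: it assumes hidden states are available as plain lists of floats and uses a fixed, hypothetical `threshold` instead of a learned belief tracker.

```python
import math

def update_magnitude(prev_hidden, hidden):
    """Euclidean norm of the change between two hidden-state vectors."""
    return math.sqrt(sum((h - p) ** 2 for p, h in zip(prev_hidden, hidden)))

def detect_shift(hidden_states, threshold):
    """Return the first step whose update magnitude exceeds threshold, else None.

    A fixed threshold is an assumption; the paper's detector is a learned module.
    """
    for step in range(1, len(hidden_states)):
        if update_magnitude(hidden_states[step - 1], hidden_states[step]) > threshold:
            return step
    return None

states = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [2.0, 2.0]]
shift_step = detect_shift(states, threshold=1.0)
```

Here the small drifts in the first three states stay under the threshold, and the jump at the last state is flagged as a candidate behavioral switch.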

Key Takeaway: UA-TOM achieves 85.7% detection rate at ±3-step tolerance while reducing close-range collision time to 4.8 steps, outperforming even an oracle baseline without modifying the base VLA policy.
核心發現： UA-TOM 在 ±3 步容差下達到 85.7% 的偵測率，同時將近距離碰撞時間縮短至 4.8 步，甚至優於 Oracle 基準，且無需修改基礎 VLA 策略。

CRAFT: Video Diffusion for Bimanual Robot Data Generation

Authors: Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme et al. | Submitted: 2026-04-04 | arXiv: 2604.03552 | Categories: cs.RO, cs.AI, cs.CV, cs.LG

Research Background: Bimanual manipulation learning is limited by the high cost and narrow visual diversity of real-world demonstrations, constraining policy robustness across viewpoints and object configurations.
研究背景： 雙臂操作學習受限於真實世界示範的高成本和有限視覺多樣性，制約了策略在不同視角和物體配置下的魯棒性。

Technical Approach: CRAFT (Canny-guided Robot Data Generation using Video Diffusion Transformers) conditions video diffusion on edge-based structural cues from simulator trajectories to synthesize physically plausible manipulation videos with action labels. It supports object pose changes, camera viewpoints, lighting variations, cross-embodiment transfer, and multi-view synthesis within a unified augmentation pipeline.
技術方法： CRAFT 以模擬器軌跡中提取的基於邊緣的結構線索為條件，利用視訊擴散模型合成具有動作標籤的物理上合理的操作視訊。它在統一增強管線中支援物體姿態變化、相機視角、光照變化、跨本體遷移和多視角合成。

Key Takeaway: Starting from only a few real demonstrations, CRAFT generates large-scale diverse photorealistic training data that improves bimanual manipulation success rates over existing augmentation strategies without requiring real-robot replay.
核心發現： 從少量真實示範出發，CRAFT 生成大規模多樣化的照片級真實訓練資料，在無需真實機器人重放的情況下提升雙臂操作成功率，優於現有增強策略。

Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

Authors: Yuxuan Gao, Yedong Shen, Shiqi Zhang et al. | Submitted: 2026-04-04 | arXiv: 2604.03540 | Categories: cs.RO

Research Background: Multi-step diffusion policies achieve strong manipulation performance but require tens to hundreds of network function evaluations per action, making them prohibitively slow for high-frequency closed-loop control and online RL.
研究背景： 多步擴散策略在操作任務上表現出色，但每個動作需要數十至數百次網路函數評估，使其對於高頻閉環控制和在線強化學習而言過於緩慢。

Technical Approach: The two-stage framework first trains a Drift-Based Policy (DBP) that internalizes iterative refinement into model parameters via fixed-point drifting objectives, yielding a native one-step generative backbone. Second, Drift-Based Policy Optimization (DBPO) equips the backbone with a stochastic interface for stable online RL fine-tuning without sacrificing the one-step property.
技術方法： 兩階段框架首先訓練漂移策略（DBP），透過固定點漂移目標將迭代細化內化到模型參數中，生成原生單步生成骨幹。其次，漂移策略優化（DBPO）為骨幹提供隨機介面，用於穩定的在線 RL 微調，同時不犧牲單步特性。

Key Takeaway: DBP achieves up to 100x faster inference than multi-step diffusion policies while matching or exceeding their performance, and enables real-world dual-arm control at 105.2 Hz.
核心發現： DBP 比多步擴散策略快達 100 倍的推理速度，同時達到或超越其性能，並在真實世界雙臂控制中實現 105.2 Hz 的頻率。

Optimizing Neurorobot Policy under Limited Demonstration Data through Preference Regret

Authors: Viet Dung Nguyen, Yuhang Song, Anh Nguyen et al. | Submitted: 2026-04-04 | arXiv: 2604.03523 | Categories: cs.RO, cs.AI, cs.CV, cs.LG

Research Background: Reinforcement learning from demonstrations typically assumes abundant expert data and i.i.d. distribution — both assumptions fail in real-world robotics where data is scarce and compounding errors accumulate at test time.
研究背景： 從示範中進行強化學習通常假設專家資料充足且符合 i.i.d. 分佈，而這兩個假設在資料稀缺且測試時複合誤差累積的真實機器人環境中均不成立。

Technical Approach: The MYOE (master your own expertise) framework introduces a queryable mixture-of-preferences state space model (QMoP-SSM) that estimates desired goals at each timestep. These estimated goals are used to compute a “preference regret” signal, which optimizes the control policy through self-imitation and enables learning from limited demonstrations without i.i.d. assumptions.
技術方法： MYOE（掌握自身專長）框架引入了可查詢偏好混合狀態空間模型（QMoP-SSM），在每個時間步估計期望目標。這些估計目標計算「偏好遺憾」信號，用於透過自我模仿優化控制策略，從有限示範中學習而無需 i.i.d. 假設。

Key Takeaway: The MYOE framework demonstrates improved robustness, adaptability, and out-of-sample performance compared to state-of-the-art RLfD approaches under limited demonstration conditions.
核心發現： 在有限示範條件下，MYOE 框架在魯棒性、適應性和樣本外性能方面優於最先進的 RLfD 方法。

Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking

Authors: Haotian Xiang, Qin Lu, Yaakov Bar-Shalom | Submitted: 2026-04-03 | arXiv: 2604.03404 | Categories: cs.RO, cs.LG

Research Background: Active multi-target tracking requires a robot to balance exploration for undetected targets with exploitation of uncertain tracked ones. Diffusion policies can capture diverse behavioral strategies but lack principled uncertainty-aware expert selection.
研究背景： 主動多目標追蹤要求機器人在探索未偵測目標與利用不確定已追蹤目標之間取得平衡。擴散策略能捕捉多樣化行為策略，但缺乏有原則的不確定性感知專家選擇機制。

Technical Approach: Expert selection is formulated as an offline contextual bandit problem, and a Bayesian multi-head Variational Bayesian Last Layer (VBLL) model predicts each expert strategy’s tracking performance with uncertainty estimates. A Lower Confidence Bound (LCB) criterion selects the expert with the best worst-case predicted performance, conditioning the diffusion policy for action generation.
技術方法： 專家選擇被建模為離線上下文賭博機問題，貝葉斯多頭變分貝葉斯最後一層（VBLL）模型預測每個專家策略的追蹤性能及不確定性估計。下置信界（LCB）標準選擇最佳最壞情況預測性能的專家，為擴散策略的動作生成提供條件。
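
The Lower Confidence Bound rule can be sketched in a few lines. The sketch assumes each expert already has a predicted mean performance and a predictive standard deviation (in the paper these come from the VBLL model); the `kappa` pessimism weight and the function name are hypothetical.

```python
def select_expert_lcb(means, stds, kappa=1.0):
    """Return the index of the expert with the highest lower confidence bound.

    Score = mean - kappa * std, where higher predicted performance is better.
    """
    scores = [m - kappa * s for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)

means = [0.80, 0.70, 0.90]  # predicted tracking performance per expert (toy values)
stds = [0.05, 0.01, 0.40]   # predictive uncertainty per expert (toy values)
pessimistic_choice = select_expert_lcb(means, stds, kappa=1.0)
greedy_choice = select_expert_lcb(means, stds, kappa=0.0)
```

With `kappa=0` the rule reduces to picking the highest mean, so the third expert wins despite its large uncertainty; with `kappa=1` the penalty pushes the choice to the first expert.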

Key Takeaway: Pessimistic uncertainty-aware expert selection outperforms both the base diffusion policy and standard mixture-of-experts gating in simulated indoor multi-target tracking scenarios.
核心發現： 悲觀的不確定性感知專家選擇在模擬室內多目標追蹤場景中優於基礎擴散策略和標準混合專家門控。

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Authors: Takuya Shiba | Submitted: 2026-04-03 | arXiv: 2604.03191 | Categories: cs.RO, cs.CV, cs.LG

Research Background: Scaling VLA models by upgrading vision encoders is expected to improve manipulation performance, as it does in vision-language models. However, this intuition breaks down under certain action representations — a gap that has not been systematically explained.
研究背景： 透過升級視覺編碼器來擴展 VLA 模型預計能改善操作性能，如同在視覺語言模型中一樣。然而，這種直覺在某些動作表示下失效，這一差距尚未被系統性地解釋。

Technical Approach: The paper introduces the “Compression Gap” principle: scaling behavior is governed by the tightest information bottleneck in the pipeline. When actions are continuous (Diffusion Policy), the vision encoder is the binding constraint; when actions are discretized via a fixed-capacity codebook (OAT), the codebook becomes the bottleneck. Three lines of evidence on LIBERO — factorial experiments, encoder quality gradients, and codebook size ablations — validate this hypothesis.
技術方法： 論文提出「壓縮差距」原則：擴展行為由流水線中最緊的資訊瓶頸決定。當動作是連續的（擴散策略）時，視覺編碼器是約束因素；當動作透過固定容量碼本（OAT）離散化時，碼本成為瓶頸。在 LIBERO 上的三條證據線——因子實驗、編碼器品質梯度和碼本大小消融——驗證了此假設。
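
A toy calculation can make the bottleneck argument concrete. It treats each pipeline stage as having a nominal capacity in bits and takes the minimum; the stage names, the capacity figures, and the `tokens_per_chunk` parameter are all hypothetical and are not measurements from the paper.

```python
import math

def codebook_bits(codebook_size, tokens_per_chunk):
    """Upper bound on bits carried by discrete tokens: tokens * log2(codebook size)."""
    return tokens_per_chunk * math.log2(codebook_size)

def binding_bottleneck(stage_capacities):
    """Return (stage name, capacity) for the stage with the smallest capacity."""
    return min(stage_capacities.items(), key=lambda item: item[1])

discrete = {"vision_encoder": 512.0, "codebook": codebook_bits(256, 8)}
upgraded = {"vision_encoder": 1024.0, "codebook": codebook_bits(256, 8)}
before = binding_bottleneck(discrete)
after = binding_bottleneck(upgraded)
```

In this toy setting, doubling the encoder capacity leaves the binding stage unchanged, which is the qualitative pattern the entry describes for codebook-based action representations.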

Key Takeaway: Encoder upgrades improve Diffusion Policy by over 21 percentage points but are substantially attenuated for OAT, revealing that identifying information bottlenecks is essential for effective Physical AI scaling.
核心發現： 編碼器升級將擴散策略提升超過 21 個百分點，但對 OAT 的效果大幅衰減，揭示了識別資訊瓶頸對有效擴展物理 AI 至關重要。

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

Authors: Peiyan Li, Yixiang Chen, Yuan Xu et al. | Submitted: 2026-04-03 | arXiv: 2604.03181 | Categories: cs.RO, cs.CV

Research Background: Most robotic manipulation policies rely on 2D observations and backbones pre-trained on static image-text pairs, failing to capture the 3D spatial structure and temporal dynamics needed for data-efficient, generalizable manipulation.
研究背景： 大多數機器人操作策略依賴二維觀測和在靜態圖像-文字對上預訓練的骨幹網路，無法捕捉資料高效、可泛化操作所需的三維空間結構和時間動態。

Technical Approach: MV-VDP (Multi-View Video Diffusion Policy) jointly predicts multi-view heatmap videos and RGB videos, aligning the representation format of video pre-training with action fine-tuning. The model simultaneously specifies what actions to take and how the environment is expected to evolve, using only 10 demonstration trajectories without additional pre-training.
技術方法： MV-VDP（多視角視訊擴散策略）聯合預測多視角熱圖視訊和 RGB 視訊，使視訊預訓練的表示格式與動作微調對齊。模型同時指定要執行的動作和環境預期如何演化，僅使用 10 個示範軌跡，無需額外預訓練。

Key Takeaway: With only 10 demonstrations, MV-VDP outperforms video-prediction, 3D-based, and VLA models on Meta-World and real-world platforms, establishing a new state of the art in data-efficient multi-task manipulation.
核心發現： 僅用 10 個示範，MV-VDP 在 Meta-World 和真實平台上優於基於視訊預測、三維和 VLA 的模型，在資料高效多任務操作上建立了新的最先進水準。

Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control

Authors: Siwei Ju, Jan Tauberschmidt, Oleg Arenz et al. | Submitted: 2026-04-03 | arXiv: 2604.03023 | Categories: cs.RO

Research Background: Combining RL’s ability to discover high-performing strategies with imitation learning’s alignment to human behavior is a fundamental robotics challenge — RL diverges from desired behavior while IL is limited by demonstration quality.
研究背景： 將強化學習發現高性能策略的能力與模仿學習對人類行為的對齊相結合是機器人學的基本挑戰——RL 偏離期望行為，而 IL 受示範品質限制。

Technical Approach: The framework introduces a receding-horizon predictive mechanism that models short-term future trajectories and provides look-ahead rewards during training. The policy is conditioned on reference trajectories so that it represents a distribution of expert-consistent behaviors rather than a single target. The approach is evaluated in a high-fidelity race car simulation using professional driver data.
技術方法： 框架引入後退視野預測機制，在訓練中對短期未來軌跡建模並提供前瞻獎勵。策略以參考軌跡為條件，表示一個專家一致行為分佈而非單一目標，使用職業賽車手資料在高保真賽車模擬中評估。
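
One plausible reading of a look-ahead reward is a discounted sum over a short predicted window. The sketch below assumes exactly that; the discounting scheme, the `horizon` and `gamma` parameters, and the function name are hypothetical, since the digest does not state how the paper defines the reward.

```python
def lookahead_reward(predicted_rewards, horizon, gamma=0.99):
    """Discounted sum of predicted rewards over the next `horizon` steps.

    Only the first `horizon` entries are used, so the credit window recedes
    as the policy advances along the trajectory.
    """
    window = predicted_rewards[:horizon]
    return sum((gamma ** k) * r for k, r in enumerate(window))

short_view = lookahead_reward([1.0, 1.0, 1.0, 1.0], horizon=2, gamma=0.5)
long_view = lookahead_reward([1.0, 1.0, 1.0, 1.0], horizon=4, gamma=0.5)
```

A longer horizon credits more of the predicted future to the current action, at the cost of relying on predictions further from the present state.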

Key Takeaway: The approach achieves competitive lap times while maintaining close alignment with expert driving behavior, with human-grounded evaluation confirming that policies reproduce setup-dependent driving characteristics consistent with professional driver feedback.
核心發現： 該方法實現具有競爭力的圈速，同時保持與專家駕駛行為的緊密對齊，人機迴路評估確認策略能夠重現與職業賽車手反饋一致的設置依賴型駕駛特性。

A Flow Matching Framework for Soft-Robot Inverse Dynamics

Authors: Hang Yang, Fangju Yang, Yangming Zhang et al. | Submitted: 2026-04-03 | arXiv: 2604.03006 | Categories: cs.RO

Research Background: Soft continuum robots have complex nonlinear actuation dynamics that make inverse dynamics learning difficult. Conventional feedback controllers suffer from chattering, while deterministic regression methods fail to capture the multimodal nonlinear mappings.
研究背景： 軟性連續體機器人具有複雜的非線性驅動動態，使逆動力學學習困難。傳統反饋控制器存在抖動問題，而確定性回歸方法無法捕捉多模態非線性映射。

Technical Approach: Inverse dynamics is reformulated as a conditional flow-matching problem using Rectified Flow (RF) to generate physically consistent control inputs rather than conditional averages. Two variants enhance physical consistency: RF-Physical uses a physics-based prior for residual modeling, and RF-FWD integrates a forward-dynamics consistency loss during flow matching.
技術方法： 逆動力學被重新建模為條件流匹配問題，使用整流流（RF）生成物理上一致的控制輸入而非條件均值。兩種變體增強物理一致性：RF-Physical 使用基於物理的先驗進行殘差建模，RF-FWD 在流匹配期間整合前向動力學一致性損失。
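
The core of Rectified Flow is a straight-line interpolation between a noise sample and a data sample, with the constant displacement as the regression target for the velocity network. The sketch below shows that construction and a simple Euler sampler. The `velocity_fn` callable stands in for a trained conditional network and is hypothetical, as are the toy inputs; the physics prior and forward-dynamics loss of the two variants are not modeled.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network: the constant displacement x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, condition, steps=1):
    """Integrate dx/dt = velocity_fn(x, t, condition) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, condition)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 0.0]
desired_control = [1.0, -2.0]
# Oracle velocity field used only for the demonstration; a real model is learned.
oracle = lambda x, t, cond: target_velocity(noise, cond)
sample = euler_sample(oracle, noise, desired_control, steps=4)
```

Because the ideal paths are straight, a well-trained model can in principle be sampled with very few Euler steps, which is consistent with the fast inference the entry reports.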

Key Takeaway: The flow-matching framework reduces trajectory tracking RMSE by over 50% compared to MLP/LSTM/Transformer baselines while maintaining sub-millisecond inference at 1.14 m/s peak end-effector velocity.
核心發現： 流匹配框架與 MLP/LSTM/Transformer 基準相比將軌跡追蹤 RMSE 降低超過 50%，同時在 1.14 m/s 峰值末端執行器速度下保持亞毫秒推理延遲。

Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA

Authors: Zihua Wang, Zhitao Lin, Ruibo Li et al. | Submitted: 2026-04-03 | arXiv: 2604.02965 | Categories: cs.RO, cs.CL

Research Background: VLA models incur high inference cost, and action chunking reduces computation but creates open-loop execution that is fragile to environmental changes and prone to error accumulation.
研究背景： VLA 模型推理成本高，動作分塊雖然減少了計算量，但產生了對環境變化脆弱且容易誤差累積的開環執行。

Technical Approach: SV-VLA uses a heavy VLA as a low-frequency macro-planner that generates action chunks with planning context, while a lightweight verifier continuously monitors execution against a closed-loop reference action. The verifier triggers replanning only when the planned action diverges from the reference, combining efficiency of chunked prediction with robustness of closed-loop control.
技術方法： SV-VLA 使用重型 VLA 作為低頻宏規劃器，生成帶有規劃上下文的動作塊，同時輕量級驗證器持續監控執行過程並與閉環參考動作對比。驗證器僅在計劃動作偏離參考時觸發重新規劃，結合了分塊預測的效率與閉環控制的魯棒性。
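
The plan-then-verify loop can be sketched with scalar actions. Both callables are stand-ins: `plan_chunk` for the heavy VLA planner and `reference_action` for the lightweight verifier. Their signatures, the absolute-difference check, and the `tolerance` parameter are assumptions made for illustration.

```python
def run_with_verification(plan_chunk, reference_action, observations, tolerance):
    """Execute chunked plans, calling the planner only when the verifier disagrees.

    plan_chunk(obs) -> non-empty list of scalar actions (hypothetical planner)
    reference_action(obs) -> scalar action (hypothetical closed-loop reference)
    Returns the executed actions and the number of planner calls.
    """
    executed, planner_calls, chunk = [], 0, []
    for obs in observations:
        if chunk and abs(chunk[0] - reference_action(obs)) > tolerance:
            chunk = []  # planned action diverged from the reference: drop the rest
        if not chunk:
            chunk = list(plan_chunk(obs))
            planner_calls += 1
        executed.append(chunk.pop(0))
    return executed, planner_calls

# Toy environment: the "right" action equals the observation, which jumps at step 3.
executed, planner_calls = run_with_verification(
    plan_chunk=lambda obs: [obs] * 4,
    reference_action=lambda obs: obs,
    observations=[0.0, 0.0, 0.0, 1.0, 1.0],
    tolerance=0.5,
)
```

In this run the planner is called twice for five control steps: once at the start and once when the observation jumps and the remaining planned action no longer matches the reference.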

Key Takeaway: SV-VLA achieves efficient and reliable VLA-based control in dynamic environments by decoupling the planning frequency from the verification frequency.
核心發現： SV-VLA 透過將規劃頻率與驗證頻率解耦，在動態環境中實現高效可靠的 VLA 控制。

MFE: A Multimodal Hand Exoskeleton with Interactive Force, Pressure and Thermo-haptic Feedback

Authors: Ziyuan Tang, Yitian Guo, Chenxi Xiao | Submitted: 2026-04-03 | arXiv: 2604.02820 | Categories: cs.RO

Research Background: Haptic devices for VR and robot teleoperation typically provide unimodal feedback, leaving a gap in delivering rich multimodal sensory information — force, pressure, and thermal sensations — to users.
研究背景： 用於 VR 和機器人遠端操作的觸覺設備通常提供單模態反饋，在向使用者傳遞豐富的多模態感官資訊（力、壓力和熱感覺）方面存在差距。

Technical Approach: The Multimodal Feedback Exoskeleton (MFE) features 20 DoF for hand pose capture, active force feedback generating 3.5–8.1 N at fingers, electro-osmotic flat actuators providing up to 2.47 kPa pressure and vibration at fingertips, and thermoelectric heat pumps rendering 10–55°C temperatures. Integration with X-Arm 6 and Inspire Hand validates the system.
技術方法： 多模態反饋外骨骼（MFE）具有 20 自由度的手部姿態捕捉、在手指處產生 3.5-8.1 N 的主動力反饋、在指尖提供高達 2.47 kPa 壓力和振動的電滲透平面致動器，以及渲染 10-55°C 溫度的熱電熱泵。與 X-Arm 6 和 Inspire Hand 的整合驗證了系統。

Key Takeaway: The MFE enables users to recognize and manipulate deformable objects and differentiate remote objects by temperature in robotic teleoperation, demonstrating enhanced situational awareness through multimodal feedback.
核心發現： MFE 使使用者能夠在機器人遠端操作中識別並操縱可變形物體，以及透過溫度區分遠端物體，展示了多模態反饋帶來的增強態勢感知能力。

Elastomeric Strain Limitation for Design of Soft Pneumatic Actuators

Authors: Gregory M. Campbell | Submitted: 2026-04-03 | arXiv: 2604.02609 | Categories: cs.RO

Research Background: Soft pneumatic actuators (SPAs) can safely exert forces on humans but lack reliable methods for controlling their inflation trajectory and force generation. Electroadhesive strain limiters offer a promising approach for variable shape control.
研究背景： 軟性氣動執行器（SPA）能安全地對人施力，但缺乏可靠的方法來控制其充氣軌跡和力的產生。電附著應變限制器為可變形狀控制提供了有前景的方法。

Technical Approach: The thesis investigates electroadhesive (EA) clutches attached to concentrically strain-limited elastomeric membranes for variable shape generation. A pressure-trajectory model is validated through active learning and automated testing, and an ensemble of neural networks enables inverse membrane design for specifying quasi-static lift trajectories from pressure sweeps.
技術方法： 論文研究附著於同心應變限制彈性膜上的電附著（EA）離合器以實現可變形狀生成。透過主動學習和自動化測試驗證壓力-軌跡模型，神經網路集成實現逆膜設計，用於從壓力掃描中指定準靜態提升軌跡。

Key Takeaway: EA clutch-based strain limiters enable real-time variable trajectory inflation, providing a principled approach to designing human-safe soft actuators with predictable and controllable force-trajectory behavior.
核心發現： 基於 EA 離合器的應變限制器實現了實時可變軌跡充氣，為設計具有可預測和可控力-軌跡行為的人體安全軟性執行器提供了原則性方法。

Tune to Learn: How Controller Gains Shape Robot Policy Learning

Authors: Antonia Bronars, Younghyo Park, Pulkit Agrawal | Submitted: 2026-04-02 | arXiv: 2604.02523 | Categories: cs.RO

Research Background: Controller gains are a critical but understudied design decision in robot learning. The conventional wisdom of selecting gains based on task compliance breaks down when paired with state-conditioned policies, since effective stiffness emerges from the policy-control interplay.
研究背景： 控制器增益是機器人學習中一個關鍵但研究不足的設計決策。基於任務順應性選擇增益的傳統思維在與狀態條件策略配合時失效，因為有效剛度從策略-控制的相互作用中湧現。

Technical Approach: The paper systematically investigates how position controller gains affect behavior cloning, RL from scratch, and sim-to-real transfer through extensive experiments across multiple tasks and robot embodiments. Gain regimes are varied along stiffness and damping axes to identify learning-paradigm-specific optimal configurations.
技術方法： 論文透過跨多個任務和機器人本體的大量實驗，系統性地研究位置控制器增益如何影響行為克隆、從頭強化學習和模擬到真實遷移。沿剛度和阻尼軸改變增益制度，以識別特定學習範式的最優配置。
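
For readers less familiar with the stiffness and damping axes, a single-degree-of-freedom PD position controller shows how the two gains interact. The sketch assumes a point mass, so the closed loop is the textbook second-order system m·x'' + kd·x' + kp·x = 0. The function names and the numeric gains are illustrative, not values from the paper.

```python
import math

def pd_force(kp, kd, target, position, velocity):
    """PD position control law: stiffness term minus damping term."""
    return kp * (target - position) - kd * velocity

def damping_ratio(kp, kd, mass):
    """Damping ratio zeta = kd / (2 * sqrt(kp * mass)) of the closed-loop system."""
    return kd / (2.0 * math.sqrt(kp * mass))

def classify_gains(kp, kd, mass):
    """Label a gain pair as underdamped, critically damped, or overdamped."""
    zeta = damping_ratio(kp, kd, mass)
    if zeta > 1.0:
        return "overdamped"
    if zeta < 1.0:
        return "underdamped"
    return "critically damped"

stiff_overdamped = classify_gains(kp=100.0, kd=40.0, mass=1.0)
compliant_underdamped = classify_gains(kp=4.0, kd=1.0, mass=1.0)
```

The same damping gain can be overdamped at low stiffness and underdamped at high stiffness, which is why the two axes need to be varied jointly.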

Key Takeaway: Optimal gain selection depends on the learning paradigm: behavior cloning benefits from compliant/overdamped gains, RL can succeed with any gains given compatible hyperparameters, and sim-to-real transfer is harmed by stiff/overdamped gains.
核心發現： 最優增益選擇取決於學習範式：行為克隆受益於順應性/過阻尼增益，RL 在相容超參數下可以在任何增益下成功，而模擬到真實遷移受剛性/過阻尼增益損害。

UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

Authors: Qiyao Zhang, Shuhua Zheng, Jianli Sun et al. | Submitted: 2026-04-02 | arXiv: 2604.02241 | Categories: cs.CV, cs.RO

Research Background: Embodied visual tracking for UAVs in dynamic urban environments with complex semantic requirements demands cross-modal fusion and continuous action generation beyond what traditional tracking methods provide.
研究背景： 無人機在具有複雜語義要求的動態城市環境中的具身視覺追蹤，需要超越傳統追蹤方法的跨模態融合和連續動作生成能力。

Technical Approach: UAV-Track VLA builds on the π0.5 architecture, introducing a temporal compression network to efficiently capture inter-frame dynamics and a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow-matching action expert for decoupled cross-modal feature generation. A dedicated benchmark and dataset with 890K+ frames, 176 tasks, and 85 diverse objects is also released.
技術方法： UAV-Track VLA 基於 π0.5 架構，引入時序壓縮網路以高效捕捉幀間動態，以及包含空間感知輔助定位頭和流匹配動作專家的並行雙分支解碼器，用於解耦跨模態特徵生成。同時發布包含 890K+ 幀、176 個任務和 85 個多樣物體的專用基準和資料集。

Key Takeaway: UAV-Track VLA achieves 61.76% success rate in challenging long-distance pedestrian tracking and reduces inference latency by 33.4% compared to baseline π0.5, enabling real-time UAV control.
核心發現： UAV-Track VLA 在挑戰性長距離行人追蹤中達到 61.76% 成功率，相比基準 π0.5 將推理延遲降低 33.4%，實現即時無人機控制。

Authors: Anirvan Dutta, Simone Tasciotti, Claudia Cusseddu et al. | Submitted: 2026-04-02 | arXiv: 2604.02108 | Categories: cs.RO, cs.LG

Research Background: Estimating physical object properties (stiffness, inertia, contact dynamics) is critical for safe robotic manipulation. Vision and touch provide complementary information, but existing frameworks have limited ability to reason about how uncertainty evolves over time during sustained contact.
研究背景： 估計物理物體屬性（剛度、慣性、接觸動態）對於安全的機器人操作至關重要。視覺和觸覺提供互補資訊，但現有框架在持續接觸過程中對不確定性如何隨時間演化的推理能力有限。

Technical Approach: The Cross-Modal Latent Filter (CMLF) learns a structured causal latent state-space of physical object properties. It supports bidirectional cross-modal prior transfer between vision and touch, integrating sensory evidence through a Bayesian inference process that evolves over time, inspired by human multi-sensory active inference.
技術方法： 跨模態潛在濾波器（CMLF）學習物理物體屬性的結構化因果潛在狀態空間。它支援視覺和觸覺之間的雙向跨模態先驗遷移，透過隨時間演化的貝葉斯推理過程整合感官證據，靈感來自人類多感官主動推理。
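
The Bayesian integration of a visual prior with tactile evidence can be illustrated with a scalar Gaussian update. This is a textbook Kalman-style fusion, not the learned latent filter in CMLF; the stiffness values and variances below are hypothetical.

```python
def fuse_gaussian(prior_mean, prior_var, obs_mean, obs_var):
    """Posterior mean and variance of a Gaussian prior combined with a Gaussian observation."""
    gain = prior_var / (prior_var + obs_var)
    mean = prior_mean + gain * (obs_mean - prior_mean)
    var = (1.0 - gain) * prior_var
    return mean, var

# A vision-based guess of object stiffness acts as the prior (toy numbers).
mean, var = 0.0, 1.0
# Successive touch readings refine the estimate while contact is sustained.
for touch_reading in [2.0, 2.0, 2.0]:
    mean, var = fuse_gaussian(mean, var, touch_reading, obs_var=1.0)
```

Each touch reading pulls the estimate toward the tactile evidence and shrinks the variance, so the uncertainty evolves over the contact rather than being fixed after a single fusion step.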

Key Takeaway: CMLF improves robustness of physical property estimation under uncertainty and exhibits human-like perceptual coupling phenomena including cross-modal illusions, suggesting alignment with human sensory integration principles.
核心發現： CMLF 在不確定性下提高了物理屬性估計的魯棒性，並表現出類人的感知耦合現象（包括跨模態錯覺），表明與人類感官整合原則的一致性。

Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning

Authors: Yuhui Chen, Haoran Li, Zhennan Jiang et al. | Submitted: 2026-04-02 | arXiv: 2604.01860 | Categories: cs.RO

Research Background: Fine-tuning expressive generative robot policies via RL remains challenging due to training instability and sample inefficiency, particularly for temporal action chunk policies with multimodal distributions.
研究背景： 透過強化學習微調表達性生成機器人策略仍因訓練不穩定性和樣本低效而具有挑戰性，特別是對於具有多模態分佈的時域動作塊策略。

Technical Approach: POCO (Posterior Optimization with Clipped Objective) formulates policy improvement as a posterior inference problem via Expectation-Maximization, distilling a reward-weighted implicit posterior into the policy without likelihood estimation. It uses an offline-to-online paradigm that anchors online exploration to pre-trained priors, and it is model-agnostic, so it can scale to large VLA models.
技術方法： POCO（帶裁剪目標的後驗優化）透過期望最大化將策略改進建模為後驗推理問題，將獎勵加權隱式後驗提煉到策略中而無需似然估計。它使用離線到在線範式，將在線探索錨定到預訓練先驗，並且與模型無關，可擴展到大型 VLA 模型。
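
A generic reward-weighted E-step gives a feel for how a posterior over sampled actions can be formed without likelihoods. The sketch below is not POCO's actual objective: the exponential weighting, the `temperature` parameter, and the `clip_max` cap on per-sample weights are assumptions chosen for illustration.

```python
import math

def posterior_weights(rewards, temperature=1.0, clip_max=5.0):
    """Reward-weighted sample weights, clipped and renormalized to sum to 1.

    Subtracting the maximum reward keeps the exponentials numerically stable.
    """
    top = max(rewards)
    raw = [math.exp((r - top) / temperature) for r in rewards]
    mean_raw = sum(raw) / len(raw)
    # Cap each sample's weight relative to the batch mean to limit its influence.
    clipped = [min(w / mean_raw, clip_max) for w in raw]
    total = sum(clipped)
    return [w / total for w in clipped]

weights = posterior_weights([0.0, 1.0, 3.0], temperature=1.0)
```

In an M-step, weights like these would scale a supervised loss on the sampled action chunks, so higher-reward samples pull the policy harder while the cap prevents any single sample from dominating.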

Key Takeaway: POCO prevents catastrophic policy collapse, outperforms state-of-the-art RL baselines across 7 simulation benchmarks and 4 real-world tasks, and achieves a 96.7% success rate on real-world contact-rich manipulation.
核心發現： POCO 防止了災難性策略崩潰，在 7 個模擬基準和 4 個真實世界任務中優於最先進的 RL 基準，在真實世界接觸豐富操作上達到 96.7% 成功率。

VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

Authors: Junyi Zong, Qingxuan Jia, Meixian Shi et al. | Submitted: 2026-04-02 | arXiv: 2604.03322 | Categories: cs.CV, cs.AI, cs.RO

Research Background: Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, but vision-only methods fail under occlusion and reflection. Combining vision, tactile sensing, and language for industrial inspection remains underexplored.
研究背景： 智慧製造中的品質檢測需要識別超越可見幾何的內在材料和表面屬性，但純視覺方法在遮擋和反射下失效。結合視覺、觸覺感應和語言進行工業檢測仍未被充分探索。

Technical Approach: VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features compressed into prefix tokens for a large language model. Contrastive learning aligns each modality with text and explicitly couples vision and touch. The VitaSet dataset contains 186 objects, 52k images, and 5.1k verified instruction-answer pairs.
技術方法： VitaTouch 使用模態特定編碼器和雙 Q-Former 提取語言相關的視覺和觸覺特徵，壓縮成前綴 token 供大型語言模型使用。對比學習將每種模態與文本對齊，並明確耦合視覺和觸覺。VitaSet 資料集包含 186 個物體、52k 張圖像和 5.1k 對已驗證的指令-回答對。

Key Takeaway: With LoRA fine-tuning, VitaTouch achieves 100%/96%/92% accuracy for 2/3/5-category defect recognition and 94% end-to-end sorting success in 100 robotic trials.
核心發現： 透過 LoRA 微調，VitaTouch 在 2/3/5 類別缺陷識別中達到 100%/96%/92% 準確率，並在 100 次機器人試驗中實現 94% 的端到端分揀成功率。

DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

Authors: Yang Zhou, Xiaofeng Wang, Hao Shao et al. | Submitted: 2026-04-02 | arXiv: 2604.01765 | Categories: cs.CV, cs.AI, cs.RO

Research Background: World-action models (WAMs) aim to unify VLA reasoning with spatio-temporal world modeling, but existing approaches focus on 2D appearance or latent representations with limited geometric grounding — a critical gap for embodied systems in the physical world.
研究背景： 世界動作模型（WAM）旨在將 VLA 推理與時空世界建模統一，但現有方法側重於幾何基礎有限的二維外觀或潛在表示——對於在物理世界中運行的具身系統而言是一個關鍵差距。

Technical Approach: DriveDreamer-Policy integrates depth generation, future video generation, and motion planning in a single modular architecture using a large language model for multi-modal instruction processing, followed by three lightweight generators. Learning geometry-aware world representations guides both future prediction and planning within a unified framework.
技術方法： DriveDreamer-Policy 在單一模組化架構中整合深度生成、未來視訊生成和運動規劃，使用大型語言模型進行多模態指令處理，隨後是三個輕量級生成器。學習幾何感知世界表示在統一框架內指導未來預測和規劃。

Key Takeaway: DriveDreamer-Policy reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality depth and video predictions.
核心發現： DriveDreamer-Policy 在 Navsim v1 上達到 89.2 PDMS，在 Navsim v2 上達到 88.7 EPDMS，在產生更高品質深度和視訊預測的同時優於現有基於世界模型的方法。

Authors: Yukai Ma, Honglin He, Selina Song et al. | Submitted: 2026-04-02 | arXiv: 2604.01659 Categories: cs.RO

Research Background: Long-horizon urban navigation relies on continuous human operation, causing fatigue and safety concerns. Shared autonomy where human and AI collaborate is promising, but existing methods require both to operate in the same action space, creating high cognitive overhead.
研究背景： 長時域城市導航依賴持續的人工操作，導致疲勞和安全問題。人機協作的共享自主很有前景，但現有方法要求雙方在相同動作空間中操作，產生高認知負荷。

Technical Approach: AURA decomposes urban navigation into high-level human instruction and low-level AI control through a Spatial-Aware Instruction Encoder that aligns various human instructions with visual and spatial context. The MM-CoS dataset comprising teleoperation and vision-language descriptions facilitates training, with online adaptation support.
技術方法： AURA 透過空間感知指令編碼器將城市導航分解為高層人工指令和低層 AI 控制，使各種人工指令與視覺和空間上下文對齊。包含遠端操作和視覺語言描述的 MM-CoS 資料集促進訓練，並支援在線適應。

Key Takeaway: AURA reduces takeover frequency by over 44% compared to similar conditions while effectively following human instructions and improving navigation stability in both simulation and real-world experiments.
核心發現： 與相似條件相比，AURA 將接管頻率降低超過 44%，同時在模擬和真實世界實驗中有效跟隨人工指令並提高導航穩定性。

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

Authors: Haochen Niu, Kanyu Zhang, Shuyu Yin et al. | Submitted: 2026-04-02 | arXiv: 2604.01570 Categories: cs.RO

Research Background: In physical manipulation, states admit a neighborhood of near-equivalent valid actions rather than a single correct action. VLA training inherited from linguistic settings ignores this feasible action neighborhood (FAN) property, leading to poor generalization and low sample efficiency.
研究背景： 在物理操作中，狀態允許一個近似等效的有效動作鄰域，而非單一正確動作。從語言設置繼承的 VLA 訓練忽略了這種可行動作鄰域（FAN）屬性，導致泛化能力差和樣本效率低。

Technical Approach: A FAN-guided regularizer shapes the model’s output distribution to align with the geometry of the feasible action neighborhood by introducing a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. The method integrates into both reinforced fine-tuning (RFT) and supervised fine-tuning (SFT) pipelines.
技術方法： FAN 引導正則化器透過引入高斯先驗，使模型輸出分佈與可行動作鄰域的幾何形狀對齊，促進圍繞首選方向和幅度的局部平滑和單模態預測。該方法整合到強化微調（RFT）和監督微調（SFT）流水線中。
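
One way to picture a Gaussian prior over a neighborhood of acceptable actions is sketched below for a discretized action head. This is not the paper's formulation: the binning, the KL direction, and the names `gaussian_prior`, `fan_regularized_loss`, `sigma`, and `weight` are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_prior(num_bins, center, sigma):
    """Discretized Gaussian over action bins, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((k - center) / sigma) ** 2) for k in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) between two discrete distributions."""
    return sum(q_i * math.log((q_i + eps) / (p_i + eps)) for q_i, p_i in zip(q, p))


def fan_regularized_loss(logits, target_bin, sigma=1.0, weight=0.1):
    """Cross-entropy on the demonstrated bin plus a Gaussian-prior penalty (assumed form)."""
    p = softmax(logits)
    cross_entropy = -math.log(p[target_bin] + 1e-12)
    prior = gaussian_prior(len(logits), target_bin, sigma)
    return cross_entropy + weight * kl_divergence(prior, p)
```

Under this assumed form, a prediction that spreads mass smoothly around the demonstrated bin is penalized less than one that splits mass between distant bins.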

Key Takeaway: FAN-guided regularization achieves significant improvements in sample efficiency and success rates in both in-distribution and out-of-distribution manipulation scenarios by aligning with the intrinsic action tolerance of physical manipulation.
核心發現： FAN 引導正則化透過與物理操作的內在動作容差對齊，在分佈內和分佈外操作場景中顯著提升樣本效率和成功率。

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Authors: Jiawei Chen, Simin Huang, Jiawei Du et al. | Submitted: 2026-04-02 | arXiv: 2604.01618 Categories: cs.CV, cs.AI

Research Background: VLA models face adversarial robustness challenges from physically realizable attacks. Adversarial 3D textures on manipulated objects represent a more dangerous threat than language or 2D visual attacks since they are naturally present in physical deployments.
研究背景： VLA 模型面臨來自物理上可實現攻擊的對抗魯棒性挑戰。被操作物體上的對抗三維紋理比語言或二維視覺攻擊代表更危險的威脅，因為它們在物理部署中自然存在。

Technical Approach: Tex3D introduces Foreground-Background Decoupling (FBD) to enable differentiable texture optimization through dual-renderer alignment, plus Trajectory-Aware Adversarial Optimization (TAAO) that prioritizes behaviorally critical frames and uses vertex-based parameterization for stable optimization across long-horizon tasks and diverse viewpoints.
技術方法： Tex3D 引入前景-背景解耦（FBD）透過雙渲染器對齊實現可微紋理優化，以及軌跡感知對抗優化（TAAO），優先考慮行為關鍵幀，並使用基於頂點的參數化在長時域任務和多樣視角中穩定優化。

Key Takeaway: Tex3D achieves task failure rates up to 96.7% in simulation and real-robot settings, exposing critical VLA vulnerabilities to physically grounded 3D adversarial attacks and motivating robustness-aware training.
核心發現： Tex3D 在模擬和真實機器人設置中達到高達 96.7% 的任務失敗率，揭示了 VLA 對物理基礎三維對抗攻擊的關鍵漏洞，激勵了魯棒性感知訓練。

A soft and lightweight fabric-based pneumatic interface for multimodal fingertip tactile feedback

Authors: Rui Chen, Daniele Leonardis, Antonio Frisoli | Submitted: 2026-04-01 | arXiv: 2604.01390 Categories: cs.RO

Research Background: Wearable fingertip haptic devices for VR and teleoperation struggle to simultaneously achieve adequate tactile output, low mass, simple fabrication, and untethered portability — limiting their practical deployment.
研究背景： 用於 VR 和遠端操作的可穿戴指尖觸覺設備難以同時實現足夠的觸覺輸出、低重量、簡單製造和無線可攜帶性，限制了其實際部署。

Technical Approach: The device uses four pneumatic chambers fabricated from thermoplastic polyurethane-coated fabric via CNC heat-sealing, weighing 2.1 g and operating untethered with a wrist-mounted control unit. A 15-participant psychophysical study evaluates classification of three tactile modes: contact configuration, directional sliding, and vibrotactile frequency.
技術方法： 該設備使用透過 CNC 熱封從熱塑性聚氨酯塗層織物製造的四個氣動腔體，重量 2.1 g，配備腕部安裝控制單元無線運行。一項 15 名參與者的心理物理研究評估了三種觸覺模式的分類：接觸配置、方向滑動和振動觸覺頻率。

Key Takeaway: Fabric-based pneumatic actuation achieves over 90% classification accuracy across three distinct tactile modes, establishing it as a viable technology for lightweight, low-cost multimodal fingertip haptic interfaces.
核心發現： 基於織物的氣動驅動在三種不同觸覺模式下達到超過 90% 的分類準確率，確立其作為輕量、低成本多模態指尖觸覺介面的可行技術路線。

AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

Authors: Aiza Maksutova, Lalithkumar Seenivasan, Hao Ding et al. | Submitted: 2026-04-01 | arXiv: 2604.01371 Categories: cs.CV, cs.AI, cs.RO

Research Background: Surgical automation requires predicting not just what actions to take but where instruments safely interact on tissue surfaces. Current methods lack explicit tool-action-specific spatial conditioning for safe tissue interaction regions during cholecystectomy.
研究背景： 手術自動化不僅需要預測執行什麼動作，還需要預測器械在組織表面上安全互動的位置。當前方法缺乏在膽囊切除術中對安全組織互動區域的明確工具-動作特定空間條件。

Technical Approach: AffordTissue combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning for generalization across instrument-action pairs, and a DiT-style decoder for dense affordance heatmap prediction. The benchmark covers 15,638 video clips across 103 procedures with six unique tool-action pairs.
技術方法： AffordTissue 結合了跨多個視角捕捉工具運動和組織動態的時序視覺編碼器、用於跨器械-動作對泛化的語言條件，以及用於密集可供性熱圖預測的 DiT 風格解碼器。基準涵蓋 103 個手術程序中 15,638 個視訊片段，包含六種唯一工具-動作對。

Key Takeaway: AffordTissue achieves 20.6 px ASSD vs. 60.2 px for VLM baselines, demonstrating that a task-specific architecture significantly outperforms large foundation models for dense surgical affordance prediction.
核心發現： AffordTissue 達到 20.6 px ASSD，而 VLM 基準為 60.2 px，表明任務特定架構在密集手術可供性預測方面顯著優於大型基礎模型。
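
For readers unfamiliar with the metric, the sketch below computes an average symmetric surface distance (ASSD) between two sets of boundary points, using the common definition. How the paper extracts boundaries from heatmaps is not stated in this digest, so the point-set inputs and the function name `assd` are assumptions.

```python
import math


def assd(points_a, points_b):
    """Average symmetric surface distance between two 2D point sets, in pixels.

    Each point is an (x, y) tuple; both sets must be non-empty.
    """
    a_to_b = sum(min(math.dist(p, q) for q in points_b) for p in points_a)
    b_to_a = sum(min(math.dist(q, p) for p in points_a) for q in points_b)
    return (a_to_b + b_to_a) / (len(points_a) + len(points_b))
```

Lower values mean the predicted region boundary sits closer to the annotated one.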

Functional Force-Aware Retargeting from Virtual Human Demos to Soft Robot Policies

Authors: Uksang Yoo, Mengjia Zhu, Evan Pezent et al. | Submitted: 2026-04-01 | arXiv: 2604.01224 Categories: cs.RO

Research Background: Transferring human dexterous manipulation skills to soft robotic hands is hindered by extreme morphological differences and nonlinear compliance. Kinematic retargeting alone fails to capture the functional intent encoded in contact forces.
研究背景： 將人類靈巧操作技能遷移到軟性機器人手部受到極端形態差異和非線性順應性的阻礙。單純的運動學重定向無法捕捉接觸力中編碼的功能意圖。

Technical Approach: SoftAct uses VR to capture rich human demonstrations including hand kinematics, contact patches, and force information. A two-stage force-aware retargeting algorithm attributes demonstrated forces to fingers proportionally, then performs online retargeting combining end-effector pose tracking with geodesic-weighted contact refinements using force magnitude.
技術方法： SoftAct 使用 VR 捕捉豐富的人類示範，包括手部運動學、接觸補丁和力資訊。兩階段力感知重定向算法按比例將示範力歸因於手指，然後結合末端執行器姿態追蹤與使用力幅度的測地加權接觸細化進行在線重定向。

Key Takeaway: SoftAct reduces fingertip trajectory tracking RMSE by up to 55% and variance by up to 69% over kinematic baselines, with consistently higher real-world zero-shot deployment success for contact-rich tasks.
核心發現： SoftAct 相比運動學基準將指尖軌跡追蹤 RMSE 降低最高 55%、方差降低最高 69%，在接觸豐富任務的真實世界零樣本部署中持續取得更高成功率。

BAT: Balancing Agility and Stability via Online Policy Switching for Long-Horizon Whole-Body Humanoid Control

Authors: Donghoon Baek, Sang-Hun Kim, Sehoon Ha | Submitted: 2026-04-01 | arXiv: 2604.01064 Categories: cs.RO

Research Background: Developing unified frameworks for agile, precise, and robust long-horizon whole-body humanoid behavior remains challenging. Coupled whole-body policies and decoupled modular policies each offer complementary strengths without a principled integration.
研究背景： 開發用於靈活、精確和魯棒的長時域全身人形機器人行為的統一框架仍具挑戰性。耦合全身策略和解耦模組化策略各自提供互補優勢，但缺乏有原則的整合。

Technical Approach: BAT dynamically selects between two complementary whole-body RL controllers via a switching policy learned through hierarchical RL with expert guidance from sliding-horizon policy pre-evaluation. An option-aware VQ-VAE predicts option preference from discrete motion token sequences, with confidence-weighted fusion for the final decision, evaluated on Unitree G1.
技術方法： BAT 透過從滑動視野策略預評估獲得專家引導的分層 RL 學習的切換策略，動態選擇兩個互補的全身 RL 控制器。選項感知 VQ-VAE 從離散運動 token 序列預測選項偏好，最終決策採用置信度加權融合，在 Unitree G1 上評估。

Key Takeaway: BAT enables versatile long-horizon loco-manipulation on the Unitree G1 humanoid robot and outperforms prior methods across diverse tasks by dynamically balancing agility and stability through online policy switching.
核心發現： BAT 透過在線策略切換動態平衡敏捷性和穩定性，在 Unitree G1 人形機器人上實現多功能的長時域運動-操作，並在多樣任務中優於先前方法。

An Integrated Soft Robotic System for Measuring Vital Signs in Search and Rescue Environments

Authors: Jorge Francisco García-Samartín, Christyan Cruz Ulloa, Andrés Sánchez-Silva et al. | Submitted: 2026-04-01 | arXiv: 2604.00971 Categories: cs.RO

Research Background: Search-and-rescue robots face challenges in victim assessment, particularly for heart rate and blood pressure measurement in post-disaster scenarios. Existing solutions lack the capability to measure pressure in unstructured environments.
研究背景： 搜救機器人在受害者評估方面面臨挑戰，特別是在災後場景中的心率和血壓測量。現有解決方案缺乏在非結構化環境中測量壓力的能力。

Technical Approach: A soft gripper designed to envelop a victim’s arm and inflate like a sphygmomanometer is integrated into a mobile robotic system with a specialized portability system. Various signal processing algorithms extract pulse and blood pressure readings, validated through statistical analysis including homoscedasticity testing.
技術方法： 設計為包裹受害者手臂並像血壓計一樣充氣的軟性抓手，與帶有專門可攜性系統的移動機器人系統整合。各種信號處理算法提取脈搏和血壓讀數，透過包括同方差性測試的統計分析進行驗證。

Key Takeaway: The system achieves a pulse bias of 4 BPM and approximately 5 mmHg bias for blood pressure across diverse victim positions, demonstrating suitability for real post-disaster search-and-rescue operations.
核心發現： 該系統在不同受害者姿勢下實現 4 BPM 的脈搏偏差和約 5 mmHg 的血壓偏差，展示了在真實災後搜救行動中的適用性。

A Dual-Action Fabric-Based Soft Robotic Glove for Ergonomic Hand Rehabilitation

Authors: Rui Chen, Firman Isma Serdana, Domenico Chiaradia et al. | Submitted: 2026-04-01 | arXiv: 2604.00768 Categories: cs.RO, cs.HC

Research Background: Hand impairment following neurological disorders substantially limits independence in daily activities. Soft robotic gloves face persistent challenges in ergonomic fit and independent flexion-extension actuation that constrain clinical utility.
研究背景： 神經系統疾病後的手部損傷大幅限制日常活動的獨立性。軟性機器人手套在人體工學貼合和獨立屈伸驅動方面面臨持續挑戰，制約了臨床實用性。

Technical Approach: The glove incorporates five independently controlled dual-action actuators for finger flexion/extension plus a thumb abduction actuator, fabricated using CNC heat-sealing to create symmetrical-chamber actuators that adopt a concave surface upon inflation for maximum finger contact. Tested with 10 healthy subjects and a pilot with 3 spinal cord injury patients across 7 functional tasks.
技術方法： 手套採用五個獨立控制的雙動作執行器用於手指屈伸，加上拇指外展執行器，使用 CNC 熱封製造對稱腔體執行器，充氣時呈凹面以最大化手指接觸。與 10 名健康受試者測試，並對 3 名脊髓損傷患者進行 7 項功能任務的初步試驗。

Key Takeaway: Active glove assistance significantly reduces forearm muscle activity and promotes more natural grasp patterns in SCI patients, though the current actuation interface increases task completion time.
核心發現： 主動手套輔助顯著減少前臂肌肉活動，並促進脊髓損傷患者更自然的抓握模式，儘管當前驅動介面增加了任務完成時間。

How to Train your Tactile Model: Tactile Perception with Multi-fingered Robot Hands

Authors: Christopher J. Ford, Kaichen Shi, Laura Butcher et al. | Submitted: 2026-04-01 | arXiv: 2604.00744 Categories: cs.RO

Research Background: Rapid deployment of new vision-based tactile sensors is limited by CNN-based perception models that require large sensor-specific datasets and retraining for each new sensor due to differences in lens, illumination, and wear.
研究背景： 基於視覺的新型觸覺感測器的快速部署受到基於 CNN 的感知模型的限制，這些模型因鏡頭、照明和磨損的差異而需要針對每個新感測器收集大量特定資料集並重新訓練。

Technical Approach: TacViT is a tactile perception model based on Vision Transformers that leverages global self-attention to extract robust features from tactile images. By learning generalizable representations rather than sensor-specific features, TacViT enables accurate contact property inference on previously unseen sensors from a five-fingered robot hand without retraining.
技術方法： TacViT 是基於 Vision Transformer 的觸覺感知模型，利用全局自注意力從觸覺圖像中提取魯棒特徵。透過學習可泛化表示而非感測器特定特徵，TacViT 能夠在無需重新訓練的情況下對五指機器人手上以前未見過的感測器進行準確的接觸屬性推理。

Key Takeaway: TacViT demonstrates superior generalization performance compared to CNNs on new sensors, significantly reducing data collection and retraining requirements to accelerate deployment of new tactile sensors.
核心發現： TacViT 在新感測器上展現出優於 CNN 的泛化性能，顯著減少資料收集和重新訓練需求，加速新觸覺感測器的部署。

A Physical Imitation Learning Pipeline for Energy-Efficient Quadruped Locomotion Assisted by Parallel Elastic Joint

Authors: Huyue Ma, Yurui Jin, Helmut Hauser et al. | Submitted: 2026-04-01 | arXiv: 2604.00611 Categories: cs.RO

Research Background: Animal locomotion exploits passive body dynamics for energy efficiency — a principle known as Embodied Physical Intelligence. Robot designs typically suppress rather than exploit intrinsic body dynamics, missing opportunities for energy savings.
研究背景： 動物運動利用被動身體動態提高能效，這一原則稱為具身物理智能。機器人設計通常壓制而非利用內在身體動態，錯失了節能機會。

Technical Approach: Physical Imitation Learning (PIL) distills an RL control policy into physically implementable body responses that are offloaded to passive Parallel Elastic Joints (PEJs). A residual motor policy recovers the RL performance gap while the body physically implements the extracted portion, achieving brain-body co-design without expanding the design search space.
技術方法： 物理模仿學習（PIL）將強化學習控制策略提煉成可物理實現的身體響應，卸載到被動並聯彈性關節（PEJ）中。殘差電機策略恢復 RL 性能差距，同時身體物理實現提取的部分，在不擴大設計搜索空間的情況下實現腦體協同設計。
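
The idea of offloading part of a policy's torque to a passive elastic element can be sketched as a least-squares fit of a linear spring to logged torques, with the remainder left to the motor. The paper's actual distillation procedure is not described here, so the linear spring model and the names `fit_linear_spring` and `residual_torques` are illustrative assumptions.

```python
def fit_linear_spring(angles, torques):
    """Fit torque = -k * (q - q0) to logged (angle, torque) samples.

    Returns (k, q0). Assumes the fitted stiffness is non-zero.
    """
    n = len(angles)
    mean_q = sum(angles) / n
    mean_t = sum(torques) / n
    covariance = sum((q - mean_q) * (t - mean_t) for q, t in zip(angles, torques))
    variance = sum((q - mean_q) ** 2 for q in angles)
    slope = covariance / variance
    intercept = mean_t - slope * mean_q
    stiffness = -slope
    rest_angle = intercept / stiffness
    return stiffness, rest_angle


def residual_torques(angles, torques, stiffness, rest_angle):
    """Torque the motor must still supply after the spring contribution."""
    return [t + stiffness * (q - rest_angle) for q, t in zip(angles, torques)]
```

In this simplified picture, the smaller the residual torques, the larger the share of actuation the passive joint carries.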

Key Takeaway: PIL offloads up to 87% of mechanical power to PEJs on flat terrain, demonstrating that distilling control policies into passive body mechanics is a computationally efficient path toward embodied physical intelligence for quadruped locomotion.
核心發現： PIL 在平坦地形上將高達 87% 的機械功率卸載到 PEJ，證明將控制策略提煉到被動身體力學中是實現四足機器人具身物理智能的計算高效路徑。

Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning

Authors: Yichen Xie, Yixiao Wang, Shuqi Zhao et al. | Submitted: 2026-04-01 | arXiv: 2604.00557 Categories: cs.RO, cs.CV, cs.LG

Research Background: Imitation learning generalization is constrained by demonstration diversity, but collecting demonstrations across varied environments is costly. Exploiting inherent scene diversity without additional human effort remains underexplored.
研究背景： 模仿學習的泛化受示範多樣性約束，但在不同環境中收集示範成本高昂。在不需要額外人工努力的情況下利用內在場景多樣性仍未被充分探索。

Technical Approach: The framework uses multiple synchronized camera perspectives to generate pseudo-demonstrations from each expert trajectory, enriching training distribution and improving viewpoint invariance. Different action spaces (world-frame vs. camera-space) interact differently with view scaling, and a multiview action aggregation method allows single-view policies to benefit from multiple cameras during deployment.
技術方法： 框架使用多個同步相機視角從每個專家軌跡生成偽示範，豐富訓練分佈並提高視角不變性。不同動作空間（世界坐標系 vs. 相機坐標系）與視角縮放的互動不同，多視角動作聚合方法允許單視角策略在部署時受益於多個相機。
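
The difference between world-frame and camera-space actions comes down to a change of basis. The sketch below re-expresses one world-frame translation delta in each camera's frame to produce per-view pseudo-demonstrations. The function names, the dictionary layout, and the restriction to translation-only actions are assumptions for illustration, not the paper's pipeline.

```python
def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def world_to_camera_action(delta_world, cam_to_world_rotation):
    """Express a world-frame translation delta in the camera frame.

    cam_to_world_rotation is a 3x3 rotation whose columns are the camera
    axes written in world coordinates, so its transpose maps world to camera.
    """
    return matvec(transpose(cam_to_world_rotation), delta_world)


def make_pseudo_demos(deltas_world, camera_rotations):
    """Build one camera-space action sequence per camera from a single trajectory."""
    return {
        name: [world_to_camera_action(delta, rotation) for delta in deltas_world]
        for name, rotation in camera_rotations.items()
    }
```

With world-frame actions the labels are identical across views, whereas camera-space labels change with each camera's rotation, which is one way the two action spaces can interact differently with view scaling.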

Key Takeaway: Scaling camera views provides a practical and scalable solution for imitation learning data efficiency, requiring minimal additional hardware and integrating seamlessly with existing algorithms.
核心發現： 擴展相機視角為模仿學習資料效率提供了實際且可擴展的解決方案，只需最少的額外硬體並與現有算法無縫整合。

HapCompass: A Rotational Haptic Device for Contact-Rich Robotic Teleoperation

Authors: Xiangshan Tan, Jingtian Ji, Tianchong Jiang et al. | Submitted: 2026-03-31 | arXiv: 2603.30042 Categories: cs.RO, cs.HC

Research Background: Contact-rich teleoperation requires intuitive directional haptic cues, but existing solutions like non-directional vibration or vibrotactile arrays provide limited information or suffer from perceptual interference.
研究背景： 接觸豐富的遠端操作需要直覺性的方向觸覺線索，但現有解決方案（如非方向性振動或振動觸覺陣列）提供的資訊有限或存在感知干擾問題。

Technical Approach: HapCompass is a low-cost wearable device that renders 2D directional cues by mechanically rotating a single linear resonant actuator (LRA), providing compass-like directional feedback without requiring complex multi-actuator arrays. User studies evaluate success rate, completion time, and maximum contact force for teleoperated manipulation compared to vision-only and non-directional baselines.
技術方法： HapCompass 是一種低成本可穿戴設備，透過機械旋轉單個線性共振執行器（LRA）渲染二維方向線索，提供類似指南針的方向反饋，無需複雜的多執行器陣列。用戶研究評估遠端操作相比純視覺和非方向性基準的成功率、完成時間和最大接觸力。
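
A compass-like cue needs a target angle and a rotation command to reach it. The sketch below derives both from a planar direction vector. The digest does not say what signal HapCompass maps to direction or how its motor is commanded, so the input vector and the names `cue_angle_deg` and `shortest_rotation_deg` are hypothetical.

```python
import math


def cue_angle_deg(dir_x, dir_y):
    """Angle of a planar direction vector, in degrees within [0, 360)."""
    return math.degrees(math.atan2(dir_y, dir_x)) % 360.0


def shortest_rotation_deg(current_deg, target_deg):
    """Signed rotation in [-180, 180) that moves the actuator from current to target."""
    return (target_deg - current_deg + 180.0) % 360.0 - 180.0
```

Taking the shortest signed rotation avoids spinning the actuator the long way around when the cue crosses the 0/360 degree boundary.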

Key Takeaway: HapCompass increases teleoperation success rate, decreases completion time and contact force, and improves imitation learning demonstration quality, while being low-cost and simple to fabricate.
核心發現： HapCompass 提高了遠端操作成功率，縮短了完成時間、降低了接觸力，並提升了模仿學習示範品質，同時成本低廉且製造簡單。

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Authors: Yi Chen, Yuying Ge, Hui Zhou et al. | Submitted: 2026-03-31 | arXiv: 2603.29844 Categories: cs.RO, cs.AI, cs.CV, cs.LG

Research Background: Most end-to-end VLAs treat the VLM primarily as a multimodal encoder that directly maps features to low-level actions, underutilizing its high-level decision-making potential and causing training instability that degrades rich semantic representations.
研究背景： 大多數端到端 VLA 將 VLM 主要視為直接將特徵映射到低層動作的多模態編碼器，未充分利用其高層決策潛力，並導致降低豐富語義表示的訓練不穩定性。

Technical Approach: DIAL introduces a differentiable latent intent bottleneck: a VLM-based System-2 performs latent world modeling by synthesizing visual foresight within the VLM’s native feature space, while a lightweight System-1 decodes the predicted intent with current observations into robot actions via latent inverse dynamics. Two-stage training (decoupled warmup then end-to-end joint optimization) ensures stability.
技術方法： DIAL 引入可微潛在意圖瓶頸：基於 VLM 的 System-2 透過在 VLM 原生特徵空間內合成視覺預見來執行潛在世界建模，輕量級 System-1 透過潛在逆動力學將預測意圖與當前觀測解碼為機器人動作。兩階段訓練（解耦預熱再端到端聯合優化）確保穩定性。

Key Takeaway: DIAL establishes a new state-of-the-art on RoboCasa GR1 Tabletop with 10x fewer demonstrations than prior methods, and demonstrates zero-shot generalization to unseen objects on a humanoid robot.
核心發現： DIAL 在 RoboCasa GR1 桌面基準上以比先前方法少 10 倍的示範建立了新的最先進水準，並在人形機器人上展示了對未見物體的零樣本泛化。

Authors: Andrew Jeong, Jaemin Kim, Sebin Lee et al. | Submitted: 2026-03-31 | arXiv: 2603.29409 Categories: cs.RO

Research Background: Robotic manipulation involves coupled kinematic and semantic state transitions, but existing approaches plan within either semantic or latent space without explicitly aligning cross-modal transitions, limiting their effectiveness.
研究背景： 機器人操作涉及耦合的運動學和語義狀態轉換，但現有方法在語義或潛在空間內規劃，而不明確對齊跨模態轉換，限制了其效果。

Technical Approach: CLaD models how proprioceptive and semantic states jointly evolve under actions through asymmetric cross-attention that allows kinematic transitions to query semantic ones. Self-supervised objectives with EMA target encoders and auxiliary reconstruction losses prevent representation collapse while anchoring predictions to observable states, with predicted foresights conditioning a diffusion policy.
技術方法： CLaD 透過讓運動學轉換查詢語義轉換的非對稱交叉注意力，對本體感覺和語義狀態如何在動作下聯合演化建模。帶 EMA 目標編碼器的自監督目標和輔助重建損失防止表示崩潰，同時將預測錨定到可觀測狀態，預測預見條件化擴散策略。
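
The EMA target encoder mentioned above follows a standard update rule: the target weights track a slow moving average of the online weights. A minimal sketch on flat parameter lists follows; the function name `ema_update` and the momentum value are illustrative assumptions rather than the paper's settings.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return [
        momentum * target + (1.0 - momentum) * online
        for target, online in zip(target_params, online_params)
    ]
```

Because the target changes slowly and receives no gradients, it gives the online encoder a stable prediction target, which is one common way to discourage representation collapse.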

Key Takeaway: CLaD achieves 94.7% success rate on LIBERO-LONG, competitive with large VLAs but with significantly fewer parameters, demonstrating the value of grounded cross-modal latent foresight for long-horizon manipulation.
核心發現： CLaD 在 LIBERO-LONG 上達到 94.7% 成功率，與大型 VLA 競爭但參數顯著更少，證明了基礎跨模態潛在預見對長時域操作的價值。

Kilohertz-Safe: A Scalable Framework for Constrained Dexterous Retargeting

Authors: Yinxiao Tian, Ziyi Yang, Zinan Zhao et al. | Submitted: 2026-03-31 | arXiv: 2603.29213 Categories: cs.RO

Research Background: Dexterous hand teleoperation requires motion retargeting at kilohertz-level frequencies with hard safety constraints, but nonlinear optimization-based approaches are too slow and learning-based methods lack formal safety guarantees.
研究背景： 靈巧手遠端操作需要千赫茲頻率的運動重定向並伴有硬性安全約束，但基於非線性優化的方法過慢，而基於學習的方法缺乏正式安全保證。

Technical Approach: The framework reformulates the nonlinear retargeting problem as a convex quadratic program in joint differential space, incorporating kinematic limits and collision avoidance through systematic linearization. Control barrier functions provide formal safety guarantees, validated on the Wuji Hand platform with comparisons against Dex-Retargeting and GeoRT.
技術方法： 框架將非線性重定向問題重新建模為關節微分空間中的凸二次規劃，透過系統線性化結合運動學限制和碰撞避免。控制屏障函數提供正式安全保證，在 Wuji Hand 平台上驗證，並與 Dex-Retargeting 和 GeoRT 比較。
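
The smallest instance of a QP-based safety filter has a single linear constraint and a closed-form solution. The sketch below applies it to one joint-limit barrier. The paper's QP handles many constraints at once and is not reproduced here; the names `safety_filter` and `joint_limit_constraint` and the gain `alpha` are hypothetical.

```python
def safety_filter(u_nominal, a, b):
    """Solve min ||u - u_nominal||^2 subject to a . u >= b (one constraint).

    If the nominal command already satisfies the constraint it is returned
    unchanged; otherwise it is projected onto the constraint boundary.
    """
    dot = sum(a_i * u_i for a_i, u_i in zip(a, u_nominal))
    if dot >= b:
        return list(u_nominal)
    norm_sq = sum(a_i * a_i for a_i in a)
    scale = (b - dot) / norm_sq
    return [u_i + scale * a_i for u_i, a_i in zip(u_nominal, a)]


def joint_limit_constraint(q, q_max, index, alpha=10.0):
    """Barrier h = q_max - q[index] with condition h_dot >= -alpha * h.

    Since h_dot = -q_dot[index], this is the linear constraint a . q_dot >= b.
    """
    a = [0.0] * len(q)
    a[index] = -1.0
    b = -alpha * (q_max - q[index])
    return a, b
```

Near the limit the filter caps the joint velocity at `alpha * (q_max - q)`, so the allowed speed shrinks to zero as the joint approaches its bound.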

Key Takeaway: The framework achieves 9.05 ms average latency with over 95% of retargeted frames satisfying safety criteria, enabling kilohertz-level dexterous retargeting with formal safety guarantees.
核心發現： 框架實現 9.05 毫秒平均延遲，超過 95% 的重定向幀滿足安全標準，實現了具有正式安全保證的千赫茲級靈巧重定向。

Generalizable Dense Reward for Long-Horizon Robotic Tasks

Authors: Silong Yong, Stephen Sheng, Carl Qi et al. | Submitted: 2026-03-31 | arXiv: 2604.00055 Categories: cs.RO, cs.CV, cs.LG

Research Background: Foundation policies trained via large-scale imitation learning struggle with long-horizon tasks due to distribution shift. RL fine-tuning is limited by manual reward engineering requirements that prevent generalization across diverse tasks.
研究背景： 透過大規模模仿學習訓練的基礎策略因分佈偏移而在長時域任務上表現欠佳。RL 微調受到手動獎勵工程需求的限制，阻礙了跨多樣任務的泛化。

Technical Approach: VLLR combines an extrinsic reward from LLMs/VLMs for task progress recognition and an intrinsic reward based on policy self-certainty. LLMs decompose tasks into verifiable subtasks, VLMs estimate progress to initialize the value function for a warm-up phase, and self-certainty provides per-step intrinsic guidance throughout PPO fine-tuning.
技術方法： VLLR 結合來自 LLM/VLM 用於任務進度識別的外在獎勵和基於策略自我確定性的內在獎勵。LLM 將任務分解為可驗證的子任務，VLM 估計進度以初始化預熱階段的值函數，自我確定性在整個 PPO 微調過程中提供每步內在引導。
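
To make the notion of a per-step self-certainty signal concrete, the sketch below scores how peaked a policy's action distribution is. The digest does not give VLLR's definition, so one minus normalized entropy is used here purely as an assumed stand-in, and `self_certainty` is a hypothetical name.

```python
import math


def self_certainty(action_probs):
    """Confidence score in [0, 1]: 1 minus entropy divided by its maximum.

    Assumed definition for illustration. Requires at least two actions
    and probabilities that sum to 1.
    """
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0.0)
    max_entropy = math.log(len(action_probs))
    return 1.0 - entropy / max_entropy
```

A score like this could serve as a dense intrinsic signal because it is available at every step without any task-specific reward design.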

Key Takeaway: VLLR achieves up to 56% absolute success rate gains over pretrained policies and up to 10% gains on out-of-distribution tasks on the CHORES benchmark, all without manual reward engineering.
核心發現： VLLR 在 CHORES 基準上相比預訓練策略實現高達 56% 的絕對成功率提升，在分佈外任務上提升高達 10%，全程無需手動獎勵工程。

FocusVLA: Focused Visual Utilization for Vision-Language-Action Models

Authors: Yichi Zhang, Weihao Yuan, Yizhuo Zhang et al. | Submitted: 2026-03-30 | arXiv: 2603.28740 Categories: cs.RO

Research Background: VLA models suffer from three bottlenecks in visual processing: architectural bias causing them to overlook visual details, excessive visual tokens making attention hard to focus, and task-irrelevant visual noise — together impairing action quality.
研究背景： VLA 模型在視覺處理上面臨三個瓶頸：導致忽略視覺細節的架構偏差、使注意力難以聚焦的過多視覺 token，以及任務無關的視覺雜訊——共同損害動作品質。

Technical Approach: FocusVLA proposes Modality Cascaded Attention to eliminate shortcut pathways that bypass visual details, and Focus Attention that dynamically selects task-relevant visual patches while explicitly suppressing irrelevant noise. Empirical validation confirms that VLA performance is primarily limited by visual utilization rather than representation quality.
技術方法： FocusVLA 提出模態級聯注意力來消除繞過視覺細節的捷徑路徑，以及動態選擇任務相關視覺補丁同時明確抑制不相關雜訊的焦點注意力。實證驗證確認 VLA 性能主要受視覺利用而非表示品質限制。

Key Takeaway: FocusVLA effectively leverages visual details for dexterous manipulation and substantially improves performance and convergence speed across diverse simulated and real-world robotic benchmarks.
核心發現： FocusVLA 有效利用視覺細節進行靈巧操作，在多樣化的模擬和真實世界機器人基準上大幅提升性能和收斂速度。

StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

Authors: Yiran Shi, Dongqi Guo, Tianchen Zhao et al. | Submitted: 2026-03-30 | arXiv: 2603.28565 Categories: cs.RO, cs.CV

Research Background: VLA models suffer from frequent execution halting because observation, action generation, and execution must proceed sequentially, creating high latency that prevents fluid real-world deployment on edge platforms.
研究背景： VLA 模型因觀測、動作生成和執行必須順序進行而頻繁暫停執行，產生高延遲，阻礙了在邊緣平台上的流暢真實世界部署。

Technical Approach: StreamingVLA enables asynchronous parallelization across VLA stages by: replacing action chunking with action flow matching that learns trajectory flows (overlapping generation and execution latency), and designing an action saliency-aware adaptive observation mechanism that overlaps execution and observation latency.
技術方法： StreamingVLA 透過以下方式跨 VLA 階段實現非同步並行化：用學習軌跡流的動作流匹配取代動作分塊（重疊生成和執行延遲），以及設計動作顯著性感知的自適應觀測機制（重疊執行和觀測延遲）。

Key Takeaway: StreamingVLA achieves 2.4x latency speedup and reduces execution halting by 6.5x without sacrificing performance, enabling substantially more fluid VLA-based robot control.
核心發現： StreamingVLA 在不犧牲性能的情況下實現 2.4 倍延遲加速並將執行暫停減少 6.5 倍，實現更流暢的 VLA 機器人控制。

ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

Authors: Yu Sun, Meng Cao, Ping Yang et al. | Submitted: 2026-03-30 | arXiv: 2603.28545 Categories: cs.RO, cs.CV

Research Background: Existing VLA and world model benchmarks are largely simulator-centric and fail to capture the reality gap caused by perception noise, contact dynamics, and hardware constraints. Fragmented real-world evaluations prevent fair comparison across robot platforms.
研究背景： 現有的 VLA 和世界模型基準主要以模擬器為中心，無法捕捉感知雜訊、接觸動態和硬體限制造成的現實差距。碎片化的真實世界評估阻礙了跨機器人平台的公平比較。

Technical Approach: ManipArena provides a standardized evaluation framework with 20 diverse tasks across 10,812 expert trajectories emphasizing semantic and spatial reasoning, multi-level generalization through OOD settings, long-horizon mobile manipulation, rich sensory diagnostics, and synchronized real-to-sim environments via high-quality 3D scanning.
技術方法： ManipArena 提供一個標準化評估框架，包含 20 個多樣任務、10,812 條強調語義和空間推理的專家軌跡、透過 OOD 設置的多級泛化、長時域移動操作、豐富的感測器診斷，以及透過高品質三維掃描的同步真實到模擬環境。

Key Takeaway: ManipArena enables fair, realistic, and reproducible evaluation bridging simulation and real-world execution, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.
核心發現： ManipArena 實現公平、真實和可重現的評估，橋接模擬和真實世界執行，為診斷和推進具身智能系統提供可擴展的基礎。

Feel Robot Feels: Tactile Feedback Array Glove for Dexterous Manipulation

Authors: Feiyu Jia, Xiaojie Niu, Sizhe Yang et al. | Submitted: 2026-03-30 | arXiv: 2603.28542 Categories: cs.RO

Research Background: Dexterous teleoperation is constrained by inaccurate hand-robot motion mapping and limited tactile feedback that forces operators to rely on vision alone, hindering perception of contact geometry and force.
研究背景： 靈巧遠端操作受到不準確的手-機器人運動映射和有限觸覺反饋的制約，迫使操作員僅依賴視覺，阻礙了對接觸幾何和力的感知。

Technical Approach: TAG (Tactile Array Glove) integrates non-contact magnetic sensing for drift-free 21-DoF joint tracking with errors below 1 degree, and equips each finger with a 32-actuator tactile array in a 2 cm² module providing spatial activation patterns that represent physical interactions at the robot end-effector.
技術方法： TAG（觸覺陣列手套）整合了非接觸式磁感應，用於誤差低於 1 度的無漂移 21 自由度關節追蹤，並為每根手指配備 2 cm² 模組中的 32 執行器觸覺陣列，提供代表機器人末端執行器物理互動的空間激活模式。

Key Takeaway: TAG enables reliable real-time perception of contact geometry and dynamic force in dexterous teleoperation, improving success rates in contact-rich tasks and the reliability of demonstration data for learning-based manipulation.
核心發現： TAG 在靈巧遠端操作中實現可靠的實時接觸幾何和動態力感知，提升接觸豐富任務的成功率和基於學習的操作示範資料的可靠性。

Tac2Real: Reliable and GPU Visuotactile Simulation for Online Reinforcement Learning and Zero-Shot Real-World Deployment

Authors: Ningyu Yan, Shuai Wang, Xing Shen et al. | Submitted: 2026-03-30 | arXiv: 2603.28475 Categories: cs.RO

Research Background: Policy learning with tactile feedback in simulation requires balancing physics fidelity and computational efficiency — a difficult combination that limits online RL training with visuotactile sensors.
研究背景： 在模擬中使用觸覺反饋進行策略學習需要在物理保真度和計算效率之間取得平衡，這一困難的組合限制了使用視觸覺感測器的在線 RL 訓練。

Technical Approach: Tac2Real integrates PNCG-IPC (Preconditioned Nonlinear Conjugate Gradient Incremental Potential Contact) with a multi-node, multi-GPU parallel simulation architecture to generate marker displacement fields at interactive rates. TacAlign systematically narrows structured and stochastic sim-to-real domain gaps for reliable zero-shot transfer.
技術方法： Tac2Real 將 PNCG-IPC（預條件非線性共軛梯度增量勢能接觸）與多節點、多 GPU 並行模擬架構整合，以互動速率生成標記位移場。TacAlign 系統性地縮小結構性和隨機性的模擬到真實領域差距以實現可靠的零樣本遷移。

Key Takeaway: Tac2Real enables efficient online RL training with visuotactile feedback and achieves a high success rate in zero-shot real-world peg insertion, validating its effectiveness as a tactile simulation framework.
核心發現： Tac2Real 實現了具有視觸覺反饋的高效在線 RL 訓練，並在零樣本真實世界插銷任務中達到高成功率，驗證了其作為觸覺模擬框架的有效性。

Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching

Authors: Weiguang Zhao, Junting Dong, Rui Zhang et al. | Submitted: 2026-03-30 | arXiv: 2603.28427 Categories: cs.RO, cs.CV

Research Background: While teleoperation is well-studied for static grasping and manipulation, dynamic object catching — where objects move before contact — remains underexplored. Pure teleoperation fails due to timing, pose, and force errors, motivating shared autonomy approaches.
研究背景： 雖然遠端操作在靜態抓取和操作方面研究充分，但動態物體接取（物體在接觸前移動）仍未被充分探索。純遠端操作因時序、姿態和力誤差而失敗，激勵了共享自主方法。

Technical Approach: Tele-Catch introduces DAIM (dynamics-aware adaptive integration mechanism) for shared autonomy by fusing glove-based teleoperation signals into the diffusion policy denoising process, adaptively modulating control based on interaction object state. DP-U3R integrates unsupervised geometric representations from point cloud observations for geometry-aware decision making.
技術方法： Tele-Catch 引入 DAIM（動態感知自適應整合機制），透過將基於手套的遠端操作信號融合到擴散策略去噪過程中實現共享自主，根據互動物體狀態自適應調節控制。DP-U3R 從點雲觀測整合無監督幾何表示用於幾何感知決策。

Key Takeaway: Tele-Catch significantly improves accuracy and robustness in dynamic catching tasks while generalizing to different dexterous hand embodiments and previously unseen object categories.
核心發現： Tele-Catch 顯著提升動態接取任務的準確性和魯棒性，同時泛化到不同的靈巧手本體和以前未見過的物體類別。

Active Stereo-Camera Outperforms Multi-Sensor Setup in ACT Imitation Learning for Humanoid Manipulation

Authors: Robin Kühn, Moritz Schappler, Thomas Seel et al. | Submitted: 2026-03-30 | arXiv: 2603.28422 Categories: cs.RO

Research Background: Teaching humanoid robots via imitation learning requires optimal sensor selection, but there is no consensus on what sensory hardware configurations are best for manipulation tasks. Adding more sensors is commonly assumed to help, but this may not hold in data-limited regimes.
研究背景： 透過模仿學習教授人形機器人需要最優感測器選擇，但對於操作任務哪種感測器硬體配置最佳尚無共識。通常假設增加更多感測器有助益，但在資料有限的情況下可能不成立。

Technical Approach: An open-source Unified Ablation Framework uses sensor masking on a comprehensive master dataset to benchmark 14 sensor combinations on the Unitree G1 humanoid robot with three-finger hands across two manipulation tasks, systematically evaluating tactile and proprioceptive modalities alongside active stereo vision.
技術方法： 開源統一消融框架在綜合主資料集上使用感測器遮罩，在兩個操作任務上為配備三指手的 Unitree G1 人形機器人基準測試 14 種感測器組合，系統性評估觸覺和本體感覺模態以及主動立體視覺。

Key Takeaway: A minimal active stereo-camera setup outperformed complex multi-sensor configurations (87.5% and 94.4% success), while adding pressure sensors actually reduced success to 67.3% due to low SNR — showing that strategic sensor selection is critical for data-limited IL.
核心發現： 最簡主動立體相機設置優於複雜多感測器配置（87.5% 和 94.4% 成功率），而添加壓力感測器因低信噪比實際上將成功率降至 67.3%，顯示戰略性感測器選擇對資料有限的模仿學習至關重要。

Authors: Zhihao Lv, Xiaoyong Zhang, Mengfan Zhang et al. | Submitted: 2026-03-30 | arXiv: 2603.28362 Categories: cs.RO

Research Background: Multimodal locomotion is essential for navigating confined and unstructured environments like the gastrointestinal tract. Existing small-scale soft robots lack the combination of agility, multimodal capability, and space adaptability needed for such biomedical applications.
研究背景： 多模態運動對於在腸胃道等狹窄非結構化環境中導航至關重要。現有小型軟性機器人缺乏此類生醫應用所需的敏捷性、多模態能力和空間適應性的組合。

Technical Approach: M-SEMR (Multimodal Soft Electromagnetic Robot) features a six-spoke elastomer body with liquid metal channels driven by Laplace forces under a static magnetic field. It achieves over nine locomotion modes with rapid transitions (<0.35 s) and can fold to reduce volume by 79% for traversing confined spaces.
技術方法： M-SEMR（多模態軟性電磁機器人）具有在靜態磁場下由洛倫茲力驅動的含液態金屬通道的六輻彈性體主體。它實現超過九種運動模式，轉換迅速（<0.35 秒），並能折疊縮小體積 79% 以穿越狹窄空間。
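
The Laplace force is the force on a current-carrying conductor in a magnetic field, F = I (L × B), and is the macroscopic form of the Lorentz force on the moving charges. The sketch below evaluates it for one straight channel segment and applies the 79% volume reduction stated above. The current, segment length, field strength, and unfolded volume are illustrative assumptions, not values from the paper.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def laplace_force(current_a, segment_m, field_t):
    """Force in newtons on a straight conductor segment: F = I * (L x B)."""
    lx, ly, lz = cross(segment_m, field_t)
    return (current_a * lx, current_a * ly, current_a * lz)


def folded_volume(volume, reduction=0.79):
    """Volume remaining after folding, using the 79% reduction from the text."""
    return volume * (1.0 - reduction)


if __name__ == "__main__":
    # Assumed values: 2 A through a 1 cm segment along x, 0.1 T field along z.
    print(laplace_force(2.0, (0.01, 0.0, 0.0), (0.0, 0.0, 0.1)))
    # Assumed unfolded volume of 100 (arbitrary units).
    print(folded_volume(100.0))
```

Reversing the current reverses the force, which shows how switching channel currents under a static field can change the direction of actuation.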

Key Takeaway: M-SEMR achieves exceptional agility including 818 mm/s rolling speed and successfully navigates complex terrains including viscoelastic surfaces and simulated biological tissues, offering a versatile strategy for high-mobility biomedical soft robots.
核心發現： M-SEMR 實現卓越的敏捷性，包括 818 mm/s 的滾動速度，成功導航包括黏彈性表面和模擬生物組織在內的複雜地形，為高機動性生醫軟性機器人提供了多功能策略。

CARLA-Air: Fly Drones Inside a CARLA World — A Unified Infrastructure for Air-Ground Embodied Intelligence

Authors: Tianle Zeng, Hanxuan Chen, Yanci Wen et al. | Submitted: 2026-03-30 | arXiv: 2603.28032 Categories: cs.RO, cs.AI, cs.CV, cs.HC

Research Background: Demand for air-ground cooperative embodied AI systems is growing, but simulation infrastructure that jointly models aerial and ground agents in a single physically coherent environment is lacking. Existing platforms are domain-segregated, forcing bridge-based co-simulation with synchronization overhead.
研究背景： 日益增長的空地協同具身 AI 系統需求缺乏在單一物理一致環境中聯合建模空中和地面代理的模擬基礎設施。現有平台存在域分隔問題，迫使使用帶同步開銷的橋接共同模擬。

Technical Approach: CARLA-Air unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process, preserving both CARLA and AirSim native Python APIs and ROS 2 interfaces. A shared physics tick and rendering pipeline delivers photorealistic environments with rule-compliant traffic, pedestrians, and UAV dynamics, capturing up to 18 sensor modalities synchronously.
技術方法： CARLA-Air 在單一虛幻引擎進程中統一高保真城市駕駛和物理精確多旋翼飛行，保留 CARLA 和 AirSim 原生 Python API 及 ROS 2 介面。共享物理時步和渲染管線提供帶合規交通、行人和無人機動態的逼真環境，同步捕獲最多 18 種感測器模態。

Key Takeaway: CARLA-Air provides a unified open-source infrastructure for air-ground embodied AI research spanning cooperation, navigation, VLA, and RL policy training, inheriting and extending AirSim’s capabilities within a modern simulation framework.
核心發現： CARLA-Air 為涵蓋協作、導航、VLA 和 RL 策略訓練的空地具身 AI 研究提供統一的開源基礎設施，在現代模擬框架中繼承並擴展 AirSim 的功能。

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Authors: Chanyoung Kim, Minwoo Kim, Minseok Kang et al. | Submitted: 2026-03-30 | arXiv: 2603.28301 Categories: cs.LG

Research Background: VLA models fine-tuned with limited data tend to overfit to specific instruction formulations, but robustness to paraphrased instructions has been underexplored. This matters because real users express the same task in many different ways.
研究背景： 使用有限資料微調的 VLA 模型傾向於過擬合特定指令形式，但對改寫指令的魯棒性尚未被充分探索。這很重要，因為真實使用者會以多種不同方式表達相同任務。

Technical Approach: LIBERO-Para independently varies action expressions and object references for fine-grained analysis of linguistic generalization across 7 VLA configurations (0.6B–7.5B parameters). The PRIDE metric quantifies paraphrase difficulty using semantic and syntactic factors to reveal whether models rely on easier paraphrases or achieve consistent robustness.
技術方法： LIBERO-Para 獨立改變動作表達和物體參考，用於在 7 個 VLA 配置（0.6B-7.5B 參數）上對語言泛化進行細粒度分析。PRIDE 指標使用語義和句法因素量化改寫難度，以揭示模型是依賴較簡單的改寫還是實現一致的魯棒性。

Key Takeaway: Paraphrasing degrades performance by 22–52 percentage points across all tested VLA models, with 80–96% of failures arising from planning-level trajectory divergence rather than execution errors, revealing deep reliance on surface-level instruction matching.
核心發現： 改寫在所有測試 VLA 模型中導致 22-52 個百分點的性能下降，其中 80-96% 的失敗源於規劃層面的軌跡偏差而非執行錯誤，揭示了對表層指令匹配的深度依賴。
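
Since the degradation is reported in percentage points, the short sketch below shows how that differs from a relative drop. The 90% and 60% success rates are made-up numbers for illustration only, and both function names are hypothetical.

```python
def percentage_point_drop(original_pct, paraphrased_pct):
    """Absolute difference between two success rates, in percentage points."""
    return original_pct - paraphrased_pct


def relative_drop(original_pct, paraphrased_pct):
    """Drop expressed as a percentage of the original success rate."""
    return (original_pct - paraphrased_pct) / original_pct * 100.0


if __name__ == "__main__":
    # Hypothetical success rates: 90% on original wording, 60% on paraphrases.
    print(percentage_point_drop(90.0, 60.0))     # drop in percentage points
    print(round(relative_drop(90.0, 60.0), 1))   # drop relative to the original
```

A fixed percentage-point drop corresponds to a larger relative drop when the original success rate is lower, so the two figures should not be used interchangeably when comparing models.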

HandX: Scaling Bimanual Motion and Interaction Generation

Authors: Zimu Zhang, Yucheng Zhang, Xiyan Xu et al. | Submitted: 2026-03-30 | arXiv: 2603.28766 Categories: cs.CV

Research Background: Realistic hand motion synthesis and bimanual interaction generation remain underexplored compared to whole-body motion, with existing datasets lacking high-fidelity bimanual sequences that capture nuanced finger dynamics and coordination.
研究背景： 與全身運動相比，真實的手部運動合成和雙手互動生成仍未被充分探索，現有資料集缺乏捕捉細微手指動態和協調的高保真雙手序列。

Technical Approach: HandX consolidates and filters existing datasets while collecting new motion-capture data targeting bimanual interactions with detailed finger dynamics. A decoupled annotation strategy extracts representative motion features (contact events, finger flexion) and leverages LLM reasoning for fine-grained semantic descriptions, benchmarking diffusion and autoregressive models with versatile conditioning.
技術方法： HandX 整合和過濾現有資料集，同時收集針對含詳細手指動態的雙手互動的新動作捕捉資料。解耦標注策略提取代表性運動特徵（接觸事件、手指彎曲）並利用 LLM 推理生成細粒度語義描述，使用多樣條件基準測試擴散和自回歸模型。

Key Takeaway: HandX reveals clear scaling trends where larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion, establishing a strong foundation for dexterous bimanual motion generation research.
核心發現： HandX 揭示了明確的擴展趨勢，在更大、更高品質資料集上訓練的更大模型產生語義上更連貫的雙手運動，為靈巧雙手運動生成研究奠定了堅實基礎。

Learning Multi-View Spatial Reasoning from Cross-View Relations

Authors: Suchae Jeong, Jaehwi Song, Haeone Lee et al. | Submitted: 2026-03-30 | arXiv: 2603.27967 Categories: cs.CV

Research Background: Vision-language models excel at single-view tasks but lack multi-view spatial reasoning essential for embodied AI to understand 3D environments and manipulate objects across different viewpoints.
研究背景： 視覺語言模型在單視角任務上表現出色，但缺乏具身 AI 理解三維環境和跨不同視角操作物體所必需的多視角空間推理能力。

Technical Approach: The Cross-View Relations (XVR) dataset provides 100K VQA samples from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR serve as VLA backbones in downstream manipulation tasks.
技術方法： 跨視角關係（XVR）資料集提供來自 18K 多樣三維場景和 70K 機器人操作軌跡的 10 萬個 VQA 樣本，涵蓋三種空間推理任務：對應（跨視角匹配物體）、驗證（驗證空間關係）和定位（識別物體位置）。在 XVR 上微調的 VLM 作為下游操作任務中的 VLA 骨幹。

Key Takeaway: XVR-trained VLMs achieve substantial improvements on multi-view reasoning benchmarks and improve VLA success rates on RoboCasa, demonstrating that explicit cross-view spatial relation training transfers effectively to robotic manipulation.
核心發現： XVR 訓練的 VLM 在多視角推理基準上取得顯著提升，並提高了 RoboCasa 上的 VLA 成功率，證明明確的跨視角空間關係訓練有效遷移到機器人操作。
