arXiv Weekly Digest — Week 16, 2026

Fetched: 2026-04-13 | Categories: cs.RO, cs.LG, cs.HC, cs.CV | Papers: 44

VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

Authors: Xiaolei Lang, Yang Wang, Yukun Zhou, et al. | Submitted: 2026-04-10 | arXiv: 2604.09330 | Categories: cs.RO, cs.CV

Research Background: Scaling robot foundation models requires large-scale demonstration data, but teleoperation collection is expensive and slow. Synthetic data generation is a promising alternative, yet existing world models produce videos without paired action trajectories.

Technical Approach: VAG proposes a flow-matching-based dual-stream framework that jointly generates video and action from visual and language conditioning. An adaptive 3D pooling mechanism transfers global video context to the action branch, synchronizing denoising across both modalities to improve cross-modal consistency.
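
As a rough illustration of how pooling can hand a fixed-size summary of video features to an action branch, the sketch below implements plain adaptive average pooling over a (time, height, width) grid in pure Python. Treating VAG's adaptive 3D pooling as this standard operation is an assumption; the function names, the nested-list layout, and the output sizes are hypothetical and not taken from the paper.

```python
import math


def bin_edges(length, out):
    """Split range(length) into `out` contiguous, possibly overlapping bins."""
    return [(i * length // out, math.ceil((i + 1) * length / out))
            for i in range(out)]


def adaptive_avg_pool3d(video, out_t, out_h, out_w):
    """Average a T x H x W nested list down to out_t x out_h x out_w.

    Hypothetical helper: each output cell is the mean of the input cells
    that fall into its (time, height, width) bin.
    """
    t_len, h_len, w_len = len(video), len(video[0]), len(video[0][0])
    pooled = []
    for t0, t1 in bin_edges(t_len, out_t):
        plane = []
        for h0, h1 in bin_edges(h_len, out_h):
            row = []
            for w0, w1 in bin_edges(w_len, out_w):
                cells = [video[t][h][w]
                         for t in range(t0, t1)
                         for h in range(h0, h1)
                         for w in range(w0, w1)]
                row.append(sum(cells) / len(cells))
            plane.append(row)
        pooled.append(plane)
    return pooled


# Toy feature grid: 2 frames of 2 x 2 scalar features holding the values 0..7.
toy_video = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
global_context = adaptive_avg_pool3d(toy_video, 1, 1, 1)
```

Because the output size is fixed regardless of clip length or resolution, a summary like `global_context` could be passed to an action branch without that branch depending on the video shape.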

Key Takeaway: VAG produces aligned video-action pairs that improve downstream policy generalization in both simulated and real-world settings.

Multimodal Anomaly Detection for Human-Robot Interaction

Authors: Guilherme Ribeiro, Iordanis Antypas, Leonardo Bizzaro, et al. | Submitted: 2026-04-10 | arXiv: 2604.09326 | Categories: cs.RO, cs.CV

Research Background: Safe human-robot interaction (HRI) requires timely detection of unexpected events that could cause system failures or unsafe behaviors during collaborative tasks. Existing reconstruction-based anomaly detection models in HRI rely on single modalities and miss important cross-modal signals.

Technical Approach: The paper proposes a multimodal anomaly detection approach combining visual and force/torque signals from collaborative robot tasks. A cross-attention fusion mechanism identifies anomalies by detecting inconsistencies across modalities that indicate unexpected physical interactions.

Key Takeaway: Multimodal fusion substantially outperforms single-modality baselines in detecting HRI anomalies, improving both precision and recall for safety-critical events.

2D or 3D: Who Governs Salience in VLA Models? — Tri-Stage Token Pruning Framework with Modality Salience Awareness

Authors: Zihao Zheng, Sicheng Tian, Zhihao Mao, et al. | Submitted: 2026-04-10 | arXiv: 2604.09244 | Categories: cs.MM, cs.CV, cs.RO

Research Background: Modern vision-language-action (VLA) models are expanding from 2D-only to 2D+3D multi-visual-modal inputs for better spatial perception, but this dramatically increases token count and inference latency, demanding efficient acceleration methods.

Technical Approach: The paper analyzes how 2D and 3D modalities contribute differently to salience across the forward pass, then proposes a tri-stage token pruning framework: early pruning on single-modality redundancy, mid-stage cross-modal salience comparison, and late-stage task-specific compression.
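
The staged structure above can be sketched as three successive top-k filters, which also shows how per-stage keep ratios compound into an overall reduction. Everything specific here is a labeled assumption: the token fields (`redundancy`, `salience`, `task_relevance`), the scoring rules, and the keep ratios (0.8, 0.7, 0.7) are hypothetical placeholders, not the paper's criteria.

```python
def keep_top(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    n_keep = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:n_keep])]


def tri_stage_prune(tokens, scorers, keep_ratios):
    """Apply one scoring function and one keep ratio per stage, in sequence."""
    for scorer, ratio in zip(scorers, keep_ratios):
        tokens = keep_top(tokens, [scorer(tok) for tok in tokens], ratio)
    return tokens


# Hypothetical per-token statistics; a real model would compute these from
# attention maps or features at the corresponding depth of the forward pass.
toy_tokens = [
    {"id": i,
     "redundancy": (i * 37 % 100) / 100,
     "salience": (i * 53 % 100) / 100,
     "task_relevance": (i * 71 % 100) / 100}
    for i in range(100)
]

stage_scorers = [
    lambda tok: -tok["redundancy"],     # early: drop redundant tokens first
    lambda tok: tok["salience"],        # mid: keep the most salient tokens
    lambda tok: tok["task_relevance"],  # late: keep task-relevant tokens
]

kept = tri_stage_prune(toy_tokens, stage_scorers, keep_ratios=(0.8, 0.7, 0.7))
reduction = 1 - len(kept) / len(toy_tokens)
```

With these placeholder ratios, 100 tokens shrink to 80, then 56, then 39, a 61% reduction overall, so no single stage has to prune aggressively.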

Key Takeaway: The tri-stage pruning framework reduces VLA inference tokens by over 60% with less than 1% success-rate degradation on standard manipulation benchmarks.

V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

Authors: Yaru Liu, Ao-bo Wang, Nanyang Ye | Submitted: 2026-04-10 | arXiv: 2604.09036 | Categories: cs.RO

Research Background: Scaling VLA models requires massive datasets that are semantically coherent and physically feasible. Existing scene generation methods often lack context awareness, producing unreachable targets or physically implausible configurations that cause task failures during training.

Technical Approach: V-CAGE is a closed-loop generation engine that iteratively uses visual feedback to validate and refine generated scenes before they are added to the training corpus. An agentic controller checks physical feasibility and semantic consistency of generated scenarios against the robot’s actual workspace.

Key Takeaway: Closed-loop scene validation reduces infeasible training examples by a large margin, leading to VLA models with improved success rates on novel manipulation tasks.

Generative Simulation for Policy Learning in Physical Human-Robot Interaction

Authors: Junxiang Wang, Xinwen Xu, Tiancheng Wu, et al. | Submitted: 2026-04-09 | arXiv: 2604.08664 | Categories: cs.RO

Research Background: Developing autonomous physical human-robot interaction (pHRI) systems is limited by the scarcity of large-scale training data for learning robust robot behaviors. Real-world pHRI data collection is costly and raises safety concerns, creating a bottleneck for policy learning.

Technical Approach: A zero-shot “text2sim2real” generative simulation framework automatically synthesizes diverse pHRI scenarios from natural language prompts, using LLMs for scenario parameterization, physics simulation for interaction generation, and domain randomization for sim-to-real transfer.

Key Takeaway: The text2sim2real framework enables policy learning for pHRI tasks with no real-world data, achieving competitive performance compared to methods trained on real demonstrations.

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

Authors: Yunsong Zhou, Hangxu Liu, Xuekun Jiang, et al. | Submitted: 2026-04-09 | arXiv: 2604.08544 | Categories: cs.RO, cs.AI, cs.CV

Research Background: Robotic manipulation of deformable objects is data-intensive since shape, contact, and topology co-evolve in complex ways. Existing sim-to-real pipelines are rooted in rigid-body abstractions that produce geometry mismatches and fragile soft dynamics when applied to deformables.

Technical Approach: SIM1 uses a physics-aligned simulator that models deformable object dynamics at a fidelity sufficient for zero-shot transfer. A neural residual correction layer bridges the remaining sim-to-real gap, scaling training data generation without real-world collection.

Key Takeaway: SIM1 enables zero-shot policy transfer to real deformable-object manipulation tasks, outperforming simulation baselines that lack physics-aligned deformable modeling.

ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

Authors: Yanwen Zou, Chenyang Shi, Wenye Yu, et al. | Submitted: 2026-04-09 | arXiv: 2604.08534 | Categories: cs.RO

Research Background: Large-scale robot data collection requires bridging the embodiment gap between humans and robots. Existing pipelines rely on specialized handheld devices that burden operators and fail to capture the naturally coordinated perception-manipulation behaviors of human daily interactions.

Technical Approach: ActiveGlasses is an egocentric data collection system using smart glasses with active head-tracking cameras. An active vision policy learns to align the robot’s camera perspective with the human’s natural gaze direction during imitation learning, closing the perception loop.

Key Takeaway: Egocentric demonstration collection with active vision alignment significantly improves policy performance on manipulation tasks requiring precise visual feedback.

BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

Authors: Fan Yang, Wenrui Chen, Guorun Yan, et al. | Submitted: 2026-04-09 | arXiv: 2604.08410 | Categories: cs.CV, cs.RO

Research Background: Functional dexterous grasping in unstructured environments requires integrating semantic understanding, 3D spatial localization, and physically interpretable execution. Modular hierarchical approaches are more controllable than end-to-end VLA models but still rely on predefined affordance labels.

Technical Approach: BLaDA builds a language-to-action pipeline within 3D Gaussian Splatting (3DGS) fields. Language instructions are grounded to functional parts by querying the 3DGS scene representation, which provides rich semantic-geometric coupling for dexterous grasp planning without predefined labels.

Key Takeaway: Grounding language instructions directly in 3DGS fields enables open-vocabulary dexterous grasping without predefined affordance annotations, generalizing to novel objects and tasks.

A Unified Multi-Layer Framework for Skill Acquisition from Imperfect Human Demonstrations

Authors: Zi-Qi Yang, Mehrdad R. Kermani | Submitted: 2026-04-09 | arXiv: 2604.08341 | Categories: cs.RO

Research Background: HRI systems for robot skill teaching are fragmented: no existing framework is simultaneously efficient, intuitive, and universally safe. Real demonstrations are often imperfect, requiring systems that can handle noise and variability in human input.

Technical Approach: A layered control framework enables compliant learning from demonstration (LfD) built on a foundation of safety guarantees. The multi-layer architecture separates task-level skill learning from joint-level safety enforcement, allowing noisy demonstrations to be filtered and refined at each layer before the policy update.

Key Takeaway: The multi-layer LfD framework achieves robust skill acquisition from imperfect demonstrations while maintaining safety constraints throughout the learning process.

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Authors: Jindi Lv, Hao Li, Jie Li, et al. | Submitted: 2026-04-09 | arXiv: 2604.08168 | Categories: cs.RO, cs.AI

Research Background: VLA models advance robot manipulation via large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning (RL) value functions can guide policy improvement, but existing VLM-based value models struggle to capture temporal task dynamics.

Technical Approach: ViVa is a video-generative value model that imagines future video rollouts from the current observation and uses the generated trajectory to estimate long-horizon value. The video-generation backbone captures physical dynamics implicitly, providing richer temporal context than single-frame value models.

Key Takeaway: Video-generative value estimation substantially improves RL policy learning for robot manipulation, particularly on long-horizon tasks with sparse rewards.

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

Authors: Shuanghao Bai, Meng Li, Xinyuan Lv, et al. | Submitted: 2026-04-09 | arXiv: 2604.07993 | Categories: cs.RO

Research Background: Most VLA models treat robot body parts independently, making coordinated high-DoF humanoid whole-body control challenging and often unstable. Leveraging diverse robotic data across embodiments for humanoid training remains an open problem.

Technical Approach: HEX introduces a humanoid-aligned universal state representation that maps diverse robot embodiments to a common canonical space. Expert policies are trained per body region (arms, torso, base), then composed by a cross-embodiment mixture coordinator that aligns expert outputs to the target humanoid’s kinematics.

Key Takeaway: HEX achieves stable whole-body humanoid manipulation by leveraging cross-embodiment data, outperforming baselines trained on a single embodiment on complex bimanual tasks.

Learning Without Losing Identity: Capability Evolution for Embodied Agents

Authors: Xue Qin, Simin Luan, John See, et al. | Submitted: 2026-04-09 | arXiv: 2604.07799 | Categories: cs.RO, cs.AI

Research Background: Embodied agents must continuously acquire new capabilities in dynamic environments without destabilizing existing behavior. Current approaches (prompt engineering, policy updates, or structural redesign) lead to identity instability and capability regression in long-lived systems.

Technical Approach: The paper proposes a capability evolution framework that separates core identity representations from task-specific skill modules. New skills are learned in isolation and integrated via a gating mechanism that preserves the agent’s existing behavioral profile, preventing catastrophic forgetting.

Key Takeaway: Capability evolution with identity preservation enables embodied agents to continuously learn new skills while maintaining stable performance on previously mastered tasks.

LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Authors: Jingjing Wang, Zhengdong Hong, Chong Bao, et al. | Submitted: 2026-04-09 | arXiv: 2604.08475 | Categories: cs.CV

Research Background: Human-like generalization in open-world robotic manipulation remains a fundamental challenge. Existing learning-based methods, including RL, imitation learning, and VLAs, often struggle with novel tasks and unseen environments because they lack fine-grained spatial understanding.

Technical Approach: LAMP lifts 2D image-editing diffusion priors into 3D geometric representations usable by manipulation policies. Goal images are generated by editing the current observation, then lifted into 3D point clouds that serve as spatial goals for a closed-loop visuomotor policy.

Key Takeaway: Using image-editing models as 3D priors enables zero-shot generalization to novel manipulation tasks without requiring task-specific data collection.

EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

Authors: Ryan Punamiya, Simar Kareer, Zeyi Liu, et al. | Submitted: 2026-04-08 | arXiv: 2604.07607 | Categories: cs.RO, cs.CV

Research Background: Robot learning increasingly depends on large, diverse data, but robot data collection is expensive. Egocentric human video data offers a promising alternative that captures rich manipulation behavior, but existing datasets are limited in scope and fragmented across institutions.

Technical Approach: EgoVerse is a large-scale collection framework aggregating egocentric human manipulation data globally. Standardized video processing, hand-pose annotation, and object segmentation pipelines extract robot-transferable manipulation signals from diverse cultural and environmental contexts.

Key Takeaway: Pre-training robot policies on EgoVerse’s diverse egocentric human data improves generalization to novel manipulation environments compared to robot-only dataset pre-training.

Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

Authors: Chao Tang, Jiacheng Xu, Haofei Lu, et al. | Submitted: 2026-04-08 | arXiv: 2604.07517 | Categories: cs.RO

Research Background: Generalizable functional grasping in open-world environments is challenging due to vast object and task diversity. Existing methods are either limited to narrow object sets or require prohibitively large real-world data collection.

Technical Approach: GraspDream generates synthetic human hand demonstrations for any target object and grasping intent using a video diffusion model. The generated demonstrations are retargeted to robot end-effectors and used for imitation learning, sidestepping real-world data collection entirely.

Key Takeaway: Imitating functional grasping from generated human demonstrations achieves competitive generalization to novel objects without any real-world grasping demonstrations.

TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks

Authors: Longyan Wu, Jieji Ren, Chenghang Jiang, et al. | Submitted: 2026-04-08 | arXiv: 2604.07335 | Categories: cs.RO

Research Background: Contact-rich bimanual manipulation data collection via handheld paradigms is hindered by limited hardware adaptability and data efficacy. Prior designs are gripper-specific and face trade-offs between tracking precision and tactile feedback.

Technical Approach: TAMEn is a hardware-software system combining a gripper-agnostic tactile sensing module with a closed-loop data collection pipeline. Tactile feedback lets the operator sense contact forces during demonstration, while the closed-loop controller rejects demonstrations with poor contact profiles.
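
One simple way to picture the rejection step is a rule-based filter over recorded contact forces. The sketch below is only an illustration under stated assumptions: the summary does not say how TAMEn judges a contact profile, so the criteria (minimum time in contact, maximum peak force), the thresholds, and the `forces` field in newtons are all hypothetical.

```python
def contact_profile_ok(forces, contact_threshold=0.5,
                       min_contact_fraction=0.3, max_force=15.0):
    """Accept a force trace (newtons) with enough contact and no force spikes.

    All thresholds are hypothetical placeholders, not values from the paper.
    """
    if not forces:
        return False
    in_contact = sum(1 for f in forces if f >= contact_threshold)
    enough_contact = in_contact / len(forces) >= min_contact_fraction
    return enough_contact and max(forces) <= max_force


def filter_demonstrations(demos, **thresholds):
    """Keep only demonstrations whose force trace passes the check."""
    return [d for d in demos if contact_profile_ok(d["forces"], **thresholds)]


demos = [
    {"id": "good", "forces": [0.0, 1.0, 2.0, 2.0, 1.0]},
    {"id": "no_contact", "forces": [0.0, 0.0, 0.0, 0.1, 0.0]},
    {"id": "force_spike", "forces": [1.0, 2.0, 30.0]},
]
accepted = [d["id"] for d in filter_demonstrations(demos)]
```

Running such a check at collection time, rather than during offline cleaning, lets the operator repeat a rejected demonstration immediately.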

Key Takeaway: Tactile-aware closed-loop data collection yields significantly higher-quality demonstration data for contact-rich bimanual tasks, improving downstream policy success rates.

Intuitive Human-Robot Interaction: Development and Evaluation of a Gesture-Based User Interface for Object Selection

Authors: Bijan Kavousian, Oliver Petrovic, Werner Herfs | Submitted: 2026-04-07 | arXiv: 2604.06073 | Categories: cs.RO, cs.HC

Research Background: Gestures are a natural form of human communication that can be leveraged for intuitive HRI. Existing robot instruction interfaces rely on keyboards, pendants, or voice commands that require training and reduce collaboration naturalness.

Technical Approach: A gesture-based user interface uses pointing and click gestures for object selection in collaborative robot tasks. The system combines hand tracking, gaze estimation, and a probabilistic object selector to resolve pointing ambiguity, and it confirms selections with a confirmation gesture.
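
To make the idea of a probabilistic object selector concrete, the sketch below scores each candidate object by how closely it lines up with a pointing ray and converts the scores into probabilities with a softmax. This is an illustrative assumption, not the authors' selector: the function name, the `sharpness` parameter, and the use of the pointing ray alone (without the gaze cue the paper also uses) are hypothetical simplifications.

```python
import math


def selection_probabilities(origin, direction, objects, sharpness=20.0):
    """Return {object name: probability} for a pointing ray.

    Each object is scored by the cosine of the angle between the pointing
    direction and the direction from the hand to the object; a softmax turns
    the scores into a distribution. `sharpness` (hypothetical) controls how
    strongly small angular errors are penalized.
    """
    def unit(vec):
        norm = math.sqrt(sum(c * c for c in vec))
        return [c / norm for c in vec]

    ray = unit(direction)
    names = list(objects)
    scores = []
    for name in names:
        to_object = unit([p - o for p, o in zip(objects[name], origin)])
        cos_angle = sum(a * b for a, b in zip(ray, to_object))
        scores.append(sharpness * cos_angle)

    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {name: w / total for name, w in zip(names, weights)}


# Hand at the origin pointing along +x; the cup lies on the ray, the bottle
# sits 45 degrees off it.
probs = selection_probabilities(
    origin=(0.0, 0.0, 0.0),
    direction=(1.0, 0.0, 0.0),
    objects={"cup": (1.0, 0.0, 0.0), "bottle": (1.0, 1.0, 0.0)},
)
best = max(probs, key=probs.get)
```

Returning a full distribution rather than a single winner is what allows a confirmation step: when the top two probabilities are close, an interface can ask the user to confirm instead of guessing.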

Key Takeaway: A 20-participant study demonstrates that the gesture interface achieves high accuracy and competitive selection times, supporting efficient gesture-based HRI without dedicated training.

HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning

Authors: Jiyao Zhang, Zimu Han, Junhan Wang, et al. | Submitted: 2026-04-07 | arXiv: 2604.06067 | Categories: cs.RO

Research Background: Robotic imitation learning faces a trade-off between modeling long-horizon dependencies and enabling fine-grained closed-loop control. Fixed-frequency action chunking approaches struggle to capture both coarse task structure and precise manipulation dynamics simultaneously.

Technical Approach: HiPolicy introduces hierarchical multi-frequency action chunking that jointly predicts action sequences at multiple temporal scales. A high-frequency fine-motion predictor and a low-frequency task-level planner are jointly trained and synchronized through a cross-frequency attention bridge.

Key Takeaway: Multi-frequency action chunking resolves the precision-horizon trade-off, achieving state-of-the-art performance on long-horizon manipulation benchmarks requiring precise contact control.

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

Authors: Kaidong Zhang, Jian Zhang, Rongtao Xu, et al. | Submitted: 2026-04-07 | arXiv: 2604.05672 | Categories: cs.RO

Research Background: VLA models are powerful for open-world robot manipulation, but billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute that make real-time control expensive on commodity hardware.

Technical Approach: A1 is a fully open-source VLA that truncates the VLM backbone to reduce compute while preserving spatial reasoning capability through a selective layer pruning strategy. An adaptive action head replaces multi-step diffusion with single-pass prediction, achieving low-latency control on consumer GPUs.

Key Takeaway: A1 achieves competitive manipulation performance with 10x lower inference latency compared to full-scale VLAs, making real-time VLA deployment feasible on commodity hardware.

Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

Authors: Theodor Wulff, Federico Tavella, Rahul Singh Maharjan, et al. | Submitted: 2026-04-07 | arXiv: 2604.05614 | Categories: cs.RO

Research Background: Robot transparency is critical for effective human-robot collaboration. Hierarchical VLA models can generate language and low-level actions, but current approaches lack explicit grounding between natural language outputs and the corresponding physical actions taken.

Technical Approach: The paper introduces an explicit language-action alignment training objective that grounds each language segment to its corresponding action subsequence. A dual-decoder architecture jointly generates action tokens and language tokens in a synchronized manner, enforcing mutual consistency through a contrastive alignment loss.
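
A contrastive alignment loss of the kind mentioned above is commonly written as an InfoNCE objective. The sketch below is a generic one-directional version in pure Python, offered under the assumption that the paper's loss is similar in spirit; the function name, the temperature value, and the toy embeddings are hypothetical, and a real implementation would run on batched tensors.

```python
import math


def info_nce(language_embs, action_embs, temperature=0.1):
    """Mean cross-entropy of matching language segment i to action chunk i.

    Embeddings are L2-normalized, so similarities are cosines. Pairs that
    share an index are positives; every other action chunk is a negative.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(vec):
        norm = math.sqrt(dot(vec, vec)) or 1.0
        return [x / norm for x in vec]

    lang = [normalize(v) for v in language_embs]
    acts = [normalize(v) for v in action_embs]

    loss = 0.0
    for i, segment in enumerate(lang):
        logits = [dot(segment, a) / temperature for a in acts]
        top = max(logits)
        # log-sum-exp with the maximum subtracted for numerical stability
        log_z = top + math.log(sum(math.exp(z - top) for z in logits))
        loss += log_z - logits[i]
    return loss / len(lang)


# Toy check: correctly paired embeddings give a lower loss than swapped ones.
segments = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(segments, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(segments, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each language segment toward its own action subsequence and pushes it away from the others, which is the sense in which a description becomes grounded in the action it accompanies.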

Key Takeaway: Explicit language-action alignment produces VLA models whose verbal descriptions are faithfully grounded in their actions, improving both robot transparency and manipulation success.

Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

Authors: Baoshun Tong, Haoran He, Ling Pan, et al. | Submitted: 2026-04-07 | arXiv: 2604.05595 | Categories: cs.RO, cs.CV

Research Background: VLA models have achieved remarkable success in robot manipulation, but their robustness to linguistic variations remains a critical, underexplored safety concern. Small perturbations in language instructions could cause catastrophic manipulation failures in deployment.

Technical Approach: A diversity-aware red teaming framework systematically generates linguistically diverse adversarial prompts to probe VLA model failures. The method uses a diversity constraint to maximize coverage of the linguistic variation space, identifying fragile instruction patterns across paraphrasing, negation, and compositional complexity.
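
One common way to encourage coverage of a variation space is greedy max-min (farthest-point) selection. The sketch below applies it to candidate instructions using Jaccard distance over word sets. It is a stand-in chosen for illustration, not the paper's diversity constraint; the distance measure, function names, and example prompts are hypothetical.

```python
def jaccard_distance(a, b):
    """1 minus the word-set overlap between two instructions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def select_diverse(prompts, k):
    """Greedily pick up to k prompts, each farthest from those already chosen."""
    pool = list(dict.fromkeys(prompts))  # drop exact duplicates, keep order
    if not pool or k <= 0:
        return []
    chosen = [pool[0]]
    while len(chosen) < min(k, len(pool)):
        remaining = [p for p in pool if p not in chosen]
        best = max(remaining,
                   key=lambda p: min(jaccard_distance(p, c) for c in chosen))
        chosen.append(best)
    return chosen


candidates = [
    "pick up the red cup",
    "pick up the red mug",
    "do not touch the blue plate",
]
probe_set = select_diverse(candidates, 2)
```

Here the negated instruction is selected over the near-paraphrase because it shares only one word with the first prompt, so a small probe budget is spent on distinct variation types rather than on near-duplicates.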

Key Takeaway: Current VLA models exhibit significant linguistic fragility: seemingly minor instruction variations can reduce manipulation success by over 30%, highlighting a critical robustness gap.

Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

Authors: Jiahua Ma, Yiran Qin, Xin Wen, et al. | Submitted: 2026-04-07 | arXiv: 2604.05544 | Categories: cs.RO, cs.CV

Research Background: Visuomotor policies trained on expert demonstrations struggle to recover from out-of-distribution execution errors or to dynamically re-route trajectories mid-task. Standard imitation learning provides no mechanism for closed-loop error recovery.

Technical Approach: ReV introduces a referring expression module that grounds current visual observations to task-relevant regions at each timestep. A closed-loop adaptation mechanism dynamically adjusts the policy output based on discrepancies between current observations and the expected state from training demonstrations.

Key Takeaway: Referring-aware closed-loop control enables robust recovery from execution errors without requiring additional failure-recovery demonstrations, significantly improving manipulation success rates.

VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for VLA Inference Acceleration

Authors: Chuhang Liu, Yayun He, Zuheng Kang, et al. | Submitted: 2026-04-07 | arXiv: 2604.05323 | Categories: cs.CV, cs.RO

Research Background: VLA models that jointly process high-dimensional visual features, complex language inputs, and continuous action sequences incur significant computational overhead, limiting real-time robot control. Token reduction methods exist for LLMs but are not directly applicable to VLA cross-modal inference.

Technical Approach: VLA-InfoEntropy is a training-free approach that estimates the information entropy of vision attention maps at inference time to identify and prune low-information visual tokens. High-entropy regions retain tokens while low-entropy regions are pruned, adaptively allocating compute to visually important scene regions.
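
The entropy criterion can be illustrated with a few lines of Python: compute the Shannon entropy of the attention weights in each image region and keep the regions with the highest entropy. This is a minimal sketch under stated assumptions. How the method partitions attention maps into regions and how many tokens it keeps are not given in the summary, so the region layout, the `keep_ratio` parameter, and the function names are hypothetical.

```python
import math


def shannon_entropy(weights):
    """Entropy in nats of non-negative weights after normalizing them to sum to 1."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def prune_by_entropy(region_attention, keep_ratio=0.5):
    """Return the sorted indices of the regions kept (highest entropy first)."""
    scores = [shannon_entropy(region) for region in region_attention]
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four toy regions, each with attention weights over four visual tokens.
regions = [
    [0.25, 0.25, 0.25, 0.25],  # uniform attention: maximum entropy
    [1.00, 0.00, 0.00, 0.00],  # single spike: zero entropy
    [0.50, 0.50, 0.00, 0.00],  # two equal peaks
    [0.97, 0.01, 0.01, 0.01],  # nearly a spike
]
kept_regions = prune_by_entropy(regions, keep_ratio=0.5)
```

Because the score is computed from attention maps the model already produces, a rule like this needs no retraining, which matches the training-free framing above.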

Key Takeaway: Training-free entropy-guided token pruning reduces VLA inference cost by 40% with negligible performance degradation, enabling deployment on resource-constrained hardware.

ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

Authors: Souren Pashangpour, Haitong Wang, Matthew Lisondra, et al. | Submitted: 2026-04-07 | arXiv: 2604.05320 | Categories: cs.RO

Research Background: Mobile manipulators deployed in human-centered environments need to communicate their intent expressively to surrounding people. Prior work on expressive robot behaviors uses preprogrammed motions or LLM-generated high-level plans that lack natural integration with manipulation actions.

Technical Approach: ExpressMM combines LfD-based motion primitives with a learned expressiveness layer that modulates robot body language (velocity profiles, gaze direction, spatial proxemics) to communicate task intent during mobile manipulation. A user-study-driven optimization objective aligns expressive behaviors to human legibility judgments.

Key Takeaway: Expressive mobile manipulation behaviors significantly improve bystander understanding of robot intent, reducing perceived safety concerns without slowing task execution.

SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

Authors: Wuyang Luan, Junhui Li, Weiguang Zhao, et al. | Submitted: 2026-04-07 | arXiv: 2604.05656 | Categories: cs.CV, cs.AI

Research Background: Flow-matching VLAs like pi0, pi0.5, and SmolVLA achieve state-of-the-art generalist robot manipulation, but their iterative denoising (typically 10 ODE steps) accounts for 80% of inference time, creating a significant latency bottleneck for real-time control.

Technical Approach: SnapFlow uses progressive self-distillation to compress a multi-step flow-matching VLA into a single-step action generator. A teacher-student distillation curriculum progressively reduces ODE steps from N to 1, with intermediate checkpoints as teachers, preserving action quality at each compression stage.
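
The two ingredients named above, a multi-step ODE sampler and a schedule that shrinks the step count, can be sketched in a few lines. The Euler integrator below is the standard way to sample from a flow-matching model; the halving schedule is an assumption chosen for illustration, since the summary only says that steps are reduced progressively from N to 1. The toy velocity function stands in for a trained network, and no distillation training is shown.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x


def halving_schedule(n):
    """Step counts for successive distillation stages, e.g. 10 -> 5 -> 2 -> 1.

    Hypothetical curriculum: each stage's student uses about half the steps
    of its teacher, which is the previous stage's checkpoint.
    """
    steps = [n]
    while steps[-1] > 1:
        steps.append(max(1, steps[-1] // 2))
    return steps


def toy_velocity(x, t):
    """Stand-in for a learned velocity field: a constant, straight-line flow."""
    return 2.0


schedule = halving_schedule(10)
samples = {n: euler_sample(toy_velocity, 0.0, n) for n in schedule}
```

For a perfectly straight flow like this toy one, every step count lands on the same endpoint, which gives some intuition for why a well-distilled student can match its teacher with far fewer steps; real learned flows are curved, so each stage has to be trained to close that gap.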

Key Takeaway: Progressive self-distillation compresses flow-matching VLAs to single-step inference with minimal performance loss, achieving 10x latency reduction over multi-step baselines.

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Authors: StarVLA Community | Submitted: 2026-04-06 | arXiv: 2604.05014 | Categories: cs.RO, cs.AI, cs.CV

Research Background: Despite rapid progress in VLA research, methods remain fragmented across incompatible architectures, codebases, and evaluation setups. This fragmentation slows progress by making it difficult to compare methods, reproduce results, or combine components across approaches.

Technical Approach: StarVLA provides a modular, composable codebase with standardized interfaces for VLM backbones, action heads, training pipelines, and evaluation protocols. Each component is designed as an interchangeable Lego-like block, enabling mix-and-match experimentation across VLA architectures without re-implementing infrastructure.

Key Takeaway: StarVLA’s unified codebase enables rapid VLA research iteration, reproducing baseline results with significantly less setup effort and enabling fair cross-architecture comparison.

E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Authors: Jiajun Zhai, Hao Shi, Shangwei Guo, et al. | Submitted: 2026-04-06 | arXiv: 2604.04834 Categories: cs.CV, cs.MM, cs.RO, eess.IV

Research Background: VLA models generalize well for open-ended manipulation but their visual perception is fragile under sensing degradations like extreme low light, motion blur, and image clipping — common conditions in real-world deployment environments. 研究背景： VLA 模型在開放性操作中泛化良好，但其視覺感知在極低光線、運動模糊和圖像裁剪等感知退化下很脆弱，這些是真實世界部署環境中的常見條件。

Technical Approach: E-VLA augments conventional frame-based VLA with event camera input, which captures high dynamic range signals unaffected by low light or motion blur. Rather than image reconstruction from events, E-VLA uses events directly as a complementary modality fused into the VLA token sequence. 技術方法： E-VLA 用事件相機輸入增強傳統幀式 VLA，事件相機捕捉不受低光線或運動模糊影響的高動態範圍信號。E-VLA 直接將事件作為互補模態融合到 VLA token 序列中，而非從事件重建圖像。
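
The digest does not say how E-VLA turns raw events into tokens. As one plausible preprocessing step, the sketch below accumulates events into a time-binned grid of signed counts, a common event representation. The function name `events_to_grid`, the `(x, y, timestamp, polarity)` tuple layout, and the binning scheme are assumptions, not details from the paper.

```python
from typing import List, Tuple

Event = Tuple[int, int, float, int]  # (x, y, timestamp_s, polarity in {-1, +1})
Grid = List[List[List[int]]]         # indexed as [time_bin][y][x]


def events_to_grid(events: List[Event], width: int, height: int,
                   n_bins: int, t_start: float, t_end: float) -> Grid:
    """Accumulate signed event counts into an n_bins x height x width grid."""
    grid = [[[0] * width for _ in range(height)] for _ in range(n_bins)]
    span = t_end - t_start
    for x, y, t, polarity in events:
        if not (t_start <= t < t_end):
            continue  # drop events outside the time window
        b = min(n_bins - 1, int((t - t_start) / span * n_bins))
        grid[b][y][x] += polarity
    return grid


events = [(0, 0, 0.00, +1), (0, 0, 0.01, +1), (1, 1, 0.06, -1), (1, 0, 0.20, +1)]
grid = events_to_grid(events, width=2, height=2, n_bins=2, t_start=0.0, t_end=0.1)
```

Each time bin could then be split into patches and embedded alongside the image tokens; that fusion step is left out here because the digest gives no detail on it.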

Key Takeaway: Event camera augmentation substantially improves VLA manipulation robustness in low-light and high-motion conditions, with only marginal overhead in normal operating conditions. 核心發現： 事件相機增強在低光線和高運動條件下顯著提升 VLA 操作穩健性，在正常操作條件下僅增加邊際開銷。

AnyUser: Translating Sketched User Intent into Domestic Robots

Authors: Songyuan Yang, Huibin Tan, Kailun Yang, et al. | Submitted: 2026-04-06 | arXiv: 2604.04811 Categories: cs.RO, cs.CV, cs.HC

Research Background: Domestic robot instruction needs to be accessible to non-expert users without requiring technical knowledge or formal command syntax. Current interfaces like voice or touchscreen commands lack spatial expressiveness for specifying precise manipulation goals. 研究背景： 家用機器人指令需要對非專業用戶友好，無需技術知識或正式命令語法。當前的語音或觸屏命令介面缺乏指定精確操作目標的空間表達能力。

Technical Approach: AnyUser interprets free-form sketches drawn on camera images, optionally combined with language, as spatial-semantic manipulation instructions. Novel multimodal fusion components parse sketch strokes and spatial relations from the image context to generate executable robot action sequences without requiring pre-built maps. 技術方法： AnyUser 解釋在相機圖像上繪製的自由形式草圖（可選與語言結合）作為空間語義操作指令。新型多模態融合組件從圖像上下文中解析草圖筆畫和空間關係，無需預建地圖即可生成可執行的機器人動作序列。

Key Takeaway: Sketch-based robot instruction achieves higher spatial precision than language-only interfaces for domestic manipulation tasks, with high user satisfaction among non-expert participants. 核心發現： 基於草圖的機器人指令在家用操作任務中實現比純語言介面更高的空間精度，非專業參與者的用戶滿意度高。

ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

Authors: Rongfeng Zhao, Xuanhao Zhang, Zhaochen Guo, et al. | Submitted: 2026-04-06 | arXiv: 2604.04664 Categories: cs.RO, cs.AI, cs.MA

Research Background: Integrating LLMs with embodied agents improves high-level reasoning but a critical gap remains between semantic understanding and physical execution. VLA and VLN systems handle single-agent tasks but struggle with heterogeneous multi-agent coordination. 研究背景： 將 LLM 與具身智能體整合改善了高層推理，但語義理解和物理執行之間仍存在關鍵差距。VLA 和 VLN 系統處理單智能體任務，但在異質多智能體協調方面掙扎。

Technical Approach: ROSClaw introduces a hierarchical semantic-physical framework where an LLM-based semantic planner decomposes tasks into sub-goals assigned to heterogeneous robot agents. A physical execution layer with VLA-based atomic action primitives executes sub-goals and reports completion signals back to the semantic coordinator. 技術方法： ROSClaw 引入層次化語義-物理框架，基於 LLM 的語義規劃器將任務分解為分配給異質機器人智能體的子目標。基於 VLA 的原子動作基元物理執行層執行子目標並將完成信號報告回語義協調器。
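
A minimal sketch of the two layers, under strong simplifying assumptions: the LLM planner is replaced by a hard-coded list of sub-goals, agents are matched to sub-goals by a declared skill set, and the `execute` callback stands in for the VLA-based primitives. The names `assign_subgoals` and `run_plan` and the data layout are hypothetical and not taken from ROSClaw.

```python
from typing import Callable, Dict, List, Set

SubGoal = Dict[str, str]  # {"name": ..., "skill": ...}


def assign_subgoals(subgoals: List[SubGoal],
                    agents: Dict[str, Set[str]]) -> Dict[str, str]:
    """Assign each sub-goal to the first agent that has the required skill."""
    assignment = {}
    for goal in subgoals:
        capable = [name for name, skills in agents.items() if goal["skill"] in skills]
        if not capable:
            raise ValueError(f"no agent can perform {goal['skill']!r}")
        assignment[goal["name"]] = capable[0]
    return assignment


def run_plan(subgoals: List[SubGoal], agents: Dict[str, Set[str]],
             execute: Callable[[str, SubGoal], bool]) -> List[str]:
    """Execute sub-goals in order and collect the ones reported as completed."""
    assignment = assign_subgoals(subgoals, agents)
    completed = []
    for goal in subgoals:
        if not execute(assignment[goal["name"]], goal):
            break  # a failed completion signal stops the plan for replanning
        completed.append(goal["name"])
    return completed


agents = {"mobile_base": {"navigate"}, "arm": {"grasp", "place"}}
subgoals = [
    {"name": "go_to_table", "skill": "navigate"},
    {"name": "pick_cup", "skill": "grasp"},
    {"name": "put_on_shelf", "skill": "place"},
]
assignment = assign_subgoals(subgoals, agents)
completed = run_plan(subgoals, agents, execute=lambda agent, goal: True)
```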

Key Takeaway: The hierarchical semantic-physical framework enables heterogeneous robot teams to collaborate on complex tasks requiring both navigation and manipulation across diverse environments. 核心發現： 層次化語義-物理框架使異質機器人團隊能夠在需要跨多樣環境的導航和操作的複雜任務上協作。

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Authors: Donghu Kim, Youngdo Lee, Minho Park, et al. | Submitted: 2026-04-06 | arXiv: 2604.04539 Categories: cs.LG, cs.RO

Research Background: Reinforcement learning (RL) is central to robot control when expert demonstrations are unavailable. On-policy methods like PPO are stable but limited by narrow data distributions; off-policy methods can overcome this but often become unstable in high-dimensional robot state/action spaces. 研究背景： 當專家示範不可用時，強化學習是機器人控制的核心方法。PPO 等 on-policy 方法穩定但受窄資料分布限制；off-policy 方法可以克服這一限制，但在高維機器人狀態/動作空間中常常出現不穩定性。

Technical Approach: FlashSAC extends Soft Actor-Critic (SAC) with a flash replay mechanism that prioritizes high-gradient transitions for efficient off-policy learning, and a spectral normalization scheme applied to critic networks that stabilizes Q-value estimation in high-dimensional spaces. 技術方法： FlashSAC 擴展軟演員-評論家（SAC）算法，引入優先高梯度轉換的快閃回放機制和應用於評論家網絡的譜正規化方案，在高維空間穩定 Q 值估計。
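
Spectral normalization rescales a weight matrix by its largest singular value so that the layer's Lipschitz constant is at most one. The sketch below estimates that singular value by power iteration in plain Python. How FlashSAC computes or applies the normalization is not stated in the digest, so the estimator, the function names, and the iteration count are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def _matvec(w: Matrix, v: List[float]) -> List[float]:
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def spectral_norm(w: Matrix, n_iters: int = 50) -> float:
    """Estimate the largest singular value of w by power iteration on w^T w."""
    n_rows, n_cols = len(w), len(w[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iters):
        u = _matvec(w, v)                                                   # u = W v
        v = [sum(w[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return math.sqrt(sum(x * x for x in _matvec(w, v)))


def spectrally_normalize(w: Matrix) -> Matrix:
    """Divide every entry by the estimated largest singular value."""
    sigma = spectral_norm(w)
    return [[x / sigma for x in row] for row in w]


w = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(w)
w_sn = spectrally_normalize(w)
```

For this diagonal example the estimate converges to 3, and the normalized matrix has spectral norm close to 1. Deep learning libraries typically run only one or a few power-iteration steps per update and reuse the vector between updates.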

Key Takeaway: FlashSAC achieves faster convergence and higher final performance than PPO and standard SAC on high-dimensional robot control tasks including manipulation and locomotion. 核心發現： FlashSAC 在包括操作和運動的高維機器人控制任務上，比 PPO 和標準 SAC 達到更快的收斂和更高的最終性能。

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

Authors: Zhongru Zhang, Chenghan Yang, Qingzhou Lu, et al. | Submitted: 2026-04-06 | arXiv: 2604.04502 Categories: cs.RO

Research Background: Video generation models have advanced rapidly and show strong understanding of physical dynamics. A key open question is whether frontier video generation models like Veo-3 can serve as generalizable robot manipulation controllers by predicting physically plausible future states. 研究背景： 影片生成模型快速進步，展示對物理動態的強大理解。關鍵的開放問題是前沿影片生成模型（如 Veo-3）是否能通過預測物理上合理的未來狀態作為通用機器人操作控制器。

Technical Approach: Veo-Act uses Veo-3 as a predictive world model: given the current robot observation, Veo-3 generates a video of the desired future trajectory, then an inverse dynamics model extracts executable robot actions from the generated sequence. Zero-shot and few-shot adaptation modes are evaluated. 技術方法： Veo-Act 使用 Veo-3 作為預測世界模型：給定當前機器人觀測，Veo-3 生成所需未來軌跡的影片，然後逆動力學模型從生成序列中提取可執行的機器人動作。評估零樣本和少樣本自適應模式。

Key Takeaway: Frontier video models provide strong zero-shot manipulation capabilities in familiar environments but struggle with precise contact tasks, suggesting video generation as a useful but incomplete path to generalizable robot control. 核心發現： 前沿影片模型在熟悉環境中提供強大的零樣本操作能力，但在精確接觸任務上掙扎，表明影片生成是通向通用機器人控制的有用但不完整的路徑。

Robust Adaptive Backstepping Impedance Control of Robots in Unknown Environments

Authors: Reza Nazmara, Alap Kshirsagar, Jan Peters, et al. | Submitted: 2026-04-10 | arXiv: 2604.09323 Categories: cs.RO

Research Background: Robots operating in contact-rich and uncertain environments need control strategies that handle unknown external disturbances and unmodeled dynamics. Standard impedance control assumes known dynamic parameters, limiting deployment in unstructured settings. 研究背景： 在接觸豐富和不確定環境中操作的機器人需要處理未知外部擾動和未建模動態的控制策略。標準阻抗控制假設已知動態參數，限制了在非結構化環境中的部署。

Technical Approach: RABIC accounts for the complete coupled dynamics without requiring prior knowledge of dynamic parameters. Adaptive laws online-estimate robot dynamics and disturbances, while backstepping guarantees stability under the full parameter uncertainty range. 技術方法： RABIC 在不需要動態參數先驗知識的情況下考慮完整的耦合動態。自適應律在線估計機器人動態和擾動，而反步法保證在完整參數不確定性範圍下的穩定性。

Key Takeaway: RABIC maintains stable compliant interaction with unknown contact environments, outperforming existing impedance control methods that assume known or partially-known dynamics. 核心發現： RABIC 在未知接觸環境中保持穩定的順應性互動，優於假設已知或部分已知動態的現有阻抗控制方法。

One Interface, Many Robots: Unified Real-Time Low-Level Motion Planning for Collaborative Arms

Authors: Yue Feng, Weicheng Huang, I-Ming Chen | Submitted: 2026-04-09 | arXiv: 2604.08787 Categories: cs.RO

Research Background: Real-time motion planning for collaborative robotic arms is typically hardware-specific, requiring separate implementations for each robot model. This fragmentation increases development effort and hinders cross-platform deployment. 研究背景： 協作機械臂的即時運動規劃通常是硬體特定的，需要為每個機器人型號進行單獨實現。這種碎片化增加了開發工作量並阻礙了跨平台部署。

Technical Approach: The paper extends WinGs Operating Studio (WOS) with a unified low-level motion planning interface that abstracts heterogeneous robot kinematics into common software resources. Real-time planning algorithms are wrapped in a hardware-agnostic API, enabling the same high-level policy to execute across different collaborative arm platforms. 技術方法： 論文擴展 WinGs Operating Studio（WOS），增加統一的低層運動規劃介面，將異質機器人運動學抽象為通用軟體資源。即時規劃算法包裝在硬體無關的 API 中，使相同的高層策略能夠在不同的協作臂平台上執行。

Key Takeaway: A unified motion planning interface reduces cross-platform integration effort significantly, enabling the same control policies to run across multiple collaborative arm models without modification. 核心發現： 統一的運動規劃介面顯著減少跨平台整合工作量，使相同的控制策略無需修改即可在多個協作臂模型上運行。

A-SLIP: Acoustic Sensing for Continuous In-hand Slip Estimation

Authors: Uksang Yoo, Yuemin Mao, Jean Oh, et al. | Submitted: 2026-04-09 | arXiv: 2604.08528 Categories: cs.RO

Research Background: Reliable in-hand manipulation requires accurate real-time slip estimation between gripper and object. Existing tactile sensing approaches based on vision, capacitance, or force-torque face trade-offs in form factor, durability, and the ability to jointly estimate slip direction and magnitude. 研究背景： 可靠的手內操作需要準確的即時滑動估計。現有的基於視覺、電容或力/力矩的觸覺感知方法在形態因子、耐久性以及聯合估計滑動方向和大小的能力方面面臨權衡。

Technical Approach: A-SLIP uses multi-channel acoustic sensing integrated into the gripper fingertips to capture the characteristic acoustic signatures of slip events. A learned slip classifier and continuous estimator process acoustic time-series to output slip direction and magnitude at high frequency. 技術方法： A-SLIP 使用整合在夾具指尖的多通道聲學感測，捕捉滑動事件的特徵聲學特徵。學習的滑動分類器和連續估計器處理聲學時間序列，高頻輸出滑動方向和大小。

Key Takeaway: Acoustic slip sensing achieves higher accuracy and lower latency than vision or force-based alternatives with a minimal form-factor addition, enabling reliable real-time in-hand manipulation. 核心發現： 聲學滑動感測以最小的形態因子增加，比視覺或力矩替代方案達到更高的準確率和更低的延遲，實現可靠的即時手內操作。

LEGO: Latent-space Exploration for Geometry-aware Optimization of Humanoid Kinematic Design

Authors: Jihwan Yoon, Taemoon Jeong, Jeongeun Park, et al. | Submitted: 2026-04-09 | arXiv: 2604.08636 Categories: cs.RO, cs.AI

Research Background: Designing robot morphologies and kinematics has relied heavily on human intuition. Motion-design co-optimization offers a path toward automation, but the vast unstructured design space and difficulty constructing task-specific loss functions remain major obstacles. 研究背景： 機器人形態和運動學設計一直嚴重依賴人類直覺。運動-設計協同優化提供了自動化的路徑，但龐大的非結構化設計空間和構建任務特定損失函數的困難仍是主要障礙。

Technical Approach: LEGO embeds robot kinematic designs in a learned latent space that captures geometric structure, enabling gradient-based and evolutionary optimization within the latent space rather than the raw design space. A geometry-aware loss function measures task-relevant kinematic properties directly from the latent representation. 技術方法： LEGO 將機器人運動學設計嵌入捕捉幾何結構的學習潛在空間中，使基於梯度和進化的優化在潛在空間而非原始設計空間中進行。幾何感知損失函數直接從潛在表示測量任務相關的運動學屬性。

Key Takeaway: Latent-space kinematic design optimization discovers novel humanoid morphologies that outperform hand-designed counterparts on target manipulation tasks with significantly reduced engineering effort. 核心發現： 潛在空間運動學設計優化發現新穎的類人型形態，在目標操作任務上優於手工設計的對應物，且大幅減少工程工作量。

Active Reward Machine Inference From Raw State Trajectories

Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay | Submitted: 2026-04-08 | arXiv: 2604.07480 Categories: cs.RO, cs.AI, cs.FL

Research Background: Reward machines capture multi-stage task structure needed for robot policy synthesis, but specifying them by hand is tedious and error-prone. Automating reward machine inference from raw interaction data would lower the barrier to deploying structured RL for robotics. 研究背景： 獎勵機器捕捉機器人策略合成所需的多階段任務結構，但手動指定繁瑣且容易出錯。從原始互動資料自動推斷獎勵機器將降低為機器人部署結構化強化學習的門檻。

Technical Approach: An active inference framework queries the environment with targeted trajectories designed to disambiguate competing reward machine hypotheses. A Bayesian belief over reward machine structures is updated from raw state observations, and the querying policy selects trajectories that maximally reduce uncertainty about the task structure. 技術方法： 主動推斷框架使用針對性軌跡查詢環境，這些軌跡設計用於消除競爭性獎勵機器假設的歧義。從原始狀態觀測更新獎勵機器結構的貝葉斯信念，查詢策略選擇最大限度減少任務結構不確定性的軌跡。
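
The query-selection idea can be sketched with a discrete belief over candidate hypotheses. The code below is a toy under stated assumptions: each hypothesis is a plain function that predicts a deterministic outcome for a query, standing in for a candidate reward machine, and queries are chosen to minimize expected posterior entropy. The function names and the toy hypotheses are hypothetical, and the paper's actual inference procedure may differ.

```python
import math
from typing import Callable, Dict, Hashable, List

Hypothesis = Callable[[str], Hashable]  # maps a query trajectory to a predicted outcome
Belief = Dict[str, float]


def entropy(belief: Belief) -> float:
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def update(belief: Belief, hypotheses: Dict[str, Hypothesis],
           query: str, outcome: Hashable) -> Belief:
    """Bayes update with deterministic predictions: keep consistent hypotheses.

    Assumes the outcome is predicted by at least one hypothesis with nonzero belief.
    """
    kept = {h: p for h, p in belief.items() if hypotheses[h](query) == outcome}
    total = sum(kept.values())
    return {h: p / total for h, p in kept.items()}


def expected_entropy(belief: Belief, hypotheses: Dict[str, Hypothesis], query: str) -> float:
    outcome_prob: Dict[Hashable, float] = {}
    for h, p in belief.items():
        o = hypotheses[h](query)
        outcome_prob[o] = outcome_prob.get(o, 0.0) + p
    return sum(p_o * entropy(update(belief, hypotheses, query, o))
               for o, p_o in outcome_prob.items())


def select_query(belief: Belief, hypotheses: Dict[str, Hypothesis],
                 queries: List[str]) -> str:
    """Pick the query whose answer is expected to leave the least uncertainty."""
    return min(queries, key=lambda q: expected_entropy(belief, hypotheses, q))


hypotheses = {
    "h1": lambda q: {"a": 1, "b": 0}[q],
    "h2": lambda q: {"a": 0, "b": 0}[q],
    "h3": lambda q: {"a": 0, "b": 1}[q],
    "h4": lambda q: {"a": 0, "b": 1}[q],
}
belief = {name: 0.25 for name in hypotheses}
best = select_query(belief, hypotheses, ["a", "b"])
posterior = update(belief, hypotheses, best, 0)
```

Query `a` separates one hypothesis from the other three, while `b` splits them evenly, so `b` is selected as the more informative query.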

Key Takeaway: Active reward machine inference from raw trajectories recovers accurate task structure with far fewer environment interactions than passive methods, enabling efficient structured RL for robot tasks. 核心發現： 從原始軌跡主動推斷獎勵機器，比被動方法以少得多的環境互動恢復準確的任務結構，使機器人任務的高效結構化強化學習成為可能。

Learning-Based Strategy for Composite Robot Assembly Skill Adaptation

Authors: Khalil Abuibaid, Aleksandr Sidorenko, Achim Wagner, et al. | Submitted: 2026-04-08 | arXiv: 2604.06949 Categories: cs.RO

Research Background: Contact-rich robotic assembly skills remain challenging for industrial robots due to tight geometric tolerances, frictional variability, and uncertain contact dynamics. Existing approaches lack reusability across varying assembly variants. 研究背景： 接觸豐富的機器人裝配技能對工業機器人仍具挑戰性，因為幾何公差嚴格、摩擦變異性和接觸動態不確定。現有方法缺乏跨裝配變體的可重用性。

Technical Approach: A skill-based strategy uses Residual Reinforcement Learning (RRL) to adapt a nominal assembly policy to contact variations. An encapsulated skill module provides reusable insertion primitives that are adapted at runtime by a learned residual corrector responding to contact force feedback. 技術方法： 基於技能的策略使用殘差強化學習（RRL）將標稱裝配策略適應接觸變化。封裝技能模組提供可重用的插入基元，由響應接觸力回饋的學習殘差修正器在運行時適應。
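
In residual reinforcement learning the executed command is the nominal controller's output plus a learned correction. The sketch below shows only that composition. The nominal insertion rule, the hand-written proportional "residual" that stands in for the learned corrector, and all gains and limits are assumptions for illustration, not values from the paper.

```python
from typing import List, Sequence


def nominal_insertion(depth_error_m: float) -> List[float]:
    """Hypothetical nominal skill: step down along z in proportion to remaining depth."""
    return [0.0, 0.0, -0.5 * depth_error_m]


def residual_correction(force_xy_n: Sequence[float],
                        gain: float = 0.0005, limit_m: float = 0.001) -> List[float]:
    """Stand-in for the learned residual: yield to lateral contact force, clipped."""
    lateral = [max(-limit_m, min(limit_m, gain * f)) for f in force_xy_n]
    return lateral + [0.0]


def adapted_action(depth_error_m: float, force_xy_n: Sequence[float]) -> List[float]:
    """Residual composition: nominal command plus a bounded correction."""
    base = nominal_insertion(depth_error_m)
    delta = residual_correction(force_xy_n)
    return [b + d for b, d in zip(base, delta)]


action = adapted_action(depth_error_m=0.01, force_xy_n=[4.0, -0.4])
```

Clipping the residual keeps the correction small relative to the nominal motion, a common safeguard that lets the base policy stay fixed while only the residual is trained.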

Key Takeaway: RRL-based skill adaptation achieves high assembly success rates across geometric and frictional variation without retraining the base policy, enabling robust industrial deployment. 核心發現： 基於 RRL 的技能自適應在幾何和摩擦變化中實現高裝配成功率，無需重新訓練基礎策略，實現穩健的工業部署。

A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

Authors: Wenze Wang, Mehdi Hosseinzadeh, Feras Dayoub | Submitted: 2026-04-08 | arXiv: 2604.07395 Categories: cs.RO, cs.AI, cs.CV

Research Background: Robot manipulation systems following language instructions typically execute grasp primitives in a single-shot manner without structured failure monitoring. Empty grasps, slips, timeouts, or semantically wrong grasps are not surfaced to the decision layer, causing silent failures. 研究背景： 遵循語言指令的機器人操作系統通常以單次方式執行抓取基元，沒有結構化的失敗監控。空抓、滑動、超時或語義錯誤的抓取不會上報到決策層，導致靜默失敗。

Technical Approach: Inspired by agentic loops in digital tool-using agents, the system reformulates language-guided grasping as a physical agentic loop. An execution-state monitor classifies grasp outcomes into structured failure types and feeds them back to a language model planner that selects recovery strategies. 技術方法： 受數字工具使用智能體中智能體迴路的啟發，系統將語言引導抓取重新定義為物理智能體迴路。執行狀態監控器將抓取結果分類為結構化失敗類型，並將其反饋給選擇恢復策略的語言模型規劃器。
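
The monitor-then-recover loop can be sketched as follows. The failure categories mirror those named in the research background (empty grasps, slips, timeouts, semantically wrong grasps), but the `Outcome` enum, the fixed `RECOVERY` table that stands in for the language model planner, and the retry limit are hypothetical.

```python
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    SUCCESS = "success"
    EMPTY_GRASP = "empty_grasp"
    SLIP = "slip"
    TIMEOUT = "timeout"
    WRONG_OBJECT = "wrong_object"


# Assumed recovery choices; in the paper a language model planner makes this decision.
RECOVERY = {
    Outcome.EMPTY_GRASP: "re-detect object and regrasp",
    Outcome.SLIP: "regrasp with higher force",
    Outcome.TIMEOUT: "replan approach",
    Outcome.WRONG_OBJECT: "release and re-ground instruction",
}


def agentic_grasp(attempt: Callable[[str], Outcome], max_attempts: int = 3) -> List[str]:
    """Run grasp attempts, feeding each classified failure back into the next strategy."""
    log = []
    strategy = "initial grasp"
    for _ in range(max_attempts):
        outcome = attempt(strategy)
        log.append(f"{strategy} -> {outcome.value}")
        if outcome is Outcome.SUCCESS:
            break
        strategy = RECOVERY[outcome]
    return log


# Scripted monitor outputs replace a real robot and execution-state monitor.
scripted = iter([Outcome.EMPTY_GRASP, Outcome.SLIP, Outcome.SUCCESS])
log = agentic_grasp(lambda strategy: next(scripted))
```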

Key Takeaway: The physical agentic loop substantially improves language-guided grasping success over single-shot baselines by enabling structured recovery from diverse failure modes. 核心發現： 物理智能體迴路通過從多樣化失敗模式進行結構化恢復，大幅提升語言引導抓取成功率，優於單次執行基線。

RichMap: A Reachability Map Balancing Precision, Efficiency, and Flexibility for Rich Robot Manipulation Tasks

Authors: Yupu Lu, Yuxiang Ma, Jia Pan | Submitted: 2026-04-08 | arXiv: 2604.06778 Categories: cs.RO

Research Background: Reachability maps are essential for robot manipulation planning, representing which workspace configurations are physically reachable by the robot arm. Existing maps trade off between precision and memory/computation efficiency, limiting their use in complex multi-step tasks. 研究背景： 可達性地圖對於機器人操作規劃至關重要，表示機器人臂在物理上可以到達哪些工作空間配置。現有地圖在精度和記憶體/計算效率之間權衡，限制了在複雜多步任務中的使用。

Technical Approach: RichMap refines the classic grid-based reachability map structure with theoretical capacity bounds and streamlined spatial indexing, approaching the performance of compact-form maps while keeping the grid's structural flexibility. Task-specific reachability queries are accelerated through a learned resolution-adaptive lookup scheme. 技術方法： RichMap 利用理論容量界限和精簡的空間索引改進經典基於網格的可達性地圖結構，在保持結構靈活性的同時接近緊湊形式的性能。通過學習的分辨率自適應查找方案加速特定任務的可達性查詢。
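
For readers unfamiliar with the baseline being refined, here is a minimal sketch of a classic grid-based reachability map: workspace positions are quantized to voxel indices and stored in a hash set. It does not include RichMap's capacity bounds, indexing improvements, or learned lookup. The class name, the position-only representation (orientation is ignored), and the resolution are assumptions.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


class GridReachabilityMap:
    """Stores reachable end-effector positions as occupied voxel indices."""

    def __init__(self, resolution: float) -> None:
        self.resolution = resolution
        self.cells: Set[Cell] = set()

    def _index(self, p: Point) -> Cell:
        r = self.resolution
        return (math.floor(p[0] / r), math.floor(p[1] / r), math.floor(p[2] / r))

    def add_reachable(self, points: Iterable[Point]) -> None:
        for p in points:
            self.cells.add(self._index(p))

    def is_reachable(self, p: Point) -> bool:
        """Constant-time lookup: is the voxel containing p marked reachable?"""
        return self._index(p) in self.cells


rmap = GridReachabilityMap(resolution=0.05)
rmap.add_reachable([(0.12, 0.01, 0.33)])
near = rmap.is_reachable((0.14, 0.04, 0.31))  # falls in the same 5 cm voxel
far = rmap.is_reachable((0.21, 0.01, 0.33))   # falls in a different voxel
```

The trade-off the entry refers to shows up directly here: a finer `resolution` improves precision but multiplies the number of cells to store.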

Key Takeaway: RichMap provides near-compact-map precision with standard-grid flexibility, enabling efficient reachability computation for complex multi-step manipulation task planning. 核心發現： RichMap 在標準網格靈活性的同時提供接近緊湊地圖的精度，為複雜多步操作任務規劃提供高效的可達性計算。

BiDexGrasp: Coordinated Bimanual Dexterous Grasps across Object Geometries and Sizes

Authors: Mu Lin, Yi-Lin Wei, Jiaxuan Chen, et al. | Submitted: 2026-04-08 | arXiv: 2604.06589 Categories: cs.RO

Research Background: Bimanual dexterous grasping is fundamental for achieving human-level dexterity in robotics, but progress is constrained by lack of comprehensive datasets and powerful generation models that handle the combinatorial complexity of two-hand coordination. 研究背景： 雙臂靈巧抓取對於在機器人中實現人類水準的靈巧性至關重要，但進展受限於缺乏處理雙手協調組合複雜性的全面資料集和強大生成模型。

Technical Approach: BiDexGrasp introduces a large-scale bimanual dexterous grasp dataset and a generation model. A novel synthesis pipeline efficiently annotates physically valid bimanual grasps across diverse object geometries by decomposing hand-object contact into independent per-hand optimization stages coupled by a coordination constraint. 技術方法： BiDexGrasp 引入大規模雙臂靈巧抓取資料集和生成模型。一種新型合成管線通過將手-物體接觸分解為由協調約束耦合的獨立每手優化階段，高效地標注跨多樣物體幾何形狀的物理上有效的雙臂抓取。

Key Takeaway: The BiDexGrasp dataset and generation model advance the state of the art in bimanual dexterous grasping, enabling data-driven policies for coordinated two-handed object manipulation. 核心發現： BiDexGrasp 資料集和生成模型推進了雙臂靈巧抓取的最佳水準，為協調雙手物體操作的資料驅動策略提供支持。

Learning-Guided Force-Feedback Model Predictive Control with Obstacle Avoidance for Robotic Deburring

Authors: Krzysztof Wojciechowski, Ege Gursoy, Arthur Haffemayer, et al. | Submitted: 2026-04-07 | arXiv: 2604.06133 Categories: cs.RO

Research Background: Model Predictive Control is widely used for torque-controlled robots but classical formulations neglect real-time force feedback and struggle with contact-rich industrial tasks like deburring under collision constraints. 研究背景： 模型預測控制廣泛用於力矩控制機器人，但經典公式忽略即時力回饋，並在碰撞約束下的去毛刺等接觸豐富的工業任務中掙扎。

Technical Approach: A learning-guided MPC framework integrates force-feedback into the MPC cost function through a learned force model that predicts contact forces from robot state. Collision avoidance constraints are incorporated as soft barrier functions, allowing the controller to balance force tracking and obstacle avoidance simultaneously. 技術方法： 學習引導的 MPC 框架通過學習的力模型（從機器人狀態預測接觸力）將力回饋整合到 MPC 成本函數中。碰撞避免約束作為軟障礙函數納入，允許控制器同時平衡力追蹤和障礙物避免。
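
One way to read "soft barrier" is as a penalty that is zero while the robot stays outside a safety margin and grows as the margin is violated. The sketch below combines such a penalty with a force-tracking term in a single stage cost. The quadratic hinge form, the weights, and the margin are assumptions; the digest does not give the paper's cost function.

```python
def soft_barrier(distance_m: float, d_safe_m: float, weight: float) -> float:
    """Quadratic penalty that is zero outside the safety margin."""
    violation = max(0.0, d_safe_m - distance_m)
    return weight * violation ** 2


def stage_cost(f_pred_n: float, f_ref_n: float, distance_m: float,
               d_safe_m: float = 0.05, w_force: float = 1.0,
               w_barrier: float = 1e4) -> float:
    """Force-tracking error plus obstacle penalty for one MPC time step.

    f_pred_n would come from the learned force model; here it is passed in directly.
    """
    tracking = w_force * (f_pred_n - f_ref_n) ** 2
    return tracking + soft_barrier(distance_m, d_safe_m, w_barrier)


clear_cost = stage_cost(f_pred_n=12.0, f_ref_n=10.0, distance_m=0.10)
close_cost = stage_cost(f_pred_n=10.0, f_ref_n=10.0, distance_m=0.03)
```

Because the obstacle term is a penalty rather than a hard constraint, the optimizer can trade a small margin violation against a large force-tracking error, which matches the entry's description of balancing the two objectives.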

Key Takeaway: Learning-guided force-feedback MPC achieves robust deburring performance with collision avoidance, outperforming pure MPC and pure learning-based approaches on industrial deburring benchmarks. 核心發現： 學習引導的力回饋 MPC 在碰撞避免的情況下實現穩健的去毛刺性能，在工業去毛刺基準上優於純 MPC 和純學習方法。

You’re Pushing My Buttons: Instrumented Learning of Gentle Button Presses

Authors: Raman Talwar, Remko Proesmans, Thomas Lips, et al. | Submitted: 2026-04-07 | arXiv: 2604.05954 Categories: cs.RO

Research Background: Contact-rich manipulation is difficult to learn from cameras and proprioception alone because contact events are only partially observable. Button pressing is a useful testbed requiring gentle force application at precise locations, with contact cues unobservable from standard sensors. 研究背景： 接觸豐富的操作難以僅從相機和本體感知學習，因為接觸事件只能部分觀測。按鈕按壓是一個有用的測試台，需要在精確位置施加溫和的力，且接觸提示無法從標準感測器觀測。

Technical Approach: Training-time object instrumentation adds a microphone fingertip to the robot gripper to capture contact-relevant acoustic signals. The instrumented policy learns to detect and respond to acoustic contact cues, but microphone sensing is removed at deployment time, testing whether contact knowledge transfers to the uninstrumented policy. 技術方法： 訓練時物體儀器化為機器人夾具增加麥克風指尖以捕捉接觸相關的聲學信號。儀器化策略學習偵測和響應聲學接觸提示，但在部署時移除麥克風感測，測試接觸知識是否遷移到非儀器化策略。

Key Takeaway: Training-time acoustic instrumentation significantly improves policy learning for gentle contact tasks even when instrumentation is removed at test time, demonstrating the value of richer contact sensing during demonstrations. 核心發現： 訓練時聲學儀器化顯著改善溫和接觸任務的策略學習，即使在測試時移除儀器化，展示了示範期間更豐富接觸感測的價值。

BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

Authors: Xingyu Peng, Chen Gao, Liankai Jin, et al. | Submitted: 2026-04-07 | arXiv: 2604.05831 Categories: cs.RO

Research Background: Bimanual manipulation is essential for human-level dexterity but existing simulation benchmarks feature short-horizon and loosely-coordinated tasks that fail to capture the tight spatial-temporal coupling inherent in real bimanual tasks like folding cloth or assembling objects. 研究背景： 雙臂操作對人類水準的靈巧性至關重要，但現有模擬基準的任務短期且協調鬆散，未能捕捉折疊布料或裝配物體等真實雙臂任務中固有的緊密時空耦合。

Technical Approach: BiCoord is a simulation benchmark with long-horizon bimanual tasks requiring precise spatial-temporal arm coordination. Tasks are designed with explicit coupling constraints between arms — one arm must complete a state before the other can proceed — evaluating policies on coordination success beyond individual arm control. 技術方法： BiCoord 是包含需要精確時空手臂協調的長期雙臂任務的模擬基準。任務設計有手臂之間的明確耦合約束（一隻手臂必須完成某個狀態才能讓另一隻繼續），超越個別手臂控制評估策略的協調成功。
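
A coupling constraint of the form "one arm must complete a stage before the other proceeds" can be checked on a recorded timeline. The sketch below assumes each stage is logged as a start and end time and that constraints are given as ordered pairs. The function name, the stage labels, and the logging format are hypothetical, not BiCoord's evaluation code.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) of one arm's stage


def coupling_satisfied(stages: Dict[str, Interval],
                       precedences: List[Tuple[str, str]]) -> bool:
    """True if every `first` stage ends no later than its `then` stage starts."""
    for first, then in precedences:
        if first not in stages or then not in stages:
            return False  # a missing stage counts as a coordination failure
        if stages[first][1] > stages[then][0]:
            return False
    return True


constraints = [("left:hold_box", "right:open_lid")]
coordinated = {"left:hold_box": (0.0, 2.0), "right:open_lid": (2.5, 4.0)}
premature = {"left:hold_box": (0.0, 2.0), "right:open_lid": (1.5, 4.0)}
ok = coupling_satisfied(coordinated, constraints)
bad = coupling_satisfied(premature, constraints)
```

In the second timeline the right arm starts opening the lid before the left arm has finished securing the box, so the episode would count as a coordination failure even if each arm's own motion succeeded.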

Key Takeaway: State-of-the-art bimanual policies fail significantly on BiCoord’s long-horizon coordination tasks, exposing a critical gap between current methods and human-level bimanual dexterity. 核心發現： 最先進的雙臂策略在 BiCoord 的長期協調任務上顯著失敗，揭示了當前方法與人類水準雙臂靈巧性之間的關鍵差距。

A Benchmark of Dexterity for Anthropomorphic Robotic Hands

Authors: Davide Liconti, Yuning Zhou, Yasunori Toshimitsu, Ronan Hinchet, Robert K. Katzschmann | Submitted: 2026-04-10 | arXiv: 2604.09294 Categories: cs.RO

Research Background: Dexterity in robotic hand design lacks a consistent, quantitative definition — existing metrics focus on kinematic properties rather than actual task performance, making it difficult to compare designs or track progress toward human-level manipulation capability.

Technical Approach: POMDAR introduces a standardized dexterity benchmark built on human motor control taxonomies. It comprises 18 tasks across four manipulation categories (vertical, horizontal, continuous-rotation, and grasping). Mechanical constraints prevent compensatory strategies, ensuring tasks isolate the intended motions. Performance is scored as throughput combining task correctness (80%) and execution speed relative to human baselines (20%). Open-source CAD files, simulation assets, and evaluation videos are provided.
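
As a worked example of the 80/20 weighting, the sketch below blends a correctness fraction with a speed ratio against a human baseline. Only the two weights come from the summary above. The linear blend, the cap on the speed ratio at 1.0, and the function name are assumptions, so the benchmark's real scoring formula may differ.

```python
def throughput_score(correctness: float, robot_time_s: float, human_time_s: float,
                     w_correct: float = 0.8, w_speed: float = 0.2) -> float:
    """Blend task correctness (0..1) with speed relative to a human baseline."""
    if robot_time_s <= 0 or human_time_s <= 0:
        raise ValueError("times must be positive")
    # Assumed: matching or beating the human time earns the full speed credit.
    speed = min(1.0, human_time_s / robot_time_s)
    return w_correct * correctness + w_speed * speed


# A hand that is 90% correct but four times slower than the human baseline.
score = throughput_score(correctness=0.9, robot_time_s=20.0, human_time_s=5.0)
```

Under these assumptions the example works out to 0.8 × 0.9 + 0.2 × 0.25 = 0.77, which shows how the weighting rewards getting the task right far more than doing it quickly.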

Key Takeaway: By grounding dexterity in standardized task throughput rather than kinematics, POMDAR enables consistent cross-design comparisons and provides a concrete target for dexterous manipulation research.

研究背景： 機器人手靈巧性的定義長期缺乏一致且量化的標準——現有指標多聚焦於運動學特性而非實際任務表現，導致不同設計間難以比較，也難以追蹤朝人類水準操作能力的進展。

技術方法： POMDAR 基於人類運動控制分類學，建立標準化靈巧性基準，包含 18 個任務，分為四類操作配置（垂直、水平、連續旋轉與純抓取）。機械限制防止補償性動作，確保每項任務精準隔離目標運動。評分以吞吐量衡量，結合任務正確性（80%）與相對人類基準的執行速度（20%）。提供開源 CAD 檔案、模擬素材與評估影片。

核心發現： 以標準化任務吞吐量取代運動學指標作為靈巧性基準，POMDAR 使跨設計比較成為可能，並為靈巧操作研究提供具體的進步目標。
