arXiv Weekly Digest — Week 14, 2026
Fetched: 2026-04-03 | Categories: cs.RO, cs.LG, cs.HC, cs.CV | Papers: 26
Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
Authors: Xueying Li, Feng Lyu, Hao Wu, et al. | Submitted: 2026-04-02 | arXiv: 2604.02318
Categories: cs.RO, cs.CV
Research Background: Training-free VLN agents powered by foundation models tend to get stuck in local oscillations and redundant revisiting because they lack self-monitoring: they have no mechanism to detect when their exploration strategy is failing.
研究背景: 以基礎模型驅動的免訓練 VLN agent 常陷入局部震盪與重複探索,根本原因是缺乏自我監控能力,無法偵測當前探索策略是否已失效。
Technical Approach: MetaNav introduces metacognitive reasoning via three components: a persistent 3D semantic map for spatial memory, history-aware planning that penalizes revisited frontiers, and reflective correction that uses an LLM to generate new navigation rules when stagnation is detected.
技術方法: MetaNav 透過三個模組實現元認知推理:持久化 3D 語意地圖(空間記憶)、懲罰重訪 frontier 的歷史感知規劃,以及在停滯時呼叫 LLM 產生修正規則的反思校正機制。
Key Takeaway: MetaNav achieves state-of-the-art results on GOAT-Bench, HM3D-OVON, and A-EQA while reducing VLM query count by 20.7%, indicating that metacognitive reasoning can improve efficiency and robustness together.
核心發現: MetaNav 在 GOAT-Bench、HM3D-OVON、A-EQA 達到 SOTA,同時減少 20.7% 的 VLM 查詢次數,證明元認知推理可同時提升效率與魯棒性。
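The history-aware planning idea can be illustrated with a small sketch. It is not MetaNav's actual scoring rule: the function name `pick_frontier`, the linear penalty form, and the `revisit_penalty` weight are all assumptions made for illustration.

```python
from collections import Counter


def pick_frontier(utilities, visit_counts, revisit_penalty=0.5):
    """Return the frontier id with the best penalized score.

    utilities: dict mapping frontier id -> base utility (higher is better)
    visit_counts: Counter of how often each frontier was already visited
    revisit_penalty: hypothetical weight; each prior visit subtracts this much
    """
    def score(frontier_id):
        return utilities[frontier_id] - revisit_penalty * visit_counts[frontier_id]

    return max(utilities, key=score)


utilities = {"hallway": 0.9, "kitchen": 0.7, "bedroom": 0.6}
visits = Counter({"hallway": 2})  # the agent has already been here twice
best = pick_frontier(utilities, visits)
```

With these toy numbers the hallway has the highest raw utility, but two earlier visits push its penalized score below the kitchen's, so the agent is steered away from the place it keeps returning to.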
Deep Neural Network Based Roadwork Detection for Autonomous Driving
Authors: Sebastian Wullrich, Nicolai Steinke, Daniel Goehring | Submitted: 2026-04-02 | arXiv: 2604.02282
Categories: cs.RO, cs.CV
Research Background: Road construction sites are among the most dangerous and unpredictable scenarios for autonomous vehicles due to their heterogeneous signage, temporary barriers, and constantly changing layouts.
研究背景: 道路施工現場是自駕車最難處理的情境之一,臨時標誌、護欄與持續變動的佈局使感知與定位面臨極大挑戰。
Technical Approach: The system fuses a YOLO-based detector with LiDAR depth data to identify individual roadwork objects in real time, merge them into coherent construction zones, and record their outlines in world coordinates. Training combined a public US dataset with new data collected during prototype drives in Berlin.
技術方法: 系統將 YOLO 偵測器與 LiDAR 深度資料融合,即時識別施工物件、合併為連貫施工區,並以世界座標記錄輪廓。訓練資料結合美國公開資料集與柏林原型車實際行駛資料。
Key Takeaway: In evaluation on real-world construction sites, the system localizes roadwork with an error below 0.5 m, which could help traffic authorities track road conditions in real time.
核心發現: 在真實施工現場評估中,系統定位精度低於 0.5 公尺,可支援交通主管機關即時掌握路況。
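One plausible way to merge individual detections into construction zones is distance-based grouping. The sketch below is an assumption, not the authors' procedure: it uses single-linkage clustering with union-find on 2D world coordinates, and the `max_gap` threshold of 5 m is a hypothetical value.

```python
import math


def merge_into_zones(points, max_gap=5.0):
    """Group 2D object positions (in metres) into zones.

    Two objects share a zone if a chain of neighbours, each at most
    max_gap apart, connects them. max_gap is a hypothetical threshold.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_gap:
                parent[find(i)] = find(j)  # union the two groups

    zones = {}
    for i, point in enumerate(points):
        zones.setdefault(find(i), []).append(point)
    return list(zones.values())


cones = [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0), (50.0, 50.0)]
zones = merge_into_zones(cones)
```

The first three cones form one zone even though the outer two are 6 m apart, because the middle cone links them. The distant object stays in a zone of its own.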
Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
Authors: Klemens Iten, Bruce Lee, Chenhao Li, et al. | Submitted: 2026-04-02 | arXiv: 2604.02260
Categories: cs.LG, cs.RO
Research Background: Most learning-based controllers assume stationary system dynamics, but real-world systems exhibit drift, wear, and changing operating conditions that violate this assumption, limiting deployment safety.
研究背景: 大多數基於學習的控制器假設系統動態是穩態的,但真實系統存在漂移、磨損與工況變化,這一假設在實際部署中常遭違反。
Technical Approach: The authors analyze continual model-based RL with Gaussian process dynamics under a frequentist variation-budget framework, showing that non-stationarity requires explicitly discarding outdated data. They propose an optimistic MBRL algorithm with adaptive data buffer mechanisms to maintain calibrated uncertainty.
技術方法: 作者在頻率主義變化預算框架下分析了以高斯過程動態為基礎的持續 MBRL,證明非穩態環境需主動捨棄過時資料。他們提出具自適應資料緩衝區的樂觀 MBRL 演算法以維持校準不確定性。
Key Takeaway: Adaptive data buffering is theoretically necessary and empirically sufficient to achieve meaningful dynamic regret guarantees under non-stationary dynamics.
核心發現: 自適應資料緩衝在理論上是必要的,在實驗中也能有效保證非穩態動態下的動態遺憾上界。
A virtual-variable-length method for robust inverse kinematics of multi-segment continuum robots
Authors: Weiting Feng, Federico Renda, Yunjie Yang, Francesco Giorgio-Serchi | Submitted: 2026-04-02 | arXiv: 2604.02256
Categories: cs.RO, eess.SY, math.NA
Research Background: Inverse kinematics (IK) solvers for continuum manipulators frequently fail to converge (deadlock) when initialized from neutral configurations, especially near workspace boundaries, limiting their practical reliability.
研究背景: 連續體機械臂的反向運動學求解器在從中性構型初始化時常發生收斂失敗(死鎖),尤其在工作空間邊界附近,嚴重影響實用可靠性。
Technical Approach: The Virtual-Variable-Length (VVL) method introduces fictitious segment-length variations during iteration, granting virtual axial degrees of freedom that break deadlocks. VVL is benchmarked against Jacobian and Damped Least Squares solvers in over 1.8 million randomized trials on manipulators with two to seven segments.
技術方法: VVL 方法在迭代過程中引入虛擬段長變化,賦予虛擬軸向自由度以打破死鎖。透過超過 180 萬次隨機試驗,對比 Jacobian 與 Damped Least Squares 求解器進行基準測試。
Key Takeaway: VVL achieves up to 20% higher convergence success rate and 40–80% fewer iterations than standard solvers at equivalent accuracy thresholds.
核心發現: VVL 相較標準求解器收斂成功率提升達 20%,在等精度要求下迭代次數減少 40–80%。
UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Authors: Qiyao Zhang, Shuhua Zheng, Jianli Sun, et al. | Submitted: 2026-04-02 | arXiv: 2604.02241
Categories: cs.CV, cs.RO
Research Background: UAV visual tracking in dynamic urban environments requires understanding complex semantic context beyond pure appearance matching, motivating the use of VLA models for embodied aerial control.
研究背景: 無人機在動態城市環境中的視覺追蹤需要超越純外觀匹配的語意理解,這促使研究者將 VLA 模型引入具身空中控制領域。
Technical Approach: UAV-Track VLA builds on the π₀.₅ architecture, adding a temporal compression network for inter-frame dynamics and a parallel dual-branch decoder with a spatial-aware grounding head and a flow-matching action expert. The authors also construct a benchmark with 890K+ frames, 176 tasks, and 85 diverse objects.
技術方法: UAV-Track VLA 以 π₀.₅ 為基礎,加入時序壓縮網路捕捉幀間動態,以及包含空間感知輔助定位頭與流匹配動作專家的雙分支解碼器。作者亦建構含 89 萬幀、176 任務、85 種目標的基準資料集。
Key Takeaway: UAV-Track VLA achieves 61.76% success on long-distance pedestrian tracking while cutting single-step inference latency by 33.4% (to 0.057 s) versus the base π₀.₅.
核心發現: 在長距離行人追蹤任務中成功率達 61.76%,單步推理延遲較原始 π₀.₅ 降低 33.4%(達 0.057 秒)。
ActionParty: Multi-Subject Action Binding in Generative Video Games
Authors: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, et al. | Submitted: 2026-04-02 | arXiv: 2604.02330
Categories: cs.CV, cs.AI, cs.LG
Research Background: Video diffusion world models are largely single-agent, unable to simultaneously control multiple characters — a key limitation for generative multi-player game environments.
研究背景: 影片擴散世界模型基本上限於單 agent,無法同時控制多個角色,這是生成式多人遊戲環境的核心瓶頸。
Technical Approach: ActionParty introduces subject state tokens — persistent latent variables tracking each subject — and a spatial biasing mechanism that jointly models these tokens with video latents, decoupling global scene rendering from per-subject action updates.
技術方法: ActionParty 引入主體狀態 token(持續追蹤每個主體的潛在變數),以及空間偏置機制,將這些 token 與影片潛在表示聯合建模,將全域場景渲染與個別主體動作更新解耦。
Key Takeaway: ActionParty is the first video world model capable of controlling up to seven players simultaneously across 46 environments, with significant gains in action-following accuracy and identity consistency.
核心發現: ActionParty 是首個可在 46 種環境中同時控制最多 7 名玩家的影片世界模型,動作跟隨精度與身份一致性均大幅提升。
Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, et al. | Submitted: 2026-04-02 | arXiv: 2604.02324
Categories: cs.CL, cs.AI, cs.LG
Research Background: Extending language models with new vocabulary tokens (e.g., Semantic-ID tokens for recommendation) is standard practice, but mean initialization collapses all new tokens into a degenerate subspace that fine-tuning struggles to recover from.
研究背景: 以新詞彙 token(如推薦系統的 Semantic-ID)擴展語言模型是常見做法,但均值初始化會使所有新 token 塌縮至退化子空間,後續微調難以恢復辨別性。
Technical Approach: Grounded Token Initialization (GTI) adds a lightweight grounding stage before fine-tuning that maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision, without modifying the model architecture.
技術方法: GTI 在微調前加入輕量錨定階段,僅使用語言配對監督將新 token 映射至預訓練嵌入空間中語意明確且分散的位置,無需修改模型架構。
Key Takeaway: GTI outperforms mean initialization and existing auxiliary-task methods on multiple generative recommendation benchmarks, including industry-scale datasets, which points to initialization quality as a key bottleneck in vocabulary extension.
核心發現: GTI 在多個生成式推薦基準(含工業規模資料集)上優於均值初始化與現有輔助任務方法,且初始化品質是詞彙擴展的關鍵瓶頸。
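The collapse that motivates GTI is easy to see in a toy example. This sketch shows only the mean-initialization baseline, not GTI itself; the helper name `mean_initialize` and the two-dimensional embeddings are made up for illustration.

```python
def mean_initialize(embeddings, num_new):
    """Give every new token the mean of the existing embedding rows."""
    dim = len(embeddings[0])
    mean = [sum(row[k] for row in embeddings) / len(embeddings) for k in range(dim)]
    # Each new token receives its own copy of the same vector.
    return [list(mean) for _ in range(num_new)]


pretrained = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy pretrained embeddings
new_rows = mean_initialize(pretrained, num_new=3)
all_identical = all(row == new_rows[0] for row in new_rows)
```

Every new token starts at exactly the same point, so nothing in the embedding table distinguishes one Semantic-ID token from another until fine-tuning pulls them apart.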
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
Authors: Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu | Submitted: 2026-04-02 | arXiv: 2604.02322
Categories: cs.LG, cs.AI, cs.CL
Research Background: Chain-of-thought reasoning inflates inference costs, and existing efficiency methods (length penalties, difficulty estimators) either degrade quality or require complex multi-stage training pipelines.
研究背景: Chain-of-thought 推理導致推理成本膨脹,現有效率改善方法(長度懲罰、難度估計)要麼降低品質,要麼需要複雜的多階段訓練流程。
Technical Approach: Batched Contextual Reinforcement (BCR) trains the model to solve N problems simultaneously in a shared context window, rewarded purely by per-instance accuracy. This creates an implicit token budget without explicit length supervision, revealing a task-scaling law: per-problem token usage decreases monotonically as N increases.
技術方法: BCR 訓練模型在共享上下文視窗中同時解決 N 個問題,純以每題準確率為獎勵。這以無需顯式長度監督的方式建立隱式 token 預算,揭示了任務縮放定律:隨 N 增加,每題 token 用量單調遞減。
Key Takeaway: BCR reduces token usage by 15.8–62.6% on 1.5B–4B models while maintaining or improving accuracy on five math benchmarks, without the instability of explicit length penalties.
核心發現: BCR 在 1.5B–4B 模型上減少 15.8–62.6% 的 token 用量,同時在五個數學基準上維持或提升準確率,且無顯式長度懲罰的優化不穩定問題。
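A minimal sketch of the reward side of BCR, under stated assumptions: the digest says only that the reward is per-instance accuracy, so averaging correctness into one scalar, exact-match grading, and the function names are all illustrative choices.

```python
def batch_reward(predictions, answers):
    """Fraction of the N batched problems answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("need exactly one prediction per problem")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def tokens_per_problem(total_tokens, n_problems):
    """Average number of tokens spent on each problem in a shared context."""
    return total_tokens / n_problems


reward = batch_reward(["4", "9", "x"], ["4", "9", "7"])
per_problem = tokens_per_problem(total_tokens=1200, n_problems=4)
```

Nothing in the reward mentions length. The pressure to be concise comes from the shared window: if the context holds a fixed number of tokens, packing in more problems leaves fewer tokens for each one.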
Topological Effects in Neural Network Field Theory
Authors: Christian Ferko, James Halverson, Vishnu Jejjala, Brandon Robinson | Submitted: 2026-04-02 | arXiv: 2604.02313
Categories: hep-th, cs.LG
Research Background: Neural network field theory treats trained networks as statistical field ensembles, but the framework has not been extended to topologically non-trivial settings where discrete topological quantum numbers matter.
研究背景: 神經網路場論將訓練後的網路視為統計場系綜,但此框架尚未延伸至離散拓樸量子數扮演重要角色的拓樸非平凡設定。
Technical Approach: The authors extend the construction by including discrete parameters labeling topological quantum numbers, then recover the Berezinskii-Kosterlitz-Thouless transition, spin-wave critical line, and vortex proliferation. T-duality of the bosonic string is also verified, including Buscher-rule transformations on toroidal backgrounds.
技術方法: 作者透過引入標記拓樸量子數的離散參數擴展框架,復現了 BKT 相變、自旋波臨界線與渦旋增殖,並驗證了玻色弦的 T 對偶性,包含環面背景下的 Buscher 規則變換。
Key Takeaway: Neural network field theory can faithfully encode topological phase structure and string duality, opening connections between deep learning theory and high-energy physics.
核心發現: 神經網路場論能忠實編碼拓樸相結構與弦對偶性,為深度學習理論與高能物理建立了新的連結。
go-mHC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
Authors: Torque Dandachi, Sophia Diggs-Galligan | Submitted: 2026-04-02 | arXiv: 2604.02309
Categories: cs.LG, cs.CL
Research Background: Doubly stochastic matrices enable learned mixing across residual streams in transformer architectures, but exact parameterization of the full Birkhoff polytope has remained either factorial in cost or expressivity-limited.
研究背景: 雙隨機矩陣能讓 Transformer 在殘差流之間進行學習混合,但完整 Birkhoff 多面體的精確參數化至今要麼成本呈階乘增長,要麼表達力受限。
Technical Approach: go-mHC introduces exact O(d³) parameterization via generalized orthostochastic matrices with a single hyperparameter s that interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope, composing naturally with Kronecker-factorized methods.
技術方法: go-mHC 透過廣義正交隨機矩陣引入 O(d³) 精確參數化,以單一超參數 s 在高效邊界與完整 Birkhoff 多面體之間連續插值,並可與 Kronecker 分解方法自然組合。
Key Takeaway: go-mHC reaches the minimum theoretical loss on stream-mixing tasks while converging up to 10× faster, and it recovers much of the expressivity that Kronecker baselines lose at similar FLOP cost.
核心發現: go-mHC 在流混合任務上達到理論最小損失,收斂速度最高快 10 倍,以相近 FLOP 代價大幅恢復 Kronecker 基準的表達力損失。
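For background, a classical orthostochastic matrix is obtained by squaring each entry of an orthogonal matrix, and the result is always doubly stochastic. The sketch below checks this for a 2x2 rotation. It shows only the classical construction: the paper's generalized version and its hyperparameter s are not reproduced here, and the function name is hypothetical.

```python
import math


def orthostochastic_2x2(theta):
    """Square each entry of the 2x2 rotation matrix for angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    rotation = [[c, -s], [s, c]]  # orthogonal by construction
    return [[x * x for x in row] for row in rotation]


B = orthostochastic_2x2(0.3)
row_sums = [sum(row) for row in B]
col_sums = [sum(col) for col in zip(*B)]
```

Because the rows and columns of an orthogonal matrix have unit norm, the squared entries in every row and column sum to 1 (up to floating-point error), which is the doubly stochastic property needed for mixing residual streams.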
ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
Authors: Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara | Submitted: 2026-04-02 | arXiv: 2604.02182
Categories: cs.CV, cs.HC
Research Background: Vision Transformers are widely used but poorly understood — existing interpretability tools focus on isolated components or expert-level analysis, leaving a gap for guided, end-to-end educational explanation.
研究背景: Vision Transformer 廣泛應用但鮮少被深入理解——現有可解釋性工具聚焦於局部元件或專家分析,缺乏引導式的端到端教育解說工具。
Technical Approach: ViT-Explainer is an interactive visualization system that walks users through the complete ViT inference pipeline — from patch tokenization through multi-head attention to classification — using linked animated views, step-by-step narration, and user-uploaded images to ground explanations in concrete examples.
技術方法: ViT-Explainer 是互動式視覺化系統,引導使用者走過完整 ViT 推論流程——從 patch 切分、多頭注意力到分類——透過連結動畫視圖、逐步旁白與用戶上傳圖片提供具體情境解說。
Key Takeaway: ViT-Explainer demonstrates that guided interactive walkthroughs can substantially improve conceptual understanding of complex transformer architectures in non-expert users.
核心發現: ViT-Explainer 表明引導式互動演示可顯著提升非專家用戶對複雜 Transformer 架構的概念理解。
EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
Authors: Luca Bartolomei, Fabio Tosi, Matteo Poggi, et al. | Submitted: 2026-04-02 | arXiv: 2604.02331
Categories: cs.CV
Research Background: Training deep event-based stereo networks requires ground-truth annotations from costly active sensors (LiDAR, structured light), creating a data bottleneck that limits generalization to new environments.
研究背景: 訓練深度事件立體網路需要來自昂貴主動感測器(LiDAR、結構光)的真實標註,形成資料瓶頸,限制了對新環境的泛化能力。
Technical Approach: EventHub uses state-of-the-art novel view synthesis to derive proxy depth annotations and proxy events from standard RGB images, eliminating the need for active sensors. The generated training set is used to repurpose existing RGB stereo models for event-based inputs.
技術方法: EventHub 利用最先進的新視角合成技術從標準 RGB 影像推導代理深度標註與代理事件,消除對主動感測器的依賴,並以生成的訓練集將現有 RGB 立體模型遷移至事件輸入。
Key Takeaway: EventHub enables training competitive event-based stereo networks without any active sensor data, demonstrating strong cross-domain generalization.
核心發現: EventHub 無需任何主動感測器資料即可訓練出具競爭力的事件立體網路,並展現強跨域泛化能力。
Generative World Renderer
Authors: Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, et al. | Submitted: 2026-04-02 | arXiv: 2604.02329
Categories: cs.CV
Research Background: Scaling generative inverse and forward rendering to real-world scenarios is limited by the low realism and temporal incoherence of existing synthetic datasets, creating a persistent domain gap.
研究背景: 生成式逆向與正向渲染擴展至真實場景受限於現有合成資料集的低真實感與時序不連貫性,形成持續性的域差距。
Technical Approach: The authors curate a large-scale dynamic dataset from AAA games using a dual-screen stitched capture method, extracting 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes. This data trains a generative world renderer for realistic and temporally coherent novel view synthesis.
技術方法: 作者以雙螢幕拼接擷取方式從 AAA 遊戲中整理大規模動態資料集,提取 400 萬幀(720p/30 FPS)的同步 RGB 與五個 G-buffer 通道。此資料訓練生成式世界渲染器以實現真實且時序連貫的新視角合成。
Key Takeaway: AAA-game-sourced G-buffer datasets bridge the realism gap for generative rendering, enabling temporally coherent scene synthesis at scale.
核心發現: 來自 AAA 遊戲的 G-buffer 資料集有效彌補了生成式渲染的真實感差距,實現了大規模時序連貫的場景合成。
Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection
Authors: Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano | Submitted: 2026-04-02 | arXiv: 2604.02328
Categories: cs.CV
Research Background: 3D anomaly detection requires fusing RGB and depth/point-cloud data across multiple viewpoints, but most methods process views independently, missing cross-view geometric relationships.
研究背景: 3D 異常偵測需要跨多視角融合 RGB 與深度/點雲資料,但大多數方法獨立處理各視角,遺漏了跨視角幾何關係。
Technical Approach: ModMap introduces a natively multiview and multimodal framework that learns to map features across modalities and views simultaneously. A cross-view feature-wise modulation mechanism explicitly models view-dependent relationships, trained with a cross-view strategy covering all view-pair combinations.
技術方法: ModMap 引入原生多視角多模態框架,同時學習跨模態與跨視角的特徵映射。跨視角特徵調制機制明確建模視角依賴關係,以涵蓋所有視角對組合的交叉視角策略訓練。
Key Takeaway: ModMap achieves state-of-the-art 3D anomaly detection and segmentation by jointly exploiting cross-modal and cross-view feature relationships.
核心發現: ModMap 透過同時利用跨模態與跨視角特徵關係,在 3D 異常偵測與分割上達到最先進效能。
Steerable Visual Representations
Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan, et al. | Submitted: 2026-04-02 | arXiv: 2604.02327
Categories: cs.CV, cs.AI
Research Background: Pretrained ViTs (e.g., DINOv2, MAE) produce generic features that gravitate toward salient cues, with no mechanism to direct attention toward less prominent but task-relevant concepts without retraining.
研究背景: 預訓練 ViT(如 DINOv2、MAE)產生傾向顯著線索的通用特徵,在不重新訓練的情況下無法引導其關注不顯著但任務相關的概念。
Technical Approach: Steerable Visual Representations (SVR) introduces a lightweight steering mechanism that conditions frozen ViT features on textual or visual prompts at inference time, without modifying the backbone. The method uses a small adapter that modulates feature maps toward the specified concept.
技術方法: SVR 引入輕量引導機制,在推論時以文字或視覺提示調制凍結的 ViT 特徵,無需修改主幹網路,使用小型 adapter 將特徵圖導向指定概念。
Key Takeaway: SVR enables prompt-guided feature steering in frozen ViTs, improving downstream performance on tasks requiring attention to non-salient visual concepts.
核心發現: SVR 使凍結 ViT 能進行提示引導的特徵轉向,在需要關注非顯著視覺概念的下游任務中提升了效能。
Beyond Referring Expressions: Scenario Comprehension Visual Grounding
Authors: Ruozhen He, Nisarg A. Shah, Qihua Dong, et al. | Submitted: 2026-04-02 | arXiv: 2604.02323
Categories: cs.CV
Research Background: Existing visual grounding benchmarks test alignment between image regions and literal referring expressions, allowing models to succeed by category matching alone, which does not capture true scene comprehension.
研究背景: 現有視覺定位基準測試圖像區域與文字指稱表達的對齊,使模型可憑類別匹配通過,無法反映真正的場景理解能力。
Technical Approach: The authors introduce Referring Scenario Comprehension (RSC), a new benchmark where targets must be inferred from roles, intentions, and relational context rather than explicit naming. Models are evaluated on whether they can identify correct regions given underspecified, scenario-based queries.
技術方法: 作者提出指稱場景理解(RSC)新基準,目標必須從角色、意圖與關係脈絡推斷,而非顯式命名。評估模型在場景式模糊查詢下能否識別正確區域。
Key Takeaway: Current visual grounding models show substantial performance drops on RSC versus standard referring expression benchmarks, revealing a key gap between region-text alignment and genuine scene comprehension.
核心發現: 現有視覺定位模型在 RSC 上相較標準基準出現顯著效能下降,揭示了區域-文字對齊與真正場景理解之間的核心差距。
The following is the supplementary batch from 2026-04-04 (second batch)
UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Authors: Yongkang Li, Lijun Zhou, Sixu Yan, et al. | Submitted: 2026-04-02 | arXiv: 2604.02190
Categories: cs.CV, cs.RO
Research Background: Existing VLA models for autonomous driving face a fundamental tension: adopting 2D vision-language models limits spatial perception, while adding 3D representations degrades native semantic reasoning — because both objectives compete within shared parameters.
研究背景: 現有自駕 VLA 模型面臨根本矛盾:使用 2D VLM 限制空間感知,加入 3D 表示又損害語意推理能力,因為兩者共用參數導致優化目標衝突。
Technical Approach: UniDriveVLA uses a Mixture-of-Transformers architecture with three decoupled experts — understanding, scene perception, and action planning — coordinated via masked joint attention. A sparse perception paradigm and three-stage progressive training improve 3D perception while preserving VLM reasoning.
技術方法: UniDriveVLA 採用 Mixture-of-Transformers 架構,設計三個解耦專家(理解、場景感知、動作規劃),透過遮蔽聯合注意力協調。稀疏感知範式與三階段漸進訓練在提升 3D 感知能力的同時保留 VLM 推理能力。
Key Takeaway: Expert decoupling resolves the perception-reasoning conflict in driving VLAs, achieving SOTA on nuScenes open-loop and Bench2Drive closed-loop evaluations across detection, mapping, motion forecasting, and driving VQA.
核心發現: 專家解耦解決了自駕 VLA 的感知-推理矛盾,在 nuScenes 開迴路與 Bench2Drive 閉迴路評估中全面達到最先進效能。
Cross-Modal Visuo-Tactile Object Perception
Authors: Anirvan Dutta, Simone Tasciotti, Claudia Cusseddu | Submitted: 2026-04-02 | arXiv: 2604.02108
Categories: cs.RO, cs.LG
Research Background: Safe robotic manipulation requires estimating physical properties (stiffness, inertia, contact dynamics) that are only indirectly observable. Vision and tactile sensing provide complementary signals, but existing frameworks treat them as static fusions rather than as evolving beliefs under uncertainty.
研究背景: 安全的機器人操控需要估計只能間接觀測的物理屬性(剛度、慣量、接觸動態)。視覺與觸覺感測提供互補訊號,但現有框架多為靜態融合,未考慮不確定性下的信念演化。
Technical Approach: The Cross-Modal Latent Filter (CMLF) learns a causal latent state-space of physical object properties, supporting bidirectional transfer of cross-modal priors between vision and touch. Sensory evidence is integrated through a Bayesian inference process that evolves dynamically during manipulation.
技術方法: CMLF 學習物體物理屬性的因果潛在狀態空間,支援視覺與觸覺間的雙向跨模態先驗傳遞,並透過隨時間演化的貝葉斯推斷整合感測證據。
Key Takeaway: CMLF improves robustness and efficiency of physical property estimation during manipulation, and exhibits human-like perceptual coupling phenomena including susceptibility to cross-modal illusions.
核心發現: CMLF 提升操控中物理屬性估計的魯棒性與效率,並呈現類人的跨模態感知耦合現象(包括對跨模態錯覺的易感性)。
CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
Authors: Jingliang Li, Jindou Jia, Tuo An | Submitted: 2026-04-02 | arXiv: 2604.02060
Categories: cs.CV, cs.RO
Research Background: When a robot is told to “cut the apple,” it must choose a knife over scissors even though both afford cutting — a challenge existing 3D affordance methods avoid by evaluating isolated objects with explicit category names rather than intent-driven instructions in multi-object scenes.
研究背景: 機器人被指示「切蘋果」時必須在刀與剪刀中選擇,儘管兩者都具備切割功能。現有 3D affordance 方法回避此挑戰,僅對單獨物件搭配明確類別名稱進行評估。
Technical Approach: CompassAD benchmarks implicit intent in confusable multi-object point cloud scenes (30 pairs, 16 affordance types, 88K+ queries). CompassNet addresses it via Instance-Bounded Cross Injection (ICI) for language-geometry alignment within object boundaries, and Bi-level Contrastive Refinement (BCR) for discrimination at geometric-group and point levels.
技術方法: CompassAD 基準測試隱含意圖在多物件點雲場景中的 affordance(30 對混淆對象、16 種 affordance 類型、88K+ 查詢)。CompassNet 透過 ICI 將語言-幾何對齊限制在物件邊界內,並以 BCR 在幾何群組與點級別強化判別。
Key Takeaway: CompassNet achieves SOTA on seen and unseen queries, and real-robot deployment confirms effective transfer to grasping tasks in confusing multi-object scenes.
核心發現: CompassNet 在已見與未見查詢上均達 SOTA,真實機器人部署確認能有效遷移到混淆多物件場景的抓取任務。
SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Authors: Zhengxi Lu, Zhiyuan Yao, Jinyang Wu | Submitted: 2026-04-02 | arXiv: 2604.02268
Categories: cs.LG
Research Background: LLM agents currently rely on inference-time skill injection (retrieving and loading procedural knowledge at runtime), which suffers from retrieval noise, heavy token overhead, and the fundamental problem that the model never truly learns the skill — it only follows instructions.
研究背景: LLM agent 目前依賴推論時技能注入(在執行時取出並載入程序知識),存在檢索雜訊、大量 token 開銷,以及模型永遠無法真正學習技能的根本問題。
Technical Approach: SKILL0 uses a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped by category and rendered with interaction history into visual context. A Dynamic Curriculum evaluates each skill’s on-policy helpfulness within a linearly decaying budget until the agent operates in a fully zero-shot setting.
技術方法: SKILL0 採用從完整技能 context 逐步撤除的訓練課程。技能按類別分組並與互動歷史一起渲染為視覺 context。動態課程在線性衰減預算內評估每個技能的 on-policy 有用性,直到 agent 能在完全零樣本設定下運作。
Key Takeaway: SKILL0 achieves +9.7% on ALFWorld and +6.6% on Search-QA over standard RL baselines, maintaining context under 0.5k tokens per step — demonstrating that zero-shot skill execution is achievable through curriculum-based internalization.
核心發現: SKILL0 在 ALFWorld 上超越標準 RL 基準 9.7%,在 Search-QA 上超越 6.6%,每步 context 保持在 0.5k tokens 以下,證明零樣本技能執行可透過課程化內化實現。
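The linearly decaying budget can be sketched as follows. This is an illustration under assumptions, not SKILL0's implementation: the function names, the truncation to an integer number of skills, and the top-k selection by helpfulness score are all hypothetical.

```python
def skill_budget(step, total_steps, initial_budget):
    """Number of skills allowed in context at a given training step.

    The budget shrinks linearly from initial_budget to 0 over total_steps
    and stays at 0 afterwards (the zero-shot regime).
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return int(initial_budget * remaining)  # fractional skills are dropped


def select_skills(helpfulness, budget):
    """Keep the `budget` skills with the highest helpfulness scores."""
    ranked = sorted(helpfulness, key=helpfulness.get, reverse=True)
    return ranked[:budget]


scores = {"open": 0.9, "heat": 0.1, "clean": 0.5}  # toy on-policy scores
kept = select_skills(scores, skill_budget(step=75, total_steps=100, initial_budget=8))
```

At step 75 of 100 the budget has dropped from 8 to 2, so only the two most helpful skills remain in context. By the final step the budget is 0 and the agent acts without any injected skills.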
PRO-SPECT: Probabilistically Safe Scalable Planning for Energy-Aware Coordinated UAV-UGV Teams
Authors: Roger Fowler, Cahit Ikbal Er, Benjamin Johnsenberg | Submitted: 2026-04-02 | arXiv: 2604.02142
Categories: cs.RO, cs.MA
Research Background: Planning for UAV-UGV teams where the UGV serves as a mobile charging station is complicated by stochastic travel times — existing approaches either assume deterministic travel or use fixed robustness margins that don’t bound failure probability across the full mission.
研究背景: 以 UGV 作為移動充電站的 UAV-UGV 協同規劃因隨機行進時間而複雜化,現有方法要麼假設確定性行進,要麼使用固定穩健裕度,無法對整個任務的失敗概率進行有界控制。
Technical Approach: PRO-SPECT models travel times as random variables and bounds the probability of energy depletion to a user-specified risk level. It formulates the problem as a Mixed-Integer Program and provides a polynomial-time algorithm supporting both offline planning and online re-planning.
技術方法: PRO-SPECT 將行進時間建模為隨機變數,將能量耗盡概率限制在用戶指定的風險水平。以混合整數規劃建構問題,提供支援離線規劃與線上重規劃的多項式時間演算法。
Key Takeaway: PRO-SPECT provides theoretically grounded risk-bounded mission planning for heterogeneous robot teams, enabling safe operation under uncertainty without overly conservative fixed margins.
核心發現: PRO-SPECT 為異構機器人團隊提供理論上有界的風險任務規劃,在不依賴過度保守固定裕度的情況下實現不確定性下的安全操作。
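A risk bound of this kind can be illustrated for a single flight leg. The sketch assumes Gaussian travel times, which the digest does not state, and it does not reproduce the paper's Mixed-Integer Program or its mission-level bound; all names and numbers are hypothetical.

```python
from statistics import NormalDist


def depletion_probability(mean_time, std_time, endurance):
    """P(travel time > endurance), assuming Gaussian travel time."""
    return 1.0 - NormalDist(mean_time, std_time).cdf(endurance)


def is_leg_safe(mean_time, std_time, endurance, risk):
    """True if the chance of running out of energy is at most `risk`."""
    return depletion_probability(mean_time, std_time, endurance) <= risk


# A leg averaging 10 minutes with 2 minutes of spread (toy numbers).
safe_with_16_min = is_leg_safe(10.0, 2.0, endurance=16.0, risk=0.01)
safe_with_12_min = is_leg_safe(10.0, 2.0, endurance=12.0, risk=0.01)
```

A fixed margin, such as always reserving 2 extra minutes, ignores how variable the leg is. A probabilistic check scales the required reserve with the spread of the travel time and the risk level the user is willing to accept.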
ROS 2-Based LiDAR Perception Framework for Mobile Robots in Dynamic Production Environments
Authors: Lukas Bergs, Tan Chung, Marmik Thakkar | Submitted: 2026-04-02 | arXiv: 2604.02109
Categories: cs.RO
Research Background: Adaptive mobile robots in industrial environments need robust 6D pose estimation and multi-object tracking without dependence on large real-world datasets, which are expensive to collect and often insufficient for dynamic, cluttered scenarios.
研究背景: 工業環境中的自適應移動機器人需要穩健的 6D 姿態估計與多目標追蹤,且不能依賴昂貴且往往不足的真實世界資料集。
Technical Approach: A ROS 2 framework integrates Transformation-Equivariant 3D Detection (trained on synthetic data) with multi-object tracking using center poses. The framework is validated across 72 scenarios against motion-capture ground truth.
技術方法: ROS 2 框架整合以合成資料訓練的變換等變 3D 偵測器與使用中心姿態的多目標追蹤,在 72 個場景中以動作捕捉系統提供地面真值驗證。
Key Takeaway: Multi-object tracking integration raises IoU from 62.6% (standalone detection) to 83.12%, achieving 91.12% Higher Order Tracking Accuracy on industrial mobile manipulators.
核心發現: 整合多目標追蹤將 IoU 從 62.6%(單獨偵測)提升至 83.12%,在工業移動機械臂上達到 91.12% 的高階追蹤精度。
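For readers unfamiliar with the metric, intersection over union (IoU) is the overlap of a predicted and a ground-truth box divided by the area they cover together. The sketch below uses 2D axis-aligned boxes for brevity, which is an assumption: the paper evaluates 6D poses, and its exact IoU variant is not given in this digest.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlap along each axis, clamped at zero when the boxes do not touch.
    overlap_x = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_y = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_x * overlap_y
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 0.0, 3.0, 2.0)
iou = box_iou(predicted, ground_truth)
```

Here the boxes share half of their area, yet the IoU is only one third, which shows how strongly the metric penalizes even moderate pose offsets.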
HyVGGT-VO: Tightly Coupled Hybrid Dense Visual Odometry with Feed-Forward Models
Authors: Junxiang Pan, Lipu Zhou, Baojie Chen | Submitted: 2026-04-02 | arXiv: 2604.02107
Categories: cs.RO
Research Background: Dense visual SLAM using feed-forward models provides rich 3D reconstruction but is too computationally heavy for real-time pose estimation at high frequency. Traditional sparse VO is efficient but lacks dense reconstruction capability — a gap that limits practical deployment in robotics and AR.
研究背景: 使用前饋模型的稠密視覺 SLAM 提供豐富 3D 重建,但計算量過大無法支援高頻即時姿態估計。傳統稀疏 VO 效率高但缺乏稠密重建能力,限制了機器人與 AR 的實際部署。
Technical Approach: HyVGGT-VO tightly couples traditional sparse VO with VGGT (a state-of-the-art feed-forward model) via an adaptive hybrid tracking frontend that dynamically switches between optical flow and the VGGT tracking head. A hierarchical optimization framework jointly refines VO poses and VGGT scale predictions for global consistency.
技術方法: HyVGGT-VO 透過自適應混合追蹤前端(動態切換光流與 VGGT 追蹤頭)緊耦合傳統稀疏 VO 與 VGGT 前饋模型,並以層次優化框架聯合精調 VO 姿態與 VGGT 尺度預測以維持全域一致性。
Key Takeaway: Compared to existing VGGT-based methods, HyVGGT-VO achieves a roughly 5× processing speedup while reducing trajectory error by 85% on EuRoC (indoor) and 12% on KITTI (outdoor).
核心發現: 相較現有 VGGT 方法,HyVGGT-VO 實現約 5 倍加速,同時在 EuRoC 室內資料集上將軌跡誤差減少 85%,在 KITTI 室外基準上減少 12%。
Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
Authors: Virmarie Maquiling, Yasmeen Abdrabou, Enkelejda Kasneci | Submitted: 2026-04-02 | arXiv: 2604.01909
Categories: cs.CV, cs.HC
Research Background: Corneal glint detection is foundational to pupil-corneal reflection eye tracking, a key input modality for HRI and gaze-based robot control. Yet existing methods embed detection as heuristics inside larger systems, making cross-hardware reproducibility and benchmark comparison nearly impossible.
研究背景: 角膜反光偵測是瞳孔-角膜反射眼動追蹤的基礎,也是 HRI 與注視控制機器人的關鍵輸入模式。但現有方法將偵測作為啟發式嵌入大型系統,使跨硬體的可重現性與基準比較幾乎不可能。
Technical Approach: Night Eyes treats glints as structured constellations (inspired by lost-in-space star identification) and applies a Similarity-Layout Alignment (SLA) procedure adapted to multi-LED eye tracker constraints. The pipeline explicitly separates detection and correspondence, using controlled over-detection and appearance-aware scoring.
技術方法: Night Eyes 將反光點視為結構化星座(受航太星圖識別啟發),應用適應多 LED 眼動儀約束的相似度佈局對齊(SLA)程序,明確分離偵測與對應步驟,使用受控過度偵測與外觀感知評分。
Key Takeaway: Night Eyes provides stable identity-preserving glint correspondence under noisy conditions, with full code, presets, and evaluation scripts released to enable transparent comparison across setups.
核心發現: Night Eyes 在雜訊條件下提供穩定的保持身份一致性的反光點對應,並完整發布程式碼、預設值與評估腳本以支援跨裝置透明比較。
A Simple Baseline for Streaming Video Understanding
Authors: Yujiao Shen, Shulin Tian, Jingkang Yang | Submitted: 2026-04-02 | arXiv: 2604.02317
Categories: cs.CV
Research Background: Real-time video understanding is essential for robots operating in dynamic environments, and complex memory mechanisms have been proposed to handle long video streams. This paper questions whether that complexity is warranted.
研究背景: 即時影片理解對於動態環境中操作的機器人至關重要,已有複雜記憶機制被提出以處理長影片串流,但這種複雜性是否必要值得質疑。
Technical Approach: SimpleStream feeds only the most recent N frames to an off-the-shelf VLM in a sliding-window fashion, with no custom memory modules. It is evaluated against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench.
技術方法: SimpleStream 以滑動視窗方式僅將最近 N 幀輸入現成 VLM,不使用自定記憶模組。對比 13 個主要離線與線上影片 LLM 基準在 OVO-Bench 與 StreamingBench 上進行評估。
Key Takeaway: With only 4 recent frames, SimpleStream matches or surpasses published streaming models (67.7% on OVO-Bench, 80.59% on StreamingBench), revealing that many complex methods don’t outperform a well-calibrated simple baseline.
核心發現: 僅用最近 4 幀,SimpleStream 達到或超越已發表的串流模型(OVO-Bench 67.7%、StreamingBench 80.59%),揭示許多複雜方法並不優於校準良好的簡單基準。
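The sliding-window idea is simple enough to sketch in a few lines. This is an illustration, not the released SimpleStream code: the class name `RecentFrames` is hypothetical, frames are stand-in strings, and the call to the VLM is left out.

```python
from collections import deque


class RecentFrames:
    """Hold only the most recent n_frames frames of a video stream."""

    def __init__(self, n_frames=4):
        # A bounded deque drops the oldest frame automatically when full.
        self.frames = deque(maxlen=n_frames)

    def push(self, frame):
        self.frames.append(frame)

    def context(self):
        """Frames, oldest first, that would be handed to the VLM."""
        return list(self.frames)


window = RecentFrames(n_frames=4)
for index in range(10):
    window.push(f"frame_{index}")
current = window.context()
```

After ten frames only the last four remain, so memory use and VLM input length stay constant no matter how long the stream runs.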
Omni123: Exploring 3D Native Foundation Models with Limited 3D Data
Authors: Chongjie Ye, Cheng Cao, Chuanyu Pan | Submitted: 2026-04-02 | arXiv: 2604.02289
Categories: cs.CV, cs.AI
Research Background: Extending multimodal foundation models to 3D remains difficult due to scarce high-quality 3D data. Existing methods lift 2D results into 3D via optimization, sacrificing geometric consistency — a critical limitation for robot spatial understanding.
研究背景: 由於高品質 3D 資料稀缺,將多模態基礎模型擴展到 3D 仍然困難。現有方法透過最佳化將 2D 結果提升至 3D,犧牲幾何一致性——這對機器人空間理解是關鍵限制。
Technical Approach: Omni123 unifies text-to-2D and text-to-3D generation in a single autoregressive framework, representing text, images, and 3D as discrete tokens in a shared sequence space. An interleaved X-to-X training paradigm traverses semantic-visual-geometric cycles without requiring fully aligned text-image-3D triplets.
技術方法: Omni123 在單一自迴歸框架中統一文字到 2D 與文字到 3D 生成,將文字、影像與 3D 表示為共享序列空間中的離散 token。交錯 X-to-X 訓練範式在不需要完全對齊三元組的情況下遍歷語意-視覺-幾何循環。
Key Takeaway: Omni123 significantly improves text-guided 3D generation and editing by using abundant 2D data as geometric prior, offering a scalable path toward multimodal 3D world models relevant to robot scene understanding.
核心發現: Omni123 以豐富的 2D 資料作為幾何先驗,大幅改善文字引導的 3D 生成與編輯,為與機器人場景理解相關的多模態 3D 世界模型提供可擴展路徑。