This note was generated by AI analysis.
Summary
ML6’s hands-on field report is the most practically grounded LeRobot evaluation available. They ran ACT and GR00T-N1 on SO-ARM100 arms across structured pick-and-place and deformable object tasks. ACT hits 90% on simple positional tasks with ~46k frames but fails on distribution shifts (different camera angles). GR00T-N1 handles more complex tasks (textile manipulation at 60–80%) but stutters due to inference latency. The central finding: data quality and curation matter more than model choice; loss curves don’t predict physical success.
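ML6 carried out the most practically useful evaluation of ACT and GR00T-N1 on the SO-ARM100. ACT reached a 90% success rate on a simple positional task with 46k frames but failed to generalize. GR00T-N1 handled more complex tasks (textile manipulation at 60-80%) but stuttered because of inference latency. The core finding: data quality matters more than model choice.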
Key Points
- ACT on 5-position task: 90% success with 46k frames (~25 min teleoperation recording)
- ACT failure mode: no generalization to changed camera angles, i.e. brittle under distribution shift
- GR00T-N1 on textile spreading: 60% success with 53k frames (29 min)
- GR00T-N1 on towel folding: 80% with 76k frames (42 min)
- GR00T-N1 failure mode: inference latency causes stuttering motion; addressed in N1.5 plus async inference
- Dataset curation rule, 4 factors: accuracy, controlled sequences, comprehensive coverage, and robustness (include error recovery)
- Evaluation challenge: training loss does NOT correlate with physical success; millimeter-level errors are enough to cause manipulation failure
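The stuttering failure mode in the list above can be illustrated with a toy timing model. This is a minimal sketch, not ML6's or LeRobot's implementation: the chunk size, latency, and prefetch threshold are hypothetical numbers, and `count_stalled_ticks` is an illustrative helper. It counts control ticks on which the robot has no action to execute, first when a new action chunk is requested only once the queue is empty, then when the request is issued in the background before the queue runs dry.

```python
def count_stalled_ticks(total_ticks, chunk_size, latency_ticks, prefetch_at=None):
    """Count control ticks on which no action is available.

    prefetch_at=None models synchronous inference: a new chunk is requested
    only when the action queue is found empty. Otherwise the request is
    issued as soon as the queue length drops to `prefetch_at`.
    All quantities are hypothetical and measured in control ticks.
    """
    if latency_ticks < 1:
        raise ValueError("latency_ticks must be at least 1")
    threshold = 0 if prefetch_at is None else prefetch_at
    queue = 0        # actions ready to execute
    pending = None   # ticks left until the in-flight chunk arrives
    stalled = 0
    for _ in range(total_ticks):
        # Deliver the in-flight chunk once its latency has elapsed.
        # Simplification: the chunk is appended, not merged with old actions.
        if pending == 0:
            queue += chunk_size
            pending = None
        # Request a new chunk if none is in flight and the queue is low.
        if pending is None and queue <= threshold:
            pending = latency_ticks
        # Execute one action, or stall if the queue is empty.
        if queue > 0:
            queue -= 1
        else:
            stalled += 1
        if pending is not None:
            pending -= 1
    return stalled


sync_stalls = count_stalled_ticks(100, chunk_size=16, latency_ticks=5)
async_stalls = count_stalled_ticks(100, chunk_size=16, latency_ticks=5, prefetch_at=5)
print(sync_stalls, async_stalls)  # 25 5
```

With these made-up numbers the synchronous loop stalls for 5 ticks in every 21-tick cycle, while the prefetching loop stalls only during the initial warm-up. Prefetching hides latency only when the threshold is at least as large as the latency in ticks.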
Insights
Recording time is the key practical constraint: 25–42 minutes of teleoperation was enough to reach usable performance. That budget is accessible, but it requires skilled operators, because bad demonstrations hurt more than having fewer good ones.
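As a back-of-the-envelope check, the frame counts and recording times listed under Key Points are mutually consistent with a capture rate of roughly 30 frames per second. The rate is not stated in the source; it is inferred here from the reported (rounded) pairs, and `implied_fps` is a hypothetical helper.

```python
# (frames, minutes) pairs as reported under Key Points; both values are rounded.
RECORDINGS = {
    "ACT, 5-position task": (46_000, 25),
    "GR00T-N1, textile spreading": (53_000, 29),
    "GR00T-N1, towel folding": (76_000, 42),
}


def implied_fps(frames, minutes):
    """Frames per second implied by a frame count and a duration in minutes."""
    return frames / (minutes * 60)


for task, (frames, minutes) in RECORDINGS.items():
    print(f"{task}: {implied_fps(frames, minutes):.1f} fps")
```

All three pairs land between 30 and 31 fps, so under that assumption each additional 10 minutes of teleoperation adds roughly 18k frames.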
ACT fails when the camera angle changes, while GR00T-N1 handles deformable objects. This suggests the two models occupy different niches: ACT for precise, repetitive tasks in fixed setups; vision-language-action models (VLAs) such as GR00T-N1 for tasks that require semantic understanding or tolerance of physical variation.
ML6 placed 3rd in the 2025 LeRobot Hackathon, using Gaussian splatting to handle camera instability. This is a real production pattern for stabilizing visual observations.
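Data recording time is the key practical constraint: 25-42 minutes of teleoperation can reach usable performance. ACT fails when the camera angle changes, whereas GR00T-N1 can handle deformable objects, so the two models occupy different application scenarios: ACT suits precise, repetitive tasks in fixed environments, and VLAs suit tasks that require semantic understanding or involve physical variation.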
Connections
- Clippings-lerobot-open-source-robot-learning-library-arxiv
- Clippings-vla-0-building-state-of-the-art-vlas-with-zero-modification
Raw Excerpt
Imitation learning is “closer than most expect” for production robotics in controlled environments with repetitive, structured tasks.