Summary

ML6’s hands-on field report is among the most practically grounded LeRobot evaluations available. They ran ACT and GR00T-N1 on SO-ARM100 arms across structured pick-and-place and deformable-object tasks. ACT reaches 90% success on a simple 5-position task with ~46k frames but fails under distribution shift, such as a changed camera angle. GR00T-N1 handles more complex tasks (textile manipulation at 60–80% success) but its motion stutters because of inference latency. The central finding: data quality and curation matter more than model choice, and loss curves do not predict physical success.

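In short: ML6 evaluated ACT and GR00T-N1 on the SO-ARM100 in a highly practical field test. ACT reaches a 90% success rate on simple positional tasks with 46k frames but does not generalize. GR00T-N1 handles more complex tasks (textile manipulation, 60–80%) but stutters because of inference latency. Core finding: data quality matters more than model choice.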

Key Points

  • ACT on 5-position task: 90% success with 46k frames (~25 min teleoperation recording)
  • ACT failure mode: no generalization to camera angle changes, i.e. brittle under distribution shift
  • GR00T-N1 on textile spreading: 60% success with 53k frames (29 min)
  • GR00T-N1 on towel folding: 80% with 76k frames (42 min)
  • GR00T-N1 failure mode: inference latency causes stuttering motion; addressed by N1.5 combined with asynchronous inference
  • Dataset curation rule, 4 factors: accuracy, controlled sequences, comprehensive coverage, robustness (include error recovery)
  • Evaluation challenge: training loss does not correlate with physical success; millimeter-level errors are enough to cause a manipulation failure
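To see why inference latency shows up as stuttering, and why asynchronous inference helps, here is a minimal timing sketch. It assumes a policy that predicts a chunk of actions per inference call, which the arm then plays back at a fixed control rate. The chunk size, control rate, and latency values are hypothetical illustrations, not measurements from ML6. The asynchronous case is a simplified model (inference for the next chunk starts as soon as the current chunk starts executing), not a description of any library's actual implementation.

```python
def sync_idle_fraction(chunk_size: int, control_hz: float, latency_s: float) -> float:
    """Fraction of wall-clock time the arm stands still when inference
    blocks execution: run a chunk, then wait for the next prediction."""
    exec_s = chunk_size / control_hz  # time to play back one chunk
    return latency_s / (exec_s + latency_s)


def async_idle_fraction(chunk_size: int, control_hz: float, latency_s: float) -> float:
    """Same quantity when the next chunk is computed while the current one
    executes. The arm only stalls if inference takes longer than playback."""
    exec_s = chunk_size / control_hz
    stall_s = max(0.0, latency_s - exec_s)
    return stall_s / (exec_s + stall_s)


# Hypothetical numbers, chosen only for illustration.
CHUNK_SIZE = 16      # actions per inference call (assumption)
CONTROL_HZ = 30.0    # control loop rate (assumption)

for latency_s in (0.1, 0.4, 1.0):  # assumed inference latencies in seconds
    sync = sync_idle_fraction(CHUNK_SIZE, CONTROL_HZ, latency_s)
    asyn = async_idle_fraction(CHUNK_SIZE, CONTROL_HZ, latency_s)
    print(f"latency {latency_s:.1f}s: sync idle {sync:.0%}, async idle {asyn:.0%}")
```

In this model the blocking loop pauses after every chunk, which is what reads as stutter. The overlapped loop hides latency entirely as long as inference finishes before the current chunk runs out, at the cost of acting on a slightly older observation.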

Insights

The recording times are the key practical constraint: 25–42 minutes of teleoperation is enough to reach usable performance. That is accessible, but it requires skilled operators, because a small set of good demonstrations is worth more than a larger set that includes bad ones.
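As a quick sanity check on these numbers, the frame counts and recording times from the key points can be converted into an implied frame rate, which then gives a rough way to budget recording time for a target dataset size. This is a back-of-the-envelope sketch: the frame counts are the rounded values quoted above, and the helper names are made up for illustration.

```python
# (frames, minutes) pairs as quoted in the key points; frame counts are rounded.
recordings = {
    "ACT, 5-position task": (46_000, 25),
    "GR00T-N1, textile spreading": (53_000, 29),
    "GR00T-N1, towel folding": (76_000, 42),
}


def implied_fps(frames: int, minutes: float) -> float:
    """Frames per second implied by a frame count and a recording duration."""
    return frames / (minutes * 60)


def minutes_needed(frames: int, fps: float) -> float:
    """Recording time in minutes needed to collect `frames` at `fps`."""
    return frames / fps / 60


for task, (frames, minutes) in recordings.items():
    print(f"{task}: {implied_fps(frames, minutes):.1f} frames/s")
```

All three pairs work out to roughly 30 frames per second, so the quoted frame counts and durations are consistent with each other. Under that rate, `minutes_needed(100_000, 30.0)` estimates the pure recording time for a 100k-frame dataset, excluding resets and discarded episodes.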

ACT fails when the camera angle changes, while GR00T-N1 handles deformable objects. This suggests the two models occupy different niches: ACT for precise, repetitive tasks in fixed setups; vision-language-action models (VLAs) such as GR00T-N1 for tasks that require semantic understanding or tolerance to physical variation.

ML6 placed 3rd in the 2025 LeRobot Hackathon, using Gaussian splatting to cope with camera instability. This is a practical pattern for stabilizing visual observations in production settings.

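In short: recording time is the key practical constraint, and 25–42 minutes of teleoperation is enough to reach usable performance. ACT fails when the camera angle changes, whereas GR00T-N1 can handle deformable objects, so the two models fit different application scenarios: ACT suits precise, repetitive tasks in fixed environments, and VLAs suit tasks that need semantic understanding or involve physical variation.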

Connections

Raw Excerpt

Imitation learning is “closer than most expect” for production robotics in controlled environments with repetitive, structured tasks.