Summary

Overview of Meta’s DINOv3, a self-supervised vision foundation model trained on 1.689B web images plus 493M satellite images, with a 7B-parameter ViT teacher distilled into a family of smaller models. Frozen DINOv3 features achieve SOTA across classification, dense segmentation, depth, tracking, and remote sensing, often without fine-tuning.

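Overview of Meta DINOv3: this self-supervised vision foundation model was trained on 1.689 billion images and distilled from a 7-billion-parameter ViT teacher into a family of small models. Its frozen features reach SOTA level on tasks such as classification, dense segmentation, depth estimation, and remote sensing without fine-tuning.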

Key Points

  • Training data: 1.689B web images + 493M satellite images (curated, not raw internet)
  • Architecture: ViT-7B teacher distilled to family including ConvNeXt and smaller ViTs
  • Training recipe: DINO self-distillation + iBOT masking + regularizers (scaled-up DINOv2 recipe)
  • Frozen features + simple heads (k-NN, linear, light adapters) achieve SOTA on global and dense tasks
  • Practical guidance: prefer frozen features first, fine-tune only if necessary
  • Available on Hugging Face
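
The "DINO self-distillation" part of the training recipe can be illustrated with a minimal sketch. It follows the loss as described in the original DINO formulation (a centred, sharpened teacher distribution supervising a student, with the teacher updated as a moving average of the student) and is not DINOv3's actual training code. The function names, the temperatures, and the momentum value are illustrative assumptions, and the iBOT masking and regularizer terms are omitted.

```python
import math

def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, computed stably via the max trick."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the teacher distribution to the student distribution.

    The teacher logits are centred, then sharpened with a low temperature.
    In real training no gradient flows through the teacher branch.
    Temperatures here are assumed values for illustration only.
    """
    centred = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centred, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy check: a student that agrees with the teacher gets a lower loss.
teacher = [2.0, 0.0, -1.0]
center = [0.0, 0.0, 0.0]
loss_agree = dino_loss([2.0, 0.0, -1.0], teacher, center)
loss_disagree = dino_loss([-1.0, 0.0, 2.0], teacher, center)
```

In this toy setup the low teacher temperature makes the target nearly one-hot, so the loss mostly measures how much probability the student puts on the teacher's top prototype.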

Insights

DINOv3 represents the maturation of self-supervised vision pretraining: for most practical tasks, the gap between SSL and supervised pretraining now appears to be essentially closed. The inclusion of satellite imagery is a notable expansion, suggesting Meta is targeting geospatial AI as a first-class use case. The “prefer frozen features” recommendation matters for compute budgets: the backbone is used for inference only, and a lightweight head (k-NN, linear, or a light adapter) is the only part that needs fitting, so SOTA-level results do not require GPU-intensive fine-tuning of the backbone.
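
As a rough illustration of why frozen features are cheap to use, here is a minimal k-NN head over precomputed embeddings. The toy 2-D vectors stand in for embeddings that a frozen backbone would produce; no DINOv3 model is loaded, and `knn_predict` and `l2_normalize` are hypothetical helper names, not part of any DINOv3 API.

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def knn_predict(query, bank_features, bank_labels, k=3):
    """Majority vote over the k bank embeddings most cosine-similar to the query."""
    q = l2_normalize(query)
    scored = []
    for feat, label in zip(bank_features, bank_labels):
        f = l2_normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, f)), label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Toy "embedding bank": placeholder vectors, assumed to come from a frozen backbone.
bank_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
bank_labels = ["cat", "cat", "dog", "dog"]

pred_a = knn_predict([0.8, 0.3], bank_features, bank_labels, k=3)
pred_b = knn_predict([0.0, 1.0], bank_features, bank_labels, k=3)
```

Nothing here is trained: adding a class only means appending labelled embeddings to the bank, which is the kind of low-cost adaptation the frozen-features-first guidance points to.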

Connections

Raw Excerpt

Frozen DINOv3 features + simple heads (k-NN, linear, light adapters) deliver SOTA-level results across global and dense tasks.