This note was generated by AI analysis.
Created: 2026-03-28 Source: https://medium.com/@lixue421/paper-reading-summary-1-vision-transformer-1489da75ea0b
Summary
Shirley Li’s explainer of the Vision Transformer (ViT) paper (“An Image is Worth 16x16 Words,” Dosovitskiy et al., 2021). Covers CNN limitations that motivated ViT, the patch-embedding mechanism, position encodings, self-attention for global dependencies, and ViT’s impact on multi-modal models (CLIP, DALL·E, LLaVA).
Key Points
- CNN limitations: convolutions capture local spatial features (edges, textures) well, but each layer only sees a kernel-sized neighborhood, so long-range dependencies emerge only indirectly as layers are stacked
- ViT approach: split the image into fixed-size patches (16×16 pixels each) → flatten + linear embedding → treat as a sequence of “visual tokens” → standard Transformer encoder
- Position encoding: added to patch embeddings (CNN has implicit position via spatial structure; Transformer needs explicit position information)
- Self-attention: every patch attends to every other patch → global dependencies captured; unlike CNN’s local kernel
- Requires large data: self-attention is powerful but data-hungry because it lacks convolution’s built-in locality bias; ResNets outperform ViT on small datasets, so large-scale pre-training is needed
- Multi-modal impact: ViT’s patch-token representation bridges vision and language, enabling CLIP’s shared embedding space and LLaVA’s visual tokens
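The patch-embedding and position-encoding steps in the list above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the helper names `patchify` and `embed_patches`, the 224×224 input size, and the embedding width of 64 are assumptions chosen for the example, and random arrays stand in for parameters that would be learned in a real model. The learnable class token that ViT prepends to the sequence is omitted for brevity.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def embed_patches(patches: np.ndarray, weight: np.ndarray, pos_embed: np.ndarray) -> np.ndarray:
    """Linearly project each flattened patch, then add a per-position embedding."""
    return patches @ weight + pos_embed

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real RGB image
embed_dim = 64                      # illustrative width, not taken from the paper

patches = patchify(image)           # one row per 16x16 patch
# Random stand-ins for the learned projection and learned position embeddings.
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], embed_dim))

tokens = embed_patches(patches, weight, pos_embed)
print(patches.shape, tokens.shape)  # (196, 768) (196, 64)
```

A 224×224 image yields 14 × 14 = 196 patches, each flattened to 16 × 16 × 3 = 768 values, so the Transformer encoder would receive a sequence of 196 visual tokens.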
Insights
The “images as sequences of patches” reframe is the key insight that made Transformers work for vision. By treating 16×16 image patches like words in a sentence, ViT brings the full Transformer machinery (self-attention, positional encoding, pre-training) to image tasks. The large-data requirement explains both ViT’s delayed adoption (datasets like JFT-300M were needed) and why it thrives in the CLIP/LLM era where internet-scale pre-training is available. The bridge to multi-modal models is ViT’s most impactful contribution: it created a representation format (visual tokens) that language models can naturally consume.
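To make the "every patch attends to every other patch" idea concrete, here is a small single-head self-attention sketch in NumPy. It is only an illustration under stated assumptions: the function name `self_attention`, the token count of 196, and the width of 64 are chosen for the example, the input tokens are random rather than real patch embeddings, and a full ViT layer would add multiple heads, residual connections, layer normalization, and an MLP block.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence.

    Returns the attended tokens and the (num_tokens, num_tokens) weight matrix.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every token pair
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
num_tokens, dim = 196, 64                    # e.g. 14 x 14 patches; width is illustrative
tokens = rng.normal(size=(num_tokens, dim))  # random stand-in for patch embeddings
# Random stand-ins for the learned query, key, and value projections.
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

out, attn = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (196, 64) (196, 196)
```

The 196 × 196 weight matrix has a nonzero entry for every pair of patches, which is the global dependency a single convolution kernel cannot express. It also shows the cost: the matrix grows quadratically with the number of tokens.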
Connections
Raw Excerpt
ViT demonstrated that self-attention mechanisms could effectively model global dependencies in images, outperforming convolutional neural networks in many tasks with sufficient data and computational resources.