Lecture Overview
This lecture covers how the Transformer architecture has been adapted for vision tasks: Vision Transformer (ViT), DeiT, Swin Transformer, and modern design choices including activation functions, positional encodings, and token design.
Supplementary Notes
SwiGLU in Vision Transformers
SwiGLU works well in NLP Transformers (e.g., LLaMA), but in Vision Transformers it can cause an over-gating problem: the gating mechanism filters out too much information, aggressively suppressing activations. This appears to interact with LayerScale (a learnable per-channel scaling applied to the output of each residual branch before it is added back to the skip connection) — when LayerScale's small values are combined with SwiGLU's multiplicative gating, the effective signal can be doubly attenuated.
For this reason, GELU remains the recommended activation for Vision Transformers. The over-gating issue is an active area of investigation.
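To make the contrast concrete, here is a minimal numpy sketch of the two FFN variants. The dimensions are toy values and the weights are random; the point is only the structural difference — in SwiGLU, the SiLU branch multiplies the up-projection elementwise, so near-zero gate values can zero out channels entirely, which a GELU FFN cannot do.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn_gelu(x, w1, w2):
    # standard ViT FFN: Linear -> GELU -> Linear
    return gelu(x @ w1) @ w2

def ffn_swiglu(x, w_gate, w_up, w2):
    # SwiGLU FFN: silu(x @ w_gate) gates (x @ w_up) elementwise;
    # gate values near zero suppress those channels completely
    return (silu(x @ w_gate) * (x @ w_up)) @ w2

rng = np.random.default_rng(0)
d, h = 8, 32  # toy embedding / hidden sizes
x = rng.standard_normal((4, d))
y_gelu = ffn_gelu(x, rng.standard_normal((d, h)), rng.standard_normal((h, d)))
y_swiglu = ffn_swiglu(x, rng.standard_normal((d, h)),
                      rng.standard_normal((d, h)), rng.standard_normal((h, d)))
```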
Q&A from Lecture
1. Is [CLS] Token Sufficient for Global Representation?
For classification: yes. Image classification compresses the entire image into a single label, so one global feature vector is sufficient. [CLS] and GAP perform similarly for this task. The more practical concern, as discussed in lecture, is the alignment problem between [CLS] and patch tokens during training.
For dense prediction: no. Tasks like detection and segmentation require spatial information that a single [CLS] token cannot provide. Different models handle this differently:
CLIP uses only [CLS] (it was designed for classification and retrieval). SAM discards [CLS] and relies on dense patch-token features. DINOv2 uses [CLS] for classification but patch tokens for dense tasks. This will be explored further in Week 6 (Detection + Segmentation).
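The distinction above can be sketched directly on a ViT's output tokens. This is a toy example with random features and a hypothetical 14×14 patch grid: classification needs only one global vector ([CLS] or GAP), while dense prediction reshapes the patch tokens back onto the spatial grid.

```python
import numpy as np

# toy ViT output: 1 [CLS] token followed by 14x14 patch tokens, dim 16
tokens = np.random.default_rng(0).standard_normal((1 + 14 * 14, 16))

cls_feat = tokens[0]                 # single global vector: enough for classification
patch_feats = tokens[1:]             # per-patch features: needed for dense tasks
gap_feat = patch_feats.mean(axis=0)  # GAP alternative to [CLS]

# dense prediction (detection/segmentation) uses the spatial feature map
feat_map = patch_feats.reshape(14, 14, 16)
```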
2. Register Tokens
Vision Transformers Need Registers (Darcet et al., ICLR 2024) identified an artifact in ViTs like DINOv2 and CLIP: background patches with relatively low information content develop abnormally high norms. This happens because the model repurposes unused background patches as internal memory to store global information — essentially recycling low-information patches as scratch space.
This causes artifacts in attention map visualization and degrades dense prediction performance. The solution is to add a few (typically 4) learnable tokens — similar to [CLS] — that serve as dedicated global information storage. At inference, these register tokens are discarded; only [CLS] and patch tokens are used.
With registers, patch tokens can focus on their intended role (local feature encoding), producing cleaner attention maps and better dense prediction. DINOv2 provides separate model checkpoints with and without registers.
3. Combining [CLS] with GAP or Patch Tokens
Simple concatenation: Yes — concatenating [CLS] with a GAP (or max-pooled) summary of the patch tokens is a straightforward way to combine global and local signals when you want both; Sentence-BERT uses an analogous concatenation of pooling strategies in NLP.
Learnable pooling (MAP): Beyond simple GAP, Multihead Attention Pooling (MAP) uses a learnable query vector that performs cross-attention with all patch tokens. SigLIP/SigLIP 2 uses this approach (though the original SigLIP paper doesn't explicitly mention it, as it builds on the authors' prior work).
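A minimal single-head sketch of attention pooling (MAP is multi-head in practice; the single head and toy dimensions here are simplifications): a learnable query cross-attends over all patch tokens, producing a pooled global feature. The simple-concatenation option from above is shown at the end for comparison.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patch_tokens, query, w_q, w_k, w_v):
    # single-head sketch of Multihead Attention Pooling:
    # a learnable query cross-attends over all patch tokens
    q = query @ w_q                    # (1, d)
    k = patch_tokens @ w_k             # (N, d)
    v = patch_tokens @ w_v             # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (1, N) attention weights
    return (attn @ v)[0]               # pooled global feature, shape (d,)

rng = np.random.default_rng(0)
d, n = 16, 196
patches = rng.standard_normal((n, d))
query = rng.standard_normal((1, d))    # learnable in practice
w = [rng.standard_normal((d, d)) * d**-0.5 for _ in range(3)]
pooled = attention_pool(patches, query, *w)

# simple-concatenation alternative: pooled global feature + GAP of patches
gap = patches.mean(axis=0)
combined = np.concatenate([pooled, gap])   # shape (2d,)
```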
Late [CLS] injection (CaiT): CaiT (Touvron et al., 2021) identifies a problem with the standard ViT design: the same attention parameters must simultaneously handle two different jobs — modeling patch-to-patch relationships and aggregating classification information into [CLS]. CaiT separates these by running self-attention among patches only in early layers, then inserting [CLS] in the final 2 layers to read from the patch tokens via cross-attention.
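The two-stage structure can be sketched as below. This is a heavily simplified numpy skeleton (random weights, no LayerNorm/FFN/multi-head, toy depth): early blocks run self-attention among patches only, and the final two blocks update only the [CLS] row, which reads from the patch tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d, n = 16, 196
patches = rng.standard_normal((n, d))
cls = rng.standard_normal((1, d))

# stage 1: self-attention among patch tokens only ([CLS] not yet inserted)
for _ in range(10):
    w = [rng.standard_normal((d, d)) * d**-0.5 for _ in range(3)]
    patches = patches + self_attn(patches, *w)

# stage 2 (class-attention): [CLS] is inserted and reads from the patch
# tokens via cross-attention; only the [CLS] row is updated
for _ in range(2):
    w_q, w_k, w_v = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
    kv = np.concatenate([cls, patches], axis=0)
    attn = softmax((cls @ w_q) @ (kv @ w_k).T / np.sqrt(d))
    cls = cls + attn @ (kv @ w_v)
```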
4. Recent Research on [CLS]–Patch Token Interaction
Revisiting [CLS] and Patch Token Interaction in Vision Transformers (Marouani et al., ICLR 2026) takes the CaiT insight further. Rather than only separating [CLS] injection temporally (early vs. late layers), this work separates the processing itself: LayerNorm, LayerScale, and QKV projections are given separate parameters for [CLS] and patch tokens in approximately the first third of the blocks.
Crucially, [CLS] and patch tokens still interact during the attention computation itself, so information exchange is preserved. But the pre- and post-processing around attention is fully decoupled. This recognizes that [CLS] (global summary) and patch tokens (local features) have fundamentally different roles and benefit from separate parameterization.
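A sketch of this decoupling, under simplifying assumptions (single head, random weights, LayerNorm without learned affine parameters, and hypothetical variable names): each token type gets its own normalization and QKV projections, but the attention itself runs jointly over the concatenated tokens, so [CLS]–patch information exchange is preserved.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # LayerNorm without the learned affine parameters, for brevity
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n = 16, 196
cls = rng.standard_normal((1, d))
patches = rng.standard_normal((n, d))

# separate QKV projections for the two token types (hypothetical names)
w_cls = {k: rng.standard_normal((d, d)) * d**-0.5 for k in "qkv"}
w_pat = {k: rng.standard_normal((d, d)) * d**-0.5 for k in "qkv"}

# decoupled pre-processing: each token type uses its own norm + projections...
q = np.concatenate([layernorm(cls) @ w_cls["q"], layernorm(patches) @ w_pat["q"]])
k = np.concatenate([layernorm(cls) @ w_cls["k"], layernorm(patches) @ w_pat["k"]])
v = np.concatenate([layernorm(cls) @ w_cls["v"], layernorm(patches) @ w_pat["v"]])

# ...but attention is computed jointly over all tokens,
# so [CLS] <-> patch information exchange is preserved
out = softmax(q @ k.T / np.sqrt(d)) @ v
```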
Further Reading
Course Notes & Lectures
- Blog Google AI Blog: ViT — Accessible overview of the Vision Transformer
Key Papers
- Paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale — Dosovitskiy et al. (ICLR 2021). The original ViT paper.
- Paper Training Data-Efficient Image Transformers (DeiT) — Touvron et al. (ICML 2021). Knowledge distillation for ViT.
- Paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows — Liu et al. (ICCV 2021).
- Paper A ConvNet for the 2020s (ConvNeXt) — Liu et al. (CVPR 2022). CNN redesigned with ViT principles.
- Paper ViT-5: Vision Transformers for The Mid-2020s — Wang et al. (2026). Modern ViT design principles and best practices.
Papers Discussed in Q&A
- Paper Vision Transformers Need Registers — Darcet et al. (ICLR 2024). Identifies artifact tokens and proposes register tokens.
- Paper Going Deeper with Image Transformers (CaiT) — Touvron et al. (ICCV 2021). Late [CLS] injection for separating patch interaction and classification.
- Paper Revisiting [CLS] and Patch Token Interaction in Vision Transformers — Marouani et al. (ICLR 2026). Decoupled parameterization for [CLS] and patch tokens.