Week 5: Transformers in Vision

Lecture Notes, Q&A, and Further Reading

Lecture Overview

This lecture covers how the Transformer architecture has been adapted for vision tasks: Vision Transformer (ViT), DeiT, Swin Transformer, and modern design choices including activation functions, positional encodings, and token design.

Supplementary Notes

SwiGLU in Vision Transformers

Supplement: Why GELU over SwiGLU for ViT

SwiGLU works well in NLP Transformers (e.g., LLaMA), but in Vision Transformers it can cause an over-gating problem: the gating mechanism filters out too much information, aggressively suppressing activations. This is related to interactions with LayerScale (a per-channel learnable scaling applied after each residual block) — when combined with SwiGLU's gating, the effective signal can be overly attenuated.

For this reason, GELU remains the recommended activation for Vision Transformers. The over-gating issue is an active area of investigation.
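The contrast between the two feed-forward variants can be sketched in NumPy. This is a minimal illustration, not any particular model's implementation; the weight shapes and names are hypothetical. The point to notice is that SwiGLU's output is a product of two branches, so near-zero gate values suppress channels entirely, which is the over-gating behavior described above.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in the original BERT/ViT codebases
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn_gelu(x, W1, W2):
    # Standard ViT feed-forward block: expand -> GELU -> project back
    return gelu(x @ W1) @ W2

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn_swiglu(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward: silu(x @ W_gate) multiplicatively gates x @ W_up.
    # Gate values near zero shut channels off completely ("over-gating").
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d, h = 8, 32
x = rng.standard_normal((4, d))
y_gelu = ffn_gelu(x, rng.standard_normal((d, h)), rng.standard_normal((h, d)))
y_swiglu = ffn_swiglu(x, rng.standard_normal((d, h)),
                      rng.standard_normal((d, h)), rng.standard_normal((h, d)))
print(y_gelu.shape, y_swiglu.shape)  # (4, 8) (4, 8)
```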

Q&A from Lecture

1. Is [CLS] Token Sufficient for Global Representation?

Q: Can a single [CLS] token capture enough global information for downstream tasks?

For classification: yes. Image classification compresses the entire image into a single label, so one global feature vector is sufficient. [CLS] and GAP perform similarly for this task. The more practical concern, as discussed in lecture, is the alignment problem between [CLS] and patch tokens during training.
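Concretely, the two classification heads differ only in how one feature vector is read out of the encoder's token sequence. A minimal sketch (shapes hypothetical; a ViT-style sequence of 1 + 196 tokens is assumed):

```python
import numpy as np

# Output of a ViT encoder: [CLS] token followed by N patch tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 1 + 196, 64))  # (batch, 1 + N, dim)

cls_feature = tokens[:, 0]           # (2, 64): take the [CLS] token
gap_feature = tokens[:, 1:].mean(1)  # (2, 64): global average pool over patches

# Either vector feeds a linear classification head; for plain image
# classification the two readouts perform similarly, as noted above.
print(cls_feature.shape, gap_feature.shape)  # (2, 64) (2, 64)
```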

For dense prediction: no. Tasks like detection and segmentation require spatial information that a single [CLS] token cannot provide. Different models handle this differently:

CLIP uses only [CLS] (it is designed for classification and retrieval). SAM discards [CLS] and uses dense patch-token features. DINOv2 uses [CLS] for classification but patch tokens for dense tasks. This will be explored further in Week 6 (Detection + Segmentation).

2. Register Tokens

Q: What are register tokens and why are they needed?

Vision Transformers Need Registers (Darcet et al., ICLR 2024) identified an artifact in ViTs like DINOv2 and CLIP: background patches with relatively low information content develop abnormally high norms. This happens because the model repurposes unused background patches as internal memory to store global information — essentially recycling low-information patches as scratch space.

This causes artifacts in attention map visualization and degrades dense prediction performance. The solution is to add a few (typically 4) learnable tokens — similar to [CLS] — that serve as dedicated global information storage. At inference, these register tokens are discarded; only [CLS] and patch tokens are used.

With registers, patch tokens can focus on their intended role (local feature encoding), producing cleaner attention maps and better dense prediction. DINOv2 provides separate model checkpoints with and without registers.
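Mechanically, registers are just extra learnable tokens concatenated to the input sequence and dropped at readout. A minimal sketch under assumed shapes (the actual DINOv2 implementation differs in detail):

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 196, 64
num_registers = 4  # the paper typically uses 4

patch_tokens = rng.standard_normal((B, N, D))
# [CLS] and register tokens are learnable parameters, broadcast over the batch.
cls_token = np.broadcast_to(rng.standard_normal((1, 1, D)), (B, 1, D))
registers = np.broadcast_to(rng.standard_normal((1, num_registers, D)),
                            (B, num_registers, D))

# Input to the Transformer blocks: [CLS] + registers + patches.
x = np.concatenate([cls_token, registers, patch_tokens], axis=1)
assert x.shape == (B, 1 + num_registers + N, D)

# ... x would pass through the encoder blocks here ...

# At readout the registers are simply dropped: keep [CLS] and patch tokens.
cls_out = x[:, 0]
patch_out = x[:, 1 + num_registers:]
print(cls_out.shape, patch_out.shape)  # (2, 64) (2, 196, 64)
```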

3. Combining [CLS] with GAP or Patch Tokens

Q: Can [CLS] and GAP (or patch tokens) be used together?

Simple concatenation: Yes — concatenating [CLS] with GAP (or max pooling) of patch tokens is used in models like Sentence-BERT when you want both global and local information. This is a straightforward way to combine both signals.
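The concatenation approach is a one-liner; a sketch with hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 1 + 196, 64))  # [CLS] + 196 patch tokens

cls_vec = tokens[:, 0]          # global signal
gap_vec = tokens[:, 1:].mean(1) # pooled local signal (or .max(1) for max pooling)

# Concatenate global and local signals into one doubled-width feature.
combined = np.concatenate([cls_vec, gap_vec], axis=-1)
print(combined.shape)  # (2, 128)
```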

Learnable pooling (MAP): Beyond simple GAP, Multihead Attention Pooling (MAP) uses a learnable query vector that performs cross-attention with all patch tokens. SigLIP and SigLIP 2 use this approach (the original SigLIP paper doesn't describe it explicitly, since it builds on the authors' prior work).
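The core of attention pooling can be sketched as follows. This is a single-head simplification of MAP (the real version is multi-head with an MLP); all parameter names are illustrative, and in practice the query and projections would be trained rather than random.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patches, query, Wk, Wv):
    # A learnable query cross-attends over all patch tokens and returns
    # one pooled vector per image (single-head simplification of MAP).
    K = patches @ Wk                             # (B, N, D)
    V = patches @ Wv                             # (B, N, D)
    scores = (K @ query) / np.sqrt(K.shape[-1])  # (B, N)
    attn = softmax(scores, axis=-1)              # attention weights over patches
    return np.einsum('bn,bnd->bd', attn, V)      # (B, D)

rng = np.random.default_rng(0)
B, N, D = 2, 196, 64
patches = rng.standard_normal((B, N, D))
pooled = attention_pool(patches,
                        rng.standard_normal(D),       # learnable query vector
                        rng.standard_normal((D, D)),  # key projection
                        rng.standard_normal((D, D)))  # value projection
print(pooled.shape)  # (2, 64)
```

Unlike GAP, the query learns *which* patches to weight, so the pooled feature is task-adaptive rather than uniform.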

Late [CLS] injection (CaiT): CaiT (Touvron et al., 2021) identifies a problem with the standard ViT design: the same attention parameters must simultaneously handle two different jobs — modeling patch-to-patch relationships and aggregating classification information into [CLS]. CaiT separates these by running self-attention among patches only in early layers, then inserting [CLS] in the final 2 layers to read from the patch tokens via cross-attention.
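The two-stage CaiT schedule can be sketched with toy single-head attention (real CaiT adds FFNs, LayerScale, and multi-head attention; the shapes and weight initialization here are hypothetical):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(x, W):
    # Toy single-head attention; W[0..2] stand in for the Q/K/V projections.
    q, k, v = x @ W[0], x @ W[1], x @ W[2]
    a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1]))
    return a @ v

rng = np.random.default_rng(0)
B, N, D, depth = 2, 196, 64, 12
x = rng.standard_normal((B, N, D))
Ws = rng.standard_normal((depth, 3, D, D)) * 0.05

# Stage 1 (self-attention stage): patches attend only to patches; no [CLS].
for i in range(depth - 2):
    x = x + self_attn(x, Ws[i])

# Stage 2 (class-attention stage): [CLS] is inserted only in the last 2
# layers and reads from the patch tokens; patch tokens are not updated here.
cls = np.zeros((B, 1, D))
for i in range(depth - 2, depth):
    q = cls @ Ws[i][0]
    kv_in = np.concatenate([cls, x], axis=1)
    k, v = kv_in @ Ws[i][1], kv_in @ Ws[i][2]
    a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D))
    cls = cls + a @ v

print(cls.shape)  # (2, 1, 64)
```

The separation means the early attention weights only ever model patch-to-patch structure, while the final layers specialize in aggregation into [CLS].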

4. Recent Research on [CLS]–Patch Token Interaction

Q: Are there recent works that further explore how [CLS] and patch tokens should interact?

Revisiting [CLS] and Patch Token Interaction in Vision Transformers (Marouani et al., ICLR 2026) takes the CaiT insight further. Rather than only separating [CLS] injection temporally (early vs. late layers), this work separates the processing itself: LayerNorm, LayerScale, and QKV projections are given separate parameters for [CLS] and patch tokens in approximately the first third of the blocks.

Crucially, [CLS] and patch tokens still interact during the attention computation itself, so information exchange is preserved. But the pre- and post-processing around attention is fully decoupled. This recognizes that [CLS] (global summary) and patch tokens (local features) have fundamentally different roles and benefit from separate parameterization.
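The decoupling can be illustrated with LayerNorm, the simplest of the separated components. This sketch is my own, not the paper's code; parameter names are illustrative. Each token type gets its own normalization parameters, but all tokens then enter a single shared attention computation together:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
B, N, D = 2, 196, 64
tokens = rng.standard_normal((B, 1 + N, D))  # [CLS] + patch tokens

# Separate LayerNorm parameters for [CLS] vs patch tokens: the decoupled
# pre-processing described above (QKV and LayerScale are split analogously).
g_cls, b_cls = np.ones(D), np.zeros(D)
g_patch, b_patch = np.ones(D), np.zeros(D)

normed = np.concatenate([
    layer_norm(tokens[:, :1], g_cls, b_cls),    # [CLS] path
    layer_norm(tokens[:, 1:], g_patch, b_patch) # patch path
], axis=1)

# After the decoupled pre-processing, the full sequence would enter one
# shared attention computation, preserving [CLS]-patch interaction.
print(normed.shape)  # (2, 197, 64)
```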

Further Reading

Course Notes & Lectures

Key Papers

Papers Discussed in Q&A