
Week 3: Convolutional Neural Networks

Lecture Notes and Q&A

Lecture Overview

This lecture covers CNNs in four parts: (1) why fully-connected networks aren't enough for images and how CNN inductive biases solve this, (2) the convolution operation and related components, (3) the evolution of CNN architectures from AlexNet to ConvNeXt, and (4) what CNNs learn and how to leverage pretrained models.

Q&A from Lecture

Translation Equivariance vs. Invariance

Q: What is the difference between translation equivariance and translation invariance?

Translation equivariance means that if the input shifts, the output shifts correspondingly. Convolution provides this: if a cat moves 10 pixels to the right in the input, the feature map activation also moves 10 pixels to the right. Formally, f(translate(x)) = translate(f(x)).

Translation invariance means that the output stays the same regardless of where the object is. Pooling layers provide a degree of invariance to small spatial shifts — within the pooling window, slight position changes produce the same pooled output.

In summary: convolution gives equivariance, and pooling adds partial invariance on top of that.
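The two properties can be checked numerically. Below is a minimal 1D sketch (my own illustration, not from the lecture): a hand-rolled valid cross-correlation shows that shifting the input shifts the output identically (equivariance), while a global max pool on top discards position entirely (invariance).

```python
import numpy as np

def conv1d_valid(x, k):
    """Valid cross-correlation of signal x with kernel k (no padding)."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

# A signal with a "feature" (a small bump) at position 10.
x = np.zeros(32)
x[10:13] = [1.0, 2.0, 1.0]
k = np.array([1.0, 2.0, 1.0])  # a matching filter

y = conv1d_valid(x, k)
y_shifted = conv1d_valid(np.roll(x, 5), k)  # shift the input by 5

# Equivariance: shifting the input shifts the output by the same amount.
assert np.allclose(np.roll(y, 5), y_shifted)

# Invariance via (global max) pooling: position no longer matters.
assert np.max(y) == np.max(y_shifted)
```

The same argument carries over to 2D convolutions, with real pooling layers giving only partial invariance because their windows are local rather than global.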

Q: Does "translation" include flip and rotation, or only shift?

Translation refers to spatial shift only (moving the image left/right/up/down). Flip and rotation are separate geometric transformations. Standard CNNs are not inherently invariant to rotation or flipping — this is actually a known limitation of CNNs. In practice, this is addressed through data augmentation (random flips, rotations during training) rather than through the architecture itself.

Separable Convolution and Factorization

Q: How does separable convolution reduce computation?

A standard convolution has cost proportional to C_in · C_out · K² — a product of three terms. Depthwise separable convolution factorizes this into two steps: depthwise conv (C_in · K²) + pointwise conv (C_in · C_out). The three-way product becomes a sum of two-way products: C_in · K² + C_in · C_out.
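The savings are easy to quantify. A quick sketch (counting multiply-accumulates per output spatial position, which is sufficient for the ratio since the H·W factor cancels):

```python
def standard_conv_cost(c_in, c_out, k):
    """MACs per output position for a standard K×K convolution."""
    return c_in * c_out * k * k

def separable_conv_cost(c_in, c_out, k):
    """Depthwise (one K×K filter per input channel) + pointwise (1×1)."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, k = 128, 128, 3
std = standard_conv_cost(c_in, c_out, k)   # 147,456
sep = separable_conv_cost(c_in, c_out, k)  # 17,536
print(f"speedup: {std / sep:.1f}x")        # roughly 8.4x cheaper
```

For a 3×3 kernel the speedup approaches K² = 9 as the channel counts grow, since the pointwise term dominates the sum.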

Each individual separable block is slightly weaker than a standard conv, but because it's much cheaper, you can stack more layers within the same computational budget. This is the same logic as ResNet's bottleneck block: a 3-layer bottleneck (1×1 reduce → 3×3 → 1×1 expand, with a 4× channel reduction, so inner width C and outer width 4C) costs about 17HWC² FLOPs, versus 18HWC² for a 2-layer basic block of two 3×3 convs at width C — more layers, fewer operations, better performance.
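The 17-vs-18 figures can be verified directly. A sketch assuming ResNet-50's 256→64→64→256 bottleneck (inner width C = 64, outer width 4C), counting MACs per output position:

```python
# Verify the per-position FLOP counts (in units of C²) for ResNet blocks.
C = 64  # bottleneck inner width; outer width is 4C = 256, as in ResNet-50

# Basic block: two 3×3 convs at width C.
basic = 2 * (3 * 3 * C * C)                             # = 18·C²

# Bottleneck block: 1×1 (4C→C), 3×3 (C→C), 1×1 (C→4C).
bottleneck = (4 * C) * C + 3 * 3 * C * C + C * (4 * C)  # = 17·C²

assert basic == 18 * C * C
assert bottleneck == 17 * C * C
```

So the bottleneck processes features at 4× the outer width for marginally fewer operations — the factorization buys depth and width essentially for free.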

This principle of factorizing expensive operations appears beyond CNNs as well — similar ideas show up in efficient attention mechanisms (e.g., linear attention, grouped query attention).

Stacking Linear Layers

Q: If stacking linear layers without activation is mathematically equivalent to a single linear layer, why can it still help in practice?

Mathematically, W₂(W₁x) = (W₂W₁)x, so the expressivity is identical. However, the optimization landscape differs. When learning W = W₂W₁ in factored form, gradient descent exhibits an implicit bias toward low-rank solutions — large singular values grow faster while small ones stay near zero. This acts as a form of implicit regularization, similar to nuclear norm regularization.

In short: same expressivity, but the factored form reaches simpler, potentially better-generalizing solutions. See Arora et al. (2019), "Implicit Regularization in Deep Matrix Factorization" for theoretical analysis.
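The low-rank bias is easy to observe in a toy experiment. Below is a sketch (my own construction, not from Arora et al.): gradient descent fits a diagonal target matrix either directly (W) or in factored form (W₂W₁) from small initialization. With the same step count, the direct fit recovers all singular values, while the factored fit recovers the large ones and leaves the small ones near zero.

```python
import numpy as np

T = np.diag([5.0, 3.0, 0.1, 0.01])  # target with a decaying spectrum
lr, steps = 0.01, 500

# Factored parameterization W = W2 @ W1, small (balanced) initialization.
W1 = 0.01 * np.eye(4)
W2 = 0.01 * np.eye(4)
for _ in range(steps):
    G = W2 @ W1 - T                 # residual of the squared-error loss
    W1, W2 = W1 - lr * 2 * W2.T @ G, W2 - lr * 2 * G @ W1.T

# Direct parameterization, same loss and step count.
W = 0.01 * np.eye(4)
for _ in range(steps):
    W = W - lr * 2 * (W - T)

s_fact = np.linalg.svd(W2 @ W1, compute_uv=False)
s_dir = np.linalg.svd(W, compute_uv=False)

assert abs(s_fact[0] - 5.0) < 0.1 and abs(s_fact[1] - 3.0) < 0.1
assert s_fact[2] < 0.01             # small singular values barely move
assert abs(s_dir[2] - 0.1) < 0.01   # direct fit recovers them all
```

The growth rate of each singular value in the factored form is proportional to its current magnitude, so large directions are learned exponentially faster — which is exactly the implicit regularization described above.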

ConvNeXt Design Choices

Q: What is "stage ratio" in ConvNeXt's macro design?

Stage ratio refers to the number of blocks allocated to each stage. ResNet-50 uses (3, 4, 6, 3), while Swin Transformer uses a roughly 1:1:3:1 ratio. ConvNeXt adopted (3, 3, 9, 3), concentrating more computation in Stage 3 (the mid-resolution stage). This change alone improved accuracy from 78.8% to 79.4% on ImageNet-1K.

Q: What do "fewer activations" and "fewer norms" mean in ConvNeXt?

In a standard ResNet block, every conv layer is followed by both BatchNorm and ReLU: Conv→BN→ReLU→Conv→BN→ReLU. Transformer blocks, by contrast, have only one norm at the block start and one activation inside the MLP.

ConvNeXt follows the Transformer pattern: GELU activation is reduced from two to one (only after the first 1×1 conv), and LayerNorm is placed only after the depthwise conv. Each of these changes individually improved accuracy (81.3% and 81.4% respectively).

This might seem to contradict the idea that "non-linearity is essential" (since stacking conv without activation collapses to a single linear operation). The resolution is: non-linearity is necessary, but excess non-linearity can hinder optimization. Placing activations only where they matter most — as Transformers do — turns out to be more effective.
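The contrast in layer layout can be summarized as data. A small illustrative sketch (layer names are my own shorthand, not library identifiers), counting norms and activations per block:

```python
# Per-block layer sequences as described above (illustrative, not runnable models).
resnet_block = ["conv3x3", "BN", "ReLU", "conv3x3", "BN", "ReLU"]
convnext_block = ["dwconv7x7", "LN", "conv1x1_expand", "GELU", "conv1x1_reduce"]

def count(block, kinds):
    return sum(layer in kinds for layer in block)

ACTS, NORMS = {"ReLU", "GELU"}, {"BN", "LN"}

assert count(resnet_block, ACTS) == 2    # activation after every conv
assert count(resnet_block, NORMS) == 2   # norm after every conv
assert count(convnext_block, ACTS) == 1  # single GELU, after the expanding 1×1
assert count(convnext_block, NORMS) == 1 # single LayerNorm, after the depthwise conv
```

This is the Transformer-block pattern transplanted into a conv block: one norm near the start, one non-linearity inside the inverted-bottleneck MLP.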

CAM and Grad-CAM

Q: If Grad-CAM doesn't require GAP in the architecture, why does the explanation involve averaging?

CAM requires a GAP layer to be part of the model architecture — it uses the FC weights after GAP to weight the feature maps. This limits it to specific architectures (e.g., GoogLeNet, ResNet with GAP).

Grad-CAM removes this architectural constraint. Instead, it computes gradients of the class score with respect to the last conv layer's feature maps, then performs spatial averaging on those gradients as a post-processing step (not as a model component). This averaging produces per-channel importance weights (α_c), which are then used to create a weighted sum of feature maps. A final ReLU keeps only positively contributing regions.

In short: CAM needs GAP in the model; Grad-CAM does the averaging outside the model, so it works with any CNN.
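The post-processing step is small enough to write out. A numpy sketch of the Grad-CAM weighting (assuming you have already extracted the last conv layer's feature maps and the gradients of the class score with respect to them, both shaped (C, H, W)):

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """Grad-CAM heatmap from conv activations and class-score gradients,
    both shaped (C, H, W)."""
    # Spatial averaging of gradients → one importance weight α_c per channel.
    alpha = grads.mean(axis=(1, 2))                  # (C,)
    # Weighted sum of feature maps over channels.
    cam = np.tensordot(alpha, feature_maps, axes=1)  # (H, W)
    # ReLU keeps only regions that contribute positively to the class.
    return np.maximum(cam, 0.0)

# Toy check: a channel with positive gradient and a localized activation
# should dominate the heatmap; a negatively weighted channel is suppressed.
rng = np.random.default_rng(0)
A = np.zeros((2, 4, 4))
A[0, 1, 1] = 1.0           # channel 0 fires at position (1, 1)
A[1] = rng.random((4, 4))  # channel 1 is noise
G = np.zeros((2, 4, 4))
G[0], G[1] = 1.0, -1.0     # class score rises with ch. 0, falls with ch. 1

heat = grad_cam(A, G)
assert heat.shape == (4, 4)
assert heat[1, 1] == heat.max()
```

Note that nothing here touches the model's architecture — the averaging lives entirely in this function, which is why Grad-CAM works with any CNN.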
