Lecture Overview
This lecture covers CNNs in four parts: (1) why fully-connected networks aren't enough for images and how CNN inductive biases solve this, (2) the convolution operation and related components, (3) the evolution of CNN architectures from AlexNet to ConvNeXt, and (4) what CNNs learn and how to leverage pretrained models.
Q&A from Lecture
Translation Equivariance vs. Invariance
Translation equivariance means that if the input shifts, the output shifts correspondingly. Convolution provides this: if a cat moves 10 pixels to the right in the input, the feature map activation also moves 10 pixels to the right. Formally, f(translate(x)) = translate(f(x)).
Translation invariance means that the output stays the same regardless of where the object is. Pooling layers provide a degree of invariance to small spatial shifts — within the pooling window, slight position changes produce the same pooled output.
In summary: convolution gives equivariance, and pooling adds partial invariance on top of that.
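Both properties can be checked numerically. The sketch below is a minimal 1-D illustration in NumPy (the signal and filter values are illustrative, not from the lecture): cross-correlation commutes with a shift, while global max pooling absorbs the shift entirely.

```python
import numpy as np

def conv1d(x, w):
    """Valid cross-correlation of 1-D signal x with filter w."""
    K = len(w)
    return np.array([np.dot(x[i:i + K], w) for i in range(len(x) - K + 1)])

def shift(x, s):
    """Shift x right by s positions, zero-padding on the left."""
    return np.concatenate([np.zeros(s), x[:len(x) - s]])

x = np.array([0., 0., 1., 3., 1., 0., 0., 0.])
w = np.array([1., 2., 1.])

# Equivariance: conv(shift(x)) == shift(conv(x)), away from the padded border.
lhs = conv1d(shift(x, 2), w)
rhs = shift(conv1d(x, w), 2)
print(np.allclose(lhs, rhs))  # True

# Invariance: max pooling over the whole output ignores the shift.
print(conv1d(x, w).max() == conv1d(shift(x, 2), w).max())  # True
```

The same check in 2-D would show the cat-moves-10-pixels example from above: the feature map peak moves with the cat, but a pooled maximum does not.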
Translation refers to spatial shift only (moving the image left/right/up/down). Flip and rotation are separate geometric transformations. Standard CNNs are not inherently invariant to rotation or flipping — this is actually a known limitation of CNNs. In practice, this is addressed through data augmentation (random flips, rotations during training) rather than through the architecture itself.
Separable Convolution and Factorization
A standard convolution has cost proportional to C_in · C_out · K² — a product of three terms. Depthwise separable convolution factorizes this into two steps: depthwise conv (C_in · K²) + pointwise conv (C_in · C_out). The three-way product becomes a sum of two-way products: C_in · K² + C_in · C_out.
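A quick sanity check on those counts, with illustrative channel and kernel sizes (the values are assumptions for the example, not from the lecture):

```python
# Multiply counts per output position: standard vs. depthwise separable conv.
C_in, C_out, K = 128, 256, 3  # illustrative sizes

standard  = C_in * C_out * K**2      # three-way product
depthwise = C_in * K**2              # one KxK filter per input channel
pointwise = C_in * C_out             # 1x1 conv mixing channels
separable = depthwise + pointwise    # sum of two-way products

print(standard, separable)  # 294912 33920, roughly a K^2-fold saving
```

When C_out is much larger than K², the ratio approaches K² (here about 8.7× for K = 3).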
Each individual separable block is slightly weaker than a standard conv, but because it's much cheaper, you can stack more layers within the same computational budget. This is the same logic as ResNet's bottleneck block: a 3-layer bottleneck costs about 17HWC² FLOPs versus about 18HWC² for a 2-layer basic block. The result is more layers for fewer operations, and better performance.
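The bottleneck arithmetic works out as follows. Here C is the bottleneck width and 4C the block's input/output width, so the printed numbers are the coefficients of HWC²:

```python
# Multiply counts per block, as coefficients of H*W*C^2.
# A KxK conv from c_in to c_out channels costs H*W*c_in*c_out*K^2.
C = 1  # symbolic unit, so results come out as coefficients of C^2

basic      = 2 * (3 * 3 * C * C)                     # two 3x3 convs, C -> C
bottleneck = 4 * C * C + 3 * 3 * C * C + 4 * C * C   # 1x1 4C->C, 3x3 C->C, 1x1 C->4C

print(basic, bottleneck)  # 18 17
```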
This principle of factorizing expensive operations appears beyond CNNs as well — similar ideas show up in efficient attention mechanisms (e.g., linear attention, grouped query attention).
Stacking Linear Layers
Mathematically, W₂(W₁x) = (W₂W₁)x, so the expressivity is identical. However, the optimization landscape differs. When learning W = W₂W₁ in factored form, gradient descent exhibits an implicit bias toward low-rank solutions — large singular values grow faster while small ones stay near zero. This acts as a form of implicit regularization, similar to nuclear norm regularization.
In short: same expressivity, but the factored form reaches simpler, potentially better-generalizing solutions. See Arora et al. (2019), "Implicit Regularization in Deep Matrix Factorization" for theoretical analysis.
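The expressivity claim is easy to verify numerically. The sketch below uses random matrices as a stand-in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 32))  # first factor
W2 = rng.standard_normal((8, 16))   # second factor
x  = rng.standard_normal(32)

# Two stacked linear layers compute the same map as their matrix product:
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True

# The composed map's rank is bounded by the narrowest dimension involved;
# implicit regularization then pushes the effective rank lower still.
print(np.linalg.matrix_rank(W2 @ W1))  # at most min(8, 16, 32) = 8
```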
ConvNeXt Design Choices
Stage ratio refers to the number of blocks allocated to each stage. ResNet-50 uses (3, 4, 6, 3), while Swin Transformer uses a roughly 1:1:3:1 ratio. ConvNeXt adopted (3, 3, 9, 3), concentrating more computation in Stage 3 (the mid-resolution stage). This change alone improved accuracy from 78.8% to 79.4% on ImageNet-1K.
In a standard ResNet block, every conv layer is followed by both BatchNorm and ReLU: Conv→BN→ReLU→Conv→BN→ReLU. Transformer blocks, by contrast, have only one norm at the block start and one activation inside the MLP.
ConvNeXt follows the Transformer pattern: GELU activation is reduced from two to one (only after the first 1×1 conv), and LayerNorm is placed only after the depthwise conv. Each of these changes individually improved accuracy (81.3% and 81.4% respectively).
This might seem to contradict the idea that "non-linearity is essential" (since stacking conv without activation collapses to a single linear operation). The resolution is: non-linearity is necessary, but excess non-linearity can hinder optimization. Placing activations only where they matter most — as Transformers do — turns out to be more effective.
CAM and Grad-CAM
CAM requires a GAP layer to be part of the model architecture — it uses the FC weights after GAP to weight the feature maps. This limits it to specific architectures (e.g., GoogLeNet, ResNet with GAP).
Grad-CAM removes this architectural constraint. Instead, it computes gradients of the class score with respect to the last conv layer's feature maps, then performs spatial averaging on those gradients as a post-processing step (not as a model component). This averaging produces per-channel importance weights (α_c), which are then used to create a weighted sum of feature maps. A final ReLU keeps only positively contributing regions.
In short: CAM needs GAP in the model; Grad-CAM does the averaging outside the model, so it works with any CNN.
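The Grad-CAM weighting step itself fits in a few lines of NumPy. In this sketch, `feature_maps` and `grads` stand in for the last conv layer's activations and the backpropagated gradients of the class score; both are toy inputs, not produced by a real network:

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """feature_maps, grads: arrays of shape (C, H, W).

    grads holds d(class score)/d(feature map); its spatial average gives
    one importance weight alpha_c per channel.
    """
    alphas = grads.mean(axis=(1, 2))                           # (C,)
    cam = np.tensordot(alphas, feature_maps, axes=([0], [0]))  # weighted sum -> (H, W)
    return np.maximum(cam, 0.0)                                # ReLU: keep positive evidence

# Toy example: channel 0 fires on the left half, channel 1 on the right;
# the gradients say only channel 0 supports the class.
fmap = np.zeros((2, 4, 4))
fmap[0, :, :2] = 1.0
fmap[1, :, 2:] = 1.0
grads = np.stack([np.ones((4, 4)), -np.ones((4, 4))])

heatmap = grad_cam(fmap, grads)
print(heatmap[:, :2].sum() > 0, heatmap[:, 2:].sum() == 0)  # True True
```

The left half lights up and the negatively-contributing right half is zeroed out by the final ReLU, matching the description above.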
Further Reading
Course Notes & Lectures
- Notes CS231n: Convolutional Neural Networks — Comprehensive notes on conv layers, pooling, and architectures
- Video Michigan EECS 498 L7: Convolutional Networks — Justin Johnson's detailed walkthrough of convolution operations
- Video Michigan EECS 498 L8: CNN Architectures — AlexNet, VGG, GoogLeNet, ResNet in depth
Key Papers
- Paper AlexNet — Krizhevsky, Sutskever, Hinton (NeurIPS 2012)
- Paper VGGNet — Simonyan & Zisserman (ICLR 2015)
- Paper ResNet: Deep Residual Learning — He et al. (CVPR 2016)
- Paper ConvNeXt: A ConvNet for the 2020s — Liu et al. (CVPR 2022)
- Paper Grad-CAM — Selvaraju et al. (ICCV 2017)
Supplementary
- Supplement Lecture slides from the previous instructor (Prof. Injung Kim) covering convolution backpropagation, transposed convolution, and convolution as matrix multiplication are available on LMS for students who want deeper operational details.
- Paper Implicit Regularization in Deep Matrix Factorization — Arora et al. (NeurIPS 2019). Explains why factored linear layers reach different solutions than single layers.
- Paper MobileNets: Efficient CNNs for Mobile Vision Applications — Howard et al. (2017). A key paper popularizing depthwise separable convolution.