Lecture Overview
This lecture traces the evolution from RNNs to Transformers: sequence modeling with RNNs and LSTMs, the introduction of the attention mechanism, and how the Transformer architecture replaced recurrence with self-attention entirely.
Q&A from Lecture
Origin of Query, Key, Value Terminology
The terminology has roots in the Memory Network literature that predates the Transformer. The evolution went roughly: RNN (~1990) → Attention (Bahdanau et al., 2014) → Transformer (Vaswani et al., 2017), and several papers in between introduced these terms incrementally.
End-to-End Memory Networks (Sukhbaatar et al., 2015) introduced the term "query" in the context of QA tasks, where a question is used to address stored memories.
Key-Value Memory Networks (Miller et al., 2016) explicitly introduced "key addressing" and "value reading" — the idea that you use keys to determine where to look, and values to determine what to retrieve.
The Transformer (Vaswani et al., 2017) generalized this into the Q, K, V formulation we know today, defining attention as a general-purpose mechanism: Attention(Q, K, V) = softmax(QKᵀ/√d)V. Notably, the Transformer paper cites End-to-End Memory Networks but does not cite Key-Value Memory Networks, so whether the K/V terminology was a direct influence or an independent convergence remains unclear.
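The formula above can be sketched in a few lines of NumPy. This is an illustrative single-head version (shapes and the row-wise softmax are the only assumptions; no masking or multi-head projections):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V
    d = Q.shape[-1]                                      # key/query dimension
    scores = Q @ K.T / np.sqrt(d)                        # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value rows

# Toy shapes: 3 queries, 4 key/value pairs, dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query
```

Note how the key/value split from Miller et al. survives intact: K decides where each query attends, V decides what gets mixed into the output.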
Why 224×224? The AlexNet Typo That Became a Standard
This is one of the more entertaining pieces of deep learning history. The original AlexNet paper (Krizhevsky et al., 2012) states the input size as 224×224×3. However, AlexNet's first layer uses kernel=11, stride=4, no padding, which gives (224 - 11) / 4 + 1 = 54.25. That is not an integer, so a 224×224 input doesn't actually work with that layer.
The official implementation (in Caffe at the time) actually used 227×227, which gives (227 - 11) / 4 + 1 = 55. So 224 in the paper was simply a typo.
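The arithmetic is easy to verify with the standard conv output-size formula (the helper name here is just for illustration):

```python
def conv_out(size, kernel, stride, padding=0):
    # Spatial output size of a conv layer: (W - K + 2P) / S + 1
    return (size - kernel + 2 * padding) / stride + 1

print(conv_out(224, 11, 4))  # 54.25 -- not an integer, so 224 can't be the real input size
print(conv_out(227, 11, 4))  # 55.0  -- the size the implementation actually used
```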
Ironically, 224 turned out to be a better choice for later architectures. Since VGGNet and ResNet downsample by a factor of 2 repeatedly (via stride-2 conv or 2×2 pooling), 224 divides cleanly: 224 → 112 → 56 → 28 → 14 → 7. So VGGNet and ResNet adopted 224, and the typo became the standard.
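The clean halving chain is quick to check (a trivial sketch; 7 is the stopping size because ResNet ends with a 7×7 feature map before global pooling):

```python
size = 224
chain = [size]
while size % 2 == 0 and size > 7:
    size //= 2          # each stride-2 conv or 2x2 pool halves the spatial size
    chain.append(size)
print(chain)  # [224, 112, 56, 28, 14, 7]
```

Starting from 227 instead, the very first halving already produces a non-integer size, which is why 224 won out for VGG-style architectures.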
Further Reading
Course Notes & Lectures
- Video Michigan EECS 498 L12: Recurrent Networks — RNN, LSTM, GRU, language modeling, image captioning
- Video Michigan EECS 498 L13: Attention — Multimodal attention, self-attention, Transformers
- Blog The Illustrated Transformer — Jay Alammar's visual guide to the Transformer architecture
- Blog Understanding LSTM Networks — Chris Olah's classic visual explanation of LSTMs
- Blog Transformers from Scratch — Peter Bloem's detailed implementation walkthrough
Key Papers
- Paper Neural Machine Translation by Jointly Learning to Align and Translate — Bahdanau et al. (2014). The paper that introduced attention for sequence-to-sequence models.
- Paper Attention Is All You Need — Vaswani et al. (NeurIPS 2017). The Transformer paper.
- Paper End-To-End Memory Networks — Sukhbaatar et al. (NeurIPS 2015). Introduced "query" terminology for memory addressing.
- Paper Key-Value Memory Networks for Directly Reading Documents — Miller et al. (EMNLP 2016). Introduced key/value separation for memory access.