
Week 4: From Sequence Modeling to Transformer

Lecture Notes, Q&A, and Further Reading

Lecture Overview

This lecture traces the evolution from RNNs to Transformers: sequence modeling with RNNs and LSTMs, the introduction of the attention mechanism, and how the Transformer architecture replaced recurrence entirely with self-attention.

Q&A from Lecture

Origin of Query, Key, Value Terminology

Q: Where did the terms Query, Key, and Value come from? Were they introduced in the Transformer paper?

The terminology has roots in the Memory Network literature that predates the Transformer. The evolution went roughly: RNN (~1990) → Attention (Bahdanau et al., 2014) → Transformer (Vaswani et al., 2017), and several papers in between introduced these terms incrementally.

End-to-End Memory Networks (Sukhbaatar et al., 2015) introduced the term "query" in the context of QA tasks, where a question is used to address stored memories.

Key-Value Memory Networks (Miller et al., 2016) explicitly introduced "key addressing" and "value reading" — the idea that you use keys to determine where to look, and values to determine what to retrieve.

The Transformer (Vaswani et al., 2017) generalized this into the Q, K, V formulation we know today, defining attention as a general-purpose mechanism: Attention(Q, K, V) = softmax(QKᵀ/√d)V. Notably, the Transformer paper cites End-to-End Memory Networks but does not cite Key-Value Memory Networks, so whether the K/V terminology was a direct influence or an independent convergence remains unclear.
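The formula above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention (not code from the lecture); the shapes and random inputs are chosen only for demonstration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_q, n_k): similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over the keys
    return weights @ V                   # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))  # 2 queries of dimension d=8
K = rng.normal(size=(5, 8))  # 5 keys
V = rng.normal(size=(5, 8))  # 5 values
out = attention(Q, K, V)
print(out.shape)  # (2, 8): one output vector per query
```

Note how the key/value split from Miller et al. survives in the code: the keys decide the attention weights, and the values are what those weights mix together.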

Why 224×224? The AlexNet Typo That Became a Standard

Q: Why do ImageNet models resize inputs to 224×224 (or sometimes 227×227)?

This is one of the more entertaining pieces of deep learning history. The original AlexNet paper (Krizhevsky et al., 2012) states the input size as 224×224×3. However, AlexNet's first layer uses kernel=11, stride=4, no padding, which gives (224 - 11) / 4 + 1 = 54.25 — this doesn't produce an integer, so 224 doesn't actually work.

The official implementation (in Caffe at the time) actually used 227×227, which gives (227 - 11) / 4 + 1 = 55. So 224 in the paper was simply a typo.

Ironically, 224 turned out to be a better choice for later architectures. Since VGGNet and ResNet downsample by a factor of 2 repeatedly (via stride-2 conv or 2×2 pooling), 224 divides cleanly: 224 → 112 → 56 → 28 → 14 → 7. So VGGNet and ResNet adopted 224, and the typo became the standard.
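The arithmetic behind both claims is easy to check. A small sketch (not from the lecture) using the standard convolution output-size formula, floor((size + 2·padding − kernel) / stride) + 1:

```python
def conv_out(size, kernel, stride, padding=0):
    # Standard conv arithmetic: floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# AlexNet's first layer: kernel=11, stride=4, no padding.
print((224 - 11) / 4 + 1)    # 54.25 -- not an integer, so 224 can't be right
print(conv_out(227, 11, 4))  # 55    -- the size the Caffe implementation used

# 224 halves cleanly five times, which is why VGGNet and ResNet kept it.
sizes = [224]
for _ in range(5):
    sizes.append(sizes[-1] // 2)
print(sizes)  # [224, 112, 56, 28, 14, 7]
```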
