Understanding the Foundation of Modern NLP
“Attention Is All You Need” revolutionized NLP and laid the foundation for modern LLMs
Authors: Vaswani et al. (2017), Google Brain & Google Research
Key Innovation: Replace recurrent and convolutional layers with self-attention mechanisms
Original Task: Neural Machine Translation
Impact:
- Enabled parallel processing (vs. sequential RNNs)
- Captured long-range dependencies better
- Foundation for GPT, BERT, T5, and modern LLMs
Encoder-Decoder architecture (Vaswani et al., 2017)
Multi-head attention mechanism (Vaswani et al., 2017)
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]
Components:
Think of attention like searching a library:
Query (Q): Your search request
Key (K): Book titles/descriptions on shelves
Value (V): The actual book content
Attention = Reading partially from all books, weighted by how well each Key matches your Query
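To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention (our own illustrative code, not from the paper; function names are ours):

```python
# Minimal sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
# Shapes: Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # how well each Query matches each Key
    weights = softmax(scores, axis=-1)           # each row sums to 1: "how much to read from each book"
    return weights @ V, weights                  # weighted sum of Values

# Toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (3, 8) (3, 4)
```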
Idea: Run attention in parallel with different learned projections
\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O \]
where each head is:
\[ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \]
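A minimal NumPy sketch of the multi-head computation (illustrative only; the projection matrices are random here, whereas in a trained model they are learned):

```python
# Each head runs attention on its own learned projection of the input,
# then the head outputs are concatenated and projected by W^O.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (seq_len, d_model); W_q/W_k/W_v: (h, d_model, d_k); W_o: (h*d_k, d_model)."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)        # head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o       # Concat(head_1, ..., head_h) W^O

rng = np.random.default_rng(1)
d_model, h = 16, 4
d_k = d_model // h
X = rng.normal(size=(5, d_model))                     # 5 token embeddings
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (5, 16)
```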
Problem: Self-attention has no notion of sequence order!
Solution: Add positional encodings to input embeddings
\[ \text{Input}_i = \text{TokenEmb}_i + \text{PosEmb}_i \]
Original approach: Sinusoidal functions (sin, cos)
Alternative: Learned (trainable) positional embeddings
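A minimal NumPy sketch of the original sinusoidal encodings, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use cos of the same angle (function name is ours, not the paper's):

```python
# Sinusoidal positional encodings added to the token embeddings (d_model assumed even).
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

# Input_i = TokenEmb_i + PosEmb_i
token_emb = np.random.default_rng(2).normal(size=(10, 64))   # 10 tokens, d_model = 64
inputs = token_emb + sinusoidal_positional_encoding(10, 64)
print(inputs.shape)  # (10, 64)
```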
1. Masked Self-Attention (Causal)
2. Cross-Attention to Encoder Output
Encoder-Decoder architecture (Vaswani et al., 2017)
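A minimal NumPy sketch of the causal mask used in masked self-attention (illustrative only): position i may only attend to positions ≤ i. Cross-attention reuses the same Attention(Q, K, V), but with Q from the decoder and K, V from the encoder output.

```python
# Causal (masked) self-attention: future positions are masked out before the softmax.
import numpy as np

def causal_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the diagonal = future positions
    scores = np.where(mask, -np.inf, scores)             # masked positions get zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))                              # 4 tokens, d_model = 8
print(np.round(causal_self_attention(X, X, X)[0], 3))    # row 0 only "saw" token 0
```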
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019)
Key Idea: Use only the encoder stack with bidirectional attention
Pre-training tasks:
- Masked Language Modeling (MLM)
- Next Sentence Prediction (NSP)
Variants: RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), DistilBERT (Sanh et al., 2019), CzERT (Sido et al., 2021)
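As an illustration (not part of the original slides), BERT's masked-language-model head can be queried through the Hugging Face transformers library, assuming it is installed and the pretrained `bert-base-uncased` weights can be downloaded:

```python
# Fill-mask sketch: BERT predicts the masked token from both left and right context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The movie was [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))    # top predicted fillers with probabilities
```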
GPT (Generative Pre-trained Transformer) (Radford et al., 2018)
Key Idea: Use only the decoder stack with causal (left-to-right) attention
Pre-training: Next token prediction
Evolution: GPT (Radford et al., 2018) → GPT-2 (Radford et al., 2019) → GPT-3 (Brown et al., 2020) → GPT-4, LLaMA, etc.
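As an illustration (assuming the Hugging Face transformers library is installed), decoder-only generation with GPT-2 predicts each new token from the left context only:

```python
# Text-generation sketch with a decoder-only model (GPT-2).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The Transformer architecture", max_new_tokens=20)
print(out[0]["generated_text"])   # each token is sampled given only the tokens to its left
```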
| Aspect | Encoder-Only (BERT) | Decoder-Only (GPT) |
|---|---|---|
| Attention | Bidirectional | Causal (unidirectional) |
| Best for | Understanding tasks | Generation tasks |
| Examples | Classification | Text generation, completion |
| Training | MLM, NSP | Next token prediction |
| Context | Full sequence | Left context only |
The [CLS] token: introduced in BERT as a special token for sequence-level representation
How it works:
Input: [CLS] The movie was great [SEP]
↓ ↓ ↓ ↓ ↓ ↓
Encoder: Apply bidirectional self-attention
↓ ↓ ↓ ↓ ↓ ↓
Output: [h_CLS] [h_1] [h_2] [h_3] [h_4] [h_5]
Key insight: [CLS] attends to all other tokens → learns holistic representation
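A hedged sketch (assuming the Hugging Face transformers library) of extracting h_CLS; during fine-tuning, a task-specific classification head is trained on top of this vector:

```python
# Extract the [CLS] hidden state from a pretrained BERT encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was great", return_tensors="pt")   # adds [CLS] ... [SEP]
with torch.no_grad():
    outputs = model(**inputs)

h_cls = outputs.last_hidden_state[:, 0]   # hidden state of the [CLS] token, shape (1, 768)
print(h_cls.shape)
```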
BERT fine-tuning for different tasks (Devlin et al., 2019)
Next: We’ll see how these architectures power recommendation systems!