====== Transformer Architecture ======

The Transformer is a neural network architecture introduced by Vaswani et al. in 2017 that revolutionized sequence modeling by replacing recurrence and convolutions with self-attention mechanisms. It forms the foundation of virtually all modern large language models, including GPT, Claude, Llama, and Gemini.

===== Encoder-Decoder Structure =====

The original Transformer follows an encoder-decoder design for sequence-to-sequence tasks such as machine translation.

  * **Encoder**: A stack of $N$ identical layers, each containing a multi-head self-attention sub-layer and a position-wise feed-forward network. It processes the full input sequence into contextualized representations.
  * **Decoder**: Also $N$ identical layers, each with three sub-layers: masked multi-head self-attention (preventing attention to future positions), multi-head cross-attention over the encoder output, and a position-wise feed-forward network. The decoder generates tokens autoregressively.

Each sub-layer uses a residual connection followed by layer normalization: $\text{LayerNorm}(x + \text{Sublayer}(x))$. This post-norm arrangement is the original design; many later variants apply layer normalization before each sub-layer instead (see Xiong et al. in the references).

===== Positional Encoding =====

Since the Transformer lacks recurrence, positional information is injected via sinusoidal functions:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

These encodings are added to the input embeddings, allowing the model to attend to relative positions, since $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.

===== Self-Attention =====

The core operation is **scaled dot-product attention**:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input. Scaling by $\sqrt{d_k}$ keeps score magnitudes roughly independent of dimension, preventing the softmax from saturating into near one-hot distributions with vanishing gradients.
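A small numerical sketch of why the $\sqrt{d_k}$ factor matters (dimensions and values chosen purely for illustration): without scaling, dot-product scores grow with $d_k$ and the softmax collapses nearly all weight onto a single key.

<code python>
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)       # one query vector
K = torch.randn(10, d_k)   # ten key vectors

raw = K @ q                # unscaled scores: variance grows with d_k
scaled = raw / d_k ** 0.5  # scaled scores: variance roughly 1

p_raw = torch.softmax(raw, dim=-1)
p_scaled = torch.softmax(scaled, dim=-1)

# The unscaled distribution is far more peaked than the scaled one,
# leaving near-zero weight (and gradient) for most keys.
print(f"max weight, unscaled: {p_raw.max().item():.4f}")
print(f"max weight, scaled:   {p_scaled.max().item():.4f}")
</code>

Multiplying all logits by a constant larger than one can only sharpen a softmax, so the unscaled distribution is always at least as peaked as the scaled one.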
**Multi-head attention** runs $h$ parallel attention heads with different learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

===== Feed-Forward Network =====

Each layer contains a position-wise FFN applied identically to every token:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The inner dimension is typically $d_{ff} = 4 \cdot d_{model}$, and the ReLU provides nonlinearity and per-position transformation capacity.

===== Architecture Diagram =====

<code>
graph TB
    Input["Input Tokens"] --> Embed["Token Embedding + Positional Encoding"]
    Embed --> Enc1["Encoder Layer 1"]
    Enc1 --> EncN["Encoder Layer N"]
    subgraph EncoderLayer["Encoder Layer"]
        SA["Multi-Head Self-Attention"] --> AN1["Add & LayerNorm"]
        AN1 --> FFN["Feed-Forward Network"]
        FFN --> AN2["Add & LayerNorm"]
    end
    EncN --> CrossAttn
    Output["Output Tokens (shifted)"] --> DEmbed["Token Embedding + Positional Encoding"]
    DEmbed --> Dec1["Decoder Layer 1"]
    Dec1 --> DecN["Decoder Layer N"]
    subgraph DecoderLayer["Decoder Layer"]
        MSA["Masked Self-Attention"] --> DAN1["Add & LayerNorm"]
        DAN1 --> CrossAttn["Cross-Attention over Encoder"]
        CrossAttn --> DAN2["Add & LayerNorm"]
        DAN2 --> DFFN["Feed-Forward Network"]
        DFFN --> DAN3["Add & LayerNorm"]
    end
    DecN --> Linear["Linear + Softmax"]
    Linear --> Pred["Output Probabilities"]
</code>

===== Key Hyperparameters =====

^ Parameter ^ Base Model ^ Big Model ^
| $d_{model}$ (model dimension) | 512 | 1024 |
| $d_k = d_v$ (key/value dim) | 64 | 64 |
| $h$ (attention heads) | 8 | 16 |
| $N$ (layers) | 6 | 6 |
| $d_{ff}$ (FFN inner dim) | 2048 | 4096 |
| Parameters | 65M | 213M |

===== Code Example =====

<code python>
import math

import torch
import torch.nn as nn


class ScaledDotProductAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        d_k = Q.size(-1)
        # Similarity scores, scaled by sqrt(d_k) to keep the softmax well-conditioned.
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Masked positions get -inf so they receive zero attention weight.
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = torch.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V), attn_weights


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention()

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then split into heads: (batch, n_heads, seq_len, d_k).
        Q = self.W_q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        out, weights = self.attention(Q, K, V, mask)
        # Recombine heads to (batch, seq_len, d_model), then apply the output projection.
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        return self.W_o(out)
</code>

===== References =====

  * [[https://arxiv.org/abs/1706.03762|Vaswani et al. - Attention Is All You Need (2017)]]
  * [[https://arxiv.org/abs/2002.04745|Xiong et al. - On Layer Normalization in the Transformer Architecture]]
  * [[https://arxiv.org/abs/1607.06450|Ba et al. - Layer Normalization]]

===== See Also =====

  * [[attention_mechanism|Attention Mechanism]]
  * [[model_context_window|Model Context Window]]
  * [[inference_optimization|Inference Optimization]]
  * [[tokenization|Tokenization]]