The Transformer is a neural network architecture introduced by Vaswani et al. in 2017 that revolutionized sequence modeling by replacing recurrence and convolutions with self-attention mechanisms. It forms the foundation of modern large language models, including GPT, Claude, Llama, and Gemini.
The original Transformer follows an encoder-decoder design for sequence-to-sequence tasks such as machine translation: the encoder maps the input sequence to a sequence of continuous representations, and the decoder generates the output one token at a time, attending both to its own previous outputs and to the encoder's output. Each encoder layer contains two sub-layers (multi-head self-attention and a position-wise feed-forward network); each decoder layer adds a third sub-layer that cross-attends over the encoder output.
Each sub-layer uses a residual connection followed by layer normalization: $\text{LayerNorm}(x + \text{Sublayer}(x))$.
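This post-norm residual wrapper can be sketched in plain Python; the learnable gain and bias of layer normalization are omitted here for brevity, and the toy vector dimensions are illustrative:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_connection(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    out = sublayer(x)
    return layer_norm([xi + oi for xi, oi in zip(x, out)])

# Example with an identity sublayer: the residual doubles each activation,
# then normalization recenters and rescales the result
y = sublayer_connection([1.0, 2.0, 3.0], lambda v: v)
```

Modern variants often move the normalization before the sub-layer ("pre-norm"), which tends to stabilize training of deep stacks, but the original paper uses the post-norm form shown here.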
Since the Transformer lacks recurrence, positional information is injected via sinusoidal functions:
$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
These encodings are added to the input embeddings, allowing the model to attend to relative positions since $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
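A minimal sketch of these formulas in plain Python (the `positional_encoding` helper name is illustrative): each dimension pair $(2i, 2i+1)$ shares one frequency, with even indices taking the sine and odd indices the cosine.

```python
import math

def positional_encoding(pos, d_model=512):
    """Sinusoidal encoding for one position: even dims sin, odd dims cos.
    Each pair (2i, 2i+1) shares the frequency 1 / 10000^(2i / d_model)."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(pos * freq)
        pe[2 * i + 1] = math.cos(pos * freq)
    return pe

pe0 = positional_encoding(0)  # position 0: sin terms are 0, cos terms are 1
```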
The core operation is scaled dot-product attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input. Scaling by $\sqrt{d_k}$ keeps the dot products from growing large in magnitude, which would otherwise push the softmax into regions of vanishing gradient.
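The formula can be traced on toy matrices in plain Python (nested lists, no framework; the values are illustrative). A single query attends over two keys; the key with the larger dot product receives the larger softmax weight, and the output is the correspondingly weighted mix of the value rows:

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for plain nested-list matrices."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    weights = [softmax(r) for r in scores]
    return [[sum(w * V[j][c] for j, w in enumerate(wr)) for c in range(len(V[0]))]
            for wr in weights]

# One query aligned with the first key: most of the weight goes to V's first row
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```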
Multi-head attention runs $h$ parallel attention heads with different learned projections:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Each encoder and decoder layer also contains a position-wise feed-forward network (FFN), applied identically and independently to every token position:
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
The inner dimension is typically $d_{ff} = 4 \cdot d_{model}$, providing nonlinearity and per-position transformation capacity.
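The FFN is just two linear maps with a ReLU between them, applied to each token vector separately. A plain-Python sketch with toy dimensions ($d_{model}=2$, $d_{ff}=2$ here purely for readability; real models use $d_{ff} = 4 \cdot d_{model}$):

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: ReLU(x W1 + b1) W2 + b2, for one token vector x."""
    h = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(h)) + b2[k]
            for k in range(len(b2))]

# Toy example: the ReLU zeroes the negative hidden unit before the second projection
out = ffn([1.0, -2.0],
          [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],   # W1 (identity), b1
          [[1.0, 1.0], [1.0, 1.0]], [0.0, 0.0])   # W2 (all ones), b2
```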
| Parameter | Base Model | Big Model |
|---|---|---|
| $d_{model}$ (model dimension) | 512 | 1024 |
| $d_k = d_v$ (key/value dim) | 64 | 64 |
| $h$ (attention heads) | 8 | 16 |
| $N$ (layers) | 6 | 6 |
| $d_{ff}$ (FFN inner dim) | 2048 | 4096 |
| Parameters | 65M | 213M |
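The base model's 65M figure can be roughly sanity-checked from the table. The arithmetic below is a back-of-the-envelope sketch: it assumes a ~37K shared BPE vocabulary (the paper's EN-DE setup) and ignores biases and layer-normalization parameters.

```python
d_model, d_ff, N, vocab = 512, 2048, 6, 37000  # base config; ~37K BPE vocab assumed

attn = 4 * d_model * d_model      # W_q, W_k, W_v, W_o projections
ffn = 2 * d_model * d_ff          # W_1 and W_2
encoder = N * (attn + ffn)        # self-attention + FFN per encoder layer
decoder = N * (2 * attn + ffn)    # masked self-attn + cross-attn + FFN per layer
embeddings = vocab * d_model      # shared input/output embedding matrix
total = encoder + decoder + embeddings
print(f"{total / 1e6:.1f}M")      # roughly 63M, close to the reported 65M
```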
A minimal PyTorch implementation of these two operations:

```python
import torch
import torch.nn as nn
import math


class ScaledDotProductAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        d_k = Q.size(-1)
        # (batch, heads, seq_q, d_k) @ (batch, heads, d_k, seq_k) -> scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Positions where mask == 0 are excluded from attention
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = torch.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V), attn_weights


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention()

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then split d_model into (n_heads, d_k): (batch, heads, seq, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        out, weights = self.attention(Q, K, V, mask)
        # Merge heads back: (batch, seq, n_heads * d_k) = (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1,
                                                    self.n_heads * self.d_k)
        return self.W_o(out)
```
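In a decoder, the `mask` argument carries a causal (look-ahead) mask so position $i$ can attend only to positions $j \le i$. One conventional way to build it (the extra singleton dimensions broadcast over batch and heads when used with a `masked_fill` on scores, as in the attention code above):

```python
import torch

# Causal mask: ones on and below the diagonal mark allowed attention pairs;
# zeros mark future positions, which get replaced with -inf before the softmax.
seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)
# mask shape: (1, 1, seq_len, seq_len), broadcastable over (batch, heads, ...)
```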