====== Transformer Architecture ======
The Transformer is a neural network architecture introduced by Vaswani et al. in 2017 that revolutionized sequence modeling by replacing recurrence and convolutions with self-attention mechanisms. It forms the foundation of modern large language models such as GPT, Claude, Llama, and Gemini.
===== Encoder-Decoder Structure =====
The original Transformer follows an encoder-decoder design for sequence-to-sequence tasks such as machine translation.
* **Encoder**: A stack of $N$ identical layers, each containing a multi-head self-attention sub-layer and a position-wise feed-forward network. It processes the full input sequence into contextualized representations.
* **Decoder**: Also $N$ identical layers, each with three sub-layers: masked multi-head self-attention (preventing attention to future positions), multi-head cross-attention over encoder output, and a position-wise feed-forward network. The decoder generates tokens autoregressively.
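The decoder's causal mask can be sketched as a lower-triangular matrix of permitted positions. A minimal pure-Python illustration (not tied to any particular framework):

```python
def causal_mask(n):
    # Entry [i][j] is 1 where position i may attend to position j (j <= i),
    # and 0 for future positions, which are masked out before the softmax.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

mask = causal_mask(3)
# Row i permits attention only to positions 0..i.
```

In practice the zero entries are replaced with $-\infty$ in the attention scores so the softmax assigns them zero weight.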
Each sub-layer is wrapped in a residual connection followed by layer normalization: $\text{LayerNorm}(x + \text{Sublayer}(x))$. This post-norm arrangement is the original design; many later variants instead normalize before each sub-layer (pre-norm), which tends to train more stably.
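The residual-plus-normalization wrapper can be illustrated with a minimal pure-Python sketch (the learned gain and bias parameters of layer normalization are omitted for brevity):

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_connection(x, sublayer):
    # Post-norm residual: LayerNorm(x + Sublayer(x)).
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

out = sublayer_connection([1.0, 2.0, 3.0], lambda v: [2.0 * t for t in v])
```

The residual path lets gradients flow directly through the stack, which is part of why such deep attention stacks remain trainable.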
===== Positional Encoding =====
Since the Transformer lacks recurrence, positional information is injected via sinusoidal functions:
$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
These encodings are added to the input embeddings, allowing the model to attend to relative positions since $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
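The sinusoidal formulas above translate directly into a short pure-Python function (one position at a time, for illustration):

```python
import math

def positional_encoding(pos, d_model):
    # One position's sinusoidal encoding: even indices use sine, odd cosine,
    # with wavelengths forming a geometric progression from 2*pi to 10000*2*pi.
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe

pe0 = positional_encoding(0, 8)  # sin(0) = 0 and cos(0) = 1 in every pair
```

Each position thus gets a unique, deterministic vector that can be added to its token embedding without any learned parameters.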
===== Self-Attention =====
The core operation is **scaled dot-product attention**:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input. Scaling by $\sqrt{d_k}$ keeps the dot products from growing with dimension; without it, large logits push the softmax into regions with vanishingly small gradients.
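The effect of the scaling factor can be checked numerically: dot products of two random vectors with unit-variance components have variance roughly $d_k$, so dividing by $\sqrt{d_k}$ restores unit scale. A small illustrative simulation:

```python
import random
import statistics

random.seed(0)
d_k = 64

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Sample dot products of pairs of random vectors with N(0, 1) components.
samples = [dot([random.gauss(0, 1) for _ in range(d_k)],
               [random.gauss(0, 1) for _ in range(d_k)])
           for _ in range(2000)]

var_raw = statistics.pvariance(samples)                          # roughly d_k
var_scaled = statistics.pvariance([s / d_k ** 0.5 for s in samples])  # roughly 1
```

Without the division, attention logits at $d_k = 64$ would be about 8x larger in standard deviation, saturating the softmax.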
**Multi-head attention** runs $h$ parallel attention heads with different learned projections:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
===== Feed-Forward Network =====
Each layer contains a position-wise FFN applied identically to every token:
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
The inner dimension is typically $d_{ff} = 4 \cdot d_{model}$, providing nonlinearity and per-position transformation capacity.
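The FFN formula can be sketched in pure Python with toy dimensions (real models use $d_{model} = 512$, $d_{ff} = 2048$; weights here are arbitrary illustrative values):

```python
def ffn(x, W1, b1, W2, b2):
    # ReLU(x W1 + b1) W2 + b2, applied independently to each position.
    h = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    return [sum(hi * W2[i][j] for i, hi in enumerate(h)) + b2[j]
            for j in range(len(b2))]

# Toy example with d_model = 2, d_ff = 4.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
b2 = [0.0, 0.0]
y = ffn([1.0, 2.0], W1, b1, W2, b2)
```

Because the same weights are applied at every position, the FFN adds per-token capacity without mixing information across the sequence; that mixing is the attention sub-layer's job.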
===== Architecture Diagram =====
<code>
graph TB
    Input["Input Tokens"] --> Embed["Token Embedding + Positional Encoding"]
    Embed --> Enc1["Encoder Layer 1"]
    Enc1 --> EncN["Encoder Layer N"]
    subgraph EncoderLayer["Encoder Layer"]
        SA["Multi-Head Self-Attention"] --> AN1["Add & LayerNorm"]
        AN1 --> FFN["Feed-Forward Network"]
        FFN --> AN2["Add & LayerNorm"]
    end
    EncN --> CrossAttn
    Output["Output Tokens (shifted)"] --> DEmbed["Token Embedding + Positional Encoding"]
    DEmbed --> Dec1["Decoder Layer 1"]
    Dec1 --> DecN["Decoder Layer N"]
    subgraph DecoderLayer["Decoder Layer"]
        MSA["Masked Self-Attention"] --> DAN1["Add & LayerNorm"]
        DAN1 --> CrossAttn["Cross-Attention over Encoder"]
        CrossAttn --> DAN2["Add & LayerNorm"]
        DAN2 --> DFFN["Feed-Forward Network"]
        DFFN --> DAN3["Add & LayerNorm"]
    end
    DecN --> Linear["Linear + Softmax"]
    Linear --> Pred["Output Probabilities"]
</code>
===== Key Hyperparameters =====
^ Parameter ^ Base Model ^ Big Model ^
| $d_{model}$ (model dimension) | 512 | 1024 |
| $d_k = d_v$ (key/value dim) | 64 | 64 |
| $h$ (attention heads) | 8 | 16 |
| $N$ (layers) | 6 | 6 |
| $d_{ff}$ (FFN inner dim) | 2048 | 4096 |
| Parameters | 65M | 213M |
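The table's base-model parameter count can be roughly reproduced from the weight-matrix shapes. This approximation ignores biases and layer-norm parameters, and assumes a shared source-target vocabulary of about 37,000 BPE tokens (the figure used for the paper's WMT English-German setup):

```python
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 37000

attn = 4 * d_model * d_model          # W_q, W_k, W_v, W_o
ffn = 2 * d_model * d_ff              # W_1 and W_2

enc = n_layers * (attn + ffn)         # self-attention + FFN per encoder layer
dec = n_layers * (2 * attn + ffn)     # self-attn + cross-attn + FFN per layer
embed = vocab * d_model               # shared input/output embedding matrix

total = enc + dec + embed             # roughly 63M, close to the table's 65M
```

The small shortfall relative to 65M comes from the omitted bias and normalization parameters.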
===== Code Example =====
<code python>
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        d_k = Q.size(-1)
        # Similarity scores, scaled by sqrt(d_k) to keep the softmax well-conditioned.
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Masked positions get -inf so they receive zero attention weight.
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = torch.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V), attn_weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention()

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then split into heads: (batch, heads, seq_len, d_k).
        Q = self.W_q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        # mask (if given) must broadcast over the head dimension.
        out, weights = self.attention(Q, K, V, mask)
        # Merge heads back into a single (batch, seq_len, d_model) tensor.
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        return self.W_o(out)
</code>
===== References =====
* [[https://arxiv.org/abs/1706.03762|Vaswani et al. - Attention Is All You Need (2017)]]
* [[https://arxiv.org/abs/2002.04745|Xiong et al. - On Layer Normalization in the Transformer Architecture]]
* [[https://arxiv.org/abs/1607.06450|Ba et al. - Layer Normalization]]
===== See Also =====
* [[attention_mechanism|Attention Mechanism]]
* [[model_context_window|Model Context Window]]
* [[inference_optimization|Inference Optimization]]
* [[tokenization|Tokenization]]