Transformer Architecture

The Transformer is a neural network architecture introduced by Vaswani et al. in 2017 ("Attention Is All You Need") that reshaped sequence modeling by replacing recurrence and convolutions with self-attention mechanisms. It forms the foundation of modern large language models, including GPT, Claude, Llama, and Gemini.

Encoder-Decoder Structure

The original Transformer follows an encoder-decoder design for sequence-to-sequence tasks such as machine translation. The encoder maps the input sequence to a stack of continuous representations; the decoder generates the output sequence autoregressively, attending to the encoder's output through cross-attention. Both stacks consist of $N$ identical layers.

Each sub-layer uses a residual connection followed by layer normalization: $\text{LayerNorm}(x + \text{Sublayer}(x))$.
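This residual pattern can be sketched as a small wrapper module (a minimal sketch; the class name `SublayerConnection` is illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Applies LayerNorm(x + sublayer(x)), the post-norm residual pattern."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # sublayer is any callable mapping (batch, seq, d_model) -> same shape
        return self.norm(x + sublayer(x))
```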

Positional Encoding

Since the Transformer lacks recurrence, positional information is injected via sinusoidal functions:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

These encodings are added to the input embeddings, allowing the model to attend to relative positions since $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
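The sinusoidal table above can be computed directly from the two formulas (a minimal sketch; returning a `(max_len, d_model)` tensor is an implementation choice):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) table: sin on even indices, cos on odd."""
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    # 10000^(-2i/d_model) for each even index 2i, computed in log space
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
```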

Self-Attention

The core operation is scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input. Scaling by $\sqrt{d_k}$ keeps the dot products from growing with dimension and pushing the softmax into saturated regions where gradients vanish.
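The need for the $\sqrt{d_k}$ factor can be checked empirically: for unit-variance queries and keys, the dot product $q \cdot k$ has variance roughly $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance (a small sanity check, not from the paper):

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)   # unit-variance queries
k = torch.randn(10000, d_k)   # unit-variance keys
scores = (q * k).sum(dim=-1)  # one dot product per row

print(scores.var().item())                 # close to d_k = 512
print((scores / d_k ** 0.5).var().item())  # close to 1 after scaling
```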

Multi-head attention runs $h$ parallel attention heads with different learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Feed-Forward Network

Each layer contains a position-wise FFN applied identically to every token:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The inner dimension is typically $d_{ff} = 4 \cdot d_{model}$, providing nonlinearity and per-position transformation capacity.
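A direct implementation of this FFN (a minimal sketch; dropout and alternative activations such as GELU, common in later models, are omitted):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # expand to the inner dimension
        self.linear2 = nn.Linear(d_ff, d_model)  # project back to d_model

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))
```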

Architecture Diagram

graph TB
    Input["Input Tokens"] --> Embed["Token Embedding + Positional Encoding"]
    Embed --> Enc1["Encoder Layer 1"]
    Enc1 --> EncN["Encoder Layer N"]
    subgraph EncoderLayer["Encoder Layer"]
        SA["Multi-Head Self-Attention"] --> AN1["Add & LayerNorm"]
        AN1 --> FFN["Feed-Forward Network"]
        FFN --> AN2["Add & LayerNorm"]
    end
    EncN --> CrossAttn
    Output["Output Tokens (shifted)"] --> DEmbed["Token Embedding + Positional Encoding"]
    DEmbed --> Dec1["Decoder Layer 1"]
    Dec1 --> DecN["Decoder Layer N"]
    subgraph DecoderLayer["Decoder Layer"]
        MSA["Masked Self-Attention"] --> DAN1["Add & LayerNorm"]
        DAN1 --> CrossAttn["Cross-Attention over Encoder"]
        CrossAttn --> DAN2["Add & LayerNorm"]
        DAN2 --> DFFN["Feed-Forward Network"]
        DFFN --> DAN3["Add & LayerNorm"]
    end
    DecN --> Linear["Linear + Softmax"]
    Linear --> Pred["Output Probabilities"]

Key Hyperparameters

| Parameter | Base Model | Big Model |
|---|---|---|
| $d_{model}$ (model dimension) | 512 | 1024 |
| $d_k = d_v$ (key/value dim) | 64 | 64 |
| $h$ (attention heads) | 8 | 16 |
| $N$ (layers) | 6 | 6 |
| $d_{ff}$ (FFN inner dim) | 2048 | 4096 |
| Parameters | 65M | 213M |

Code Example

import torch
import torch.nn as nn
import math
 
class ScaledDotProductAttention(nn.Module):
    """Computes softmax(QK^T / sqrt(d_k)) V, optionally masking positions."""
    def forward(self, Q, K, V, mask=None):
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Masked positions (mask == 0) get -inf so they receive zero weight
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = torch.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V), attn_weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention()

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then split d_model into heads: (batch, n_heads, seq_len, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        out, weights = self.attention(Q, K, V, mask)
        # Merge heads back: (batch, seq_len, n_heads * d_k) = (batch, seq_len, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        return self.W_o(out)
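For comparison, PyTorch ships an equivalent built-in module. A quick shape check with it (note that `nn.MultiheadAttention` returns both the output and the attention weights, averaged over heads by default):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
out, weights = mha(x, x, x)  # self-attention: Q = K = V
print(out.shape)             # torch.Size([2, 10, 512])
print(weights.shape)         # torch.Size([2, 10, 10])
```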
