====== Scaling Laws ======

**Scaling laws** are mathematical relationships, typically expressed as power laws, that describe how model performance scales with computational resources, model parameters, training data volume, and other training-related variables. These empirical relationships enable researchers and practitioners to predict model performance before training and to optimize the allocation of computational resources across model size and dataset dimensions.

===== Overview and Definition =====

Scaling laws describe the relationship between input variables (such as compute budget, number of model parameters, and dataset size) and output performance metrics (typically measured as [[test_loss|test loss]] or downstream task performance). Rather than requiring full model training to understand performance characteristics, scaling laws allow practitioners to estimate the expected performance of a model given specific resource constraints (([[https://arxiv.org/abs/2001.08361|Kaplan et al. - Scaling Laws for Neural Language Models (2020)]])).

The fundamental insight of scaling laws is that model performance typically follows a power-law relationship with respect to scale dimensions. This means that doubling a resource input (such as compute or parameters) yields a predictable, quantifiable improvement in performance, rather than an exponential or diminishing gain. This predictability has profound implications for resource planning and model development strategy (([[https://arxiv.org/abs/2010.14701|Henighan et al. - Scaling Laws for Autoregressive Generative Modeling (2020)]])).

===== Mathematical Framework =====

Scaling laws are typically expressed using power-law equations of the form:

L(N) = a · N^(-α) + ε

where L represents model loss, N represents a scaling dimension (such as parameter count or training tokens), a is a proportionality constant, α is the scaling exponent, and ε represents the irreducible loss.
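As a minimal sketch of this functional form, the snippet below evaluates the power law for a parameter-count dimension. The constants (a = 2.0, α = 0.076, ε = 1.69) are purely illustrative, not measured values; the point is that each doubling of N multiplies the reducible part of the loss by the same fixed factor, 2^(-α).

```python
def scaling_loss(n, a=2.0, alpha=0.076, irreducible=1.69):
    # L(N) = a * N^(-alpha) + eps; constants here are illustrative only
    return a * n ** (-alpha) + irreducible

# Each doubling of N shrinks the reducible loss (L - eps) by the same
# constant factor, 2 ** (-alpha), regardless of the starting scale.
losses = [scaling_loss(n) for n in (1e9, 2e9, 4e9)]
```

This constant-ratio-per-doubling behaviour is what makes loss curves look like straight lines on log-log plots.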
Research has identified distinct scaling exponents for different dimensions. The compute scaling exponent typically ranges from roughly 0.05 to 0.08, meaning that loss falls approximately as compute raised to a negative power in that range; parameter count and dataset size exhibit their own characteristic scaling exponents.

Foundational research has demonstrated smooth [[power_law|power law]] relationships between model parameters, data volume, training compute, and test loss across eight orders of magnitude of compute, establishing core principles that have become widely adopted in the field (([[https://cameronrwolfe.substack.com/p/rl-scaling-laws|Deep Learning Focus - Neural Scaling Laws (2026)]])).

A widely adopted approximation models the relationship between compute budget C, parameters N, and training tokens D as C ≈ 6ND.

The **[[chinchilla_paper|Chinchilla]] scaling laws** identified optimal allocation ratios between model parameters and training tokens, suggesting that for a given compute budget, parameters and training tokens should be scaled in roughly equal proportion, with approximately 20 training tokens per model parameter. This contradicted earlier approaches that allocated disproportionately larger budgets to model size relative to training data (([[https://arxiv.org/abs/2203.15556|Hoffmann et al. - Training Compute-Optimal Large Language Models (2022)]])).

===== Empirical Evidence and Applications =====

Scaling laws have been demonstrated across multiple scales and domains. Research on large language models ranging from billions to hundreds of billions of parameters has consistently shown that power-law relationships hold across these scales, enabling extrapolation of performance to larger models than have been trained.
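One concrete application is compute-budget allocation. The sketch below combines the C ≈ 6ND approximation with the Chinchilla-style rule of thumb of roughly 20 tokens per parameter; the helper names and the `tokens_per_param` default are this sketch's assumptions, not a fixed standard.

```python
def train_flops(n_params, n_tokens):
    # Widely used approximation for transformer training compute: C ≈ 6 * N * D
    return 6 * n_params * n_tokens

def compute_optimal_allocation(flop_budget, tokens_per_param=20):
    # Chinchilla-style rule of thumb: D ≈ 20 * N, so C ≈ 6 * N * (20 * N) = 120 * N²
    n_params = (flop_budget / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Example: the budget of a 70B-parameter model trained on 1.4T tokens
budget = train_flops(70e9, 1.4e12)         # ≈ 5.9e23 FLOPs
n, d = compute_optimal_allocation(budget)  # recovers ≈ 70e9 params, 1.4e12 tokens
```

Inverting the budget formula this way is how practitioners turn a fixed FLOP budget into a concrete model-size and token-count pair before any training run.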
This capability has been applied to estimate the performance of models that would require prohibitive computational resources to train directly.

Practitioners use scaling laws for several purposes: predicting the performance of proposed models before committing resources to training, determining optimal model size given a fixed compute budget, and deciding whether to allocate additional resources to model size or dataset size. Organizations can use these relationships to make cost-benefit decisions about infrastructure investment and model development priorities (([[https://arxiv.org/abs/2107.04668|Hernandez et al. - Scaling Laws for Transfer (2021)]])).

Scaling laws have also been identified for capabilities beyond basic language modeling. Research has shown that downstream task performance, reasoning capabilities, and other emergent properties follow similar scaling relationships, though with different exponents and constants depending on the specific capability being measured.

===== Limitations and Challenges =====

While scaling laws have proven remarkably consistent within observed ranges, several limitations constrain their application. The functional form of scaling laws may change at extreme scales, and extrapolation far beyond observed data carries significant uncertainty. Additionally, scaling laws typically describe test loss or aggregate metrics; they provide limited insight into qualitative differences in model behavior, failure modes, or alignment properties as models scale.

Scaling laws also do not account for architectural innovations, training methodologies, or post-training techniques such as instruction tuning and [[rlhf|reinforcement learning from human feedback]], which may substantially alter the relationship between scale and performance. Furthermore, the energy and environmental costs of scaling, along with considerations of fairness and accessibility, represent practical constraints that scaling laws alone do not address.
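The extrapolation risk can be made concrete with a small numerical sketch (all constants illustrative): two power-law fits whose exponents differ only slightly, forced to agree exactly at a hypothetical pilot scale, diverge noticeably when pushed six orders of magnitude further.

```python
def reducible_loss(n_params, a, alpha):
    # Reducible part of the power law, a * N^(-alpha)
    return a * n_params ** (-alpha)

PILOT, TARGET = 1e7, 1e13          # hypothetical pilot and extrapolation scales
alpha_a, alpha_b = 0.076, 0.070    # two fitted exponents, ~8% apart
a_a = 2.0
a_b = a_a * PILOT ** (alpha_b - alpha_a)  # force agreement at the pilot scale

at_pilot = (reducible_loss(PILOT, a_a, alpha_a), reducible_loss(PILOT, a_b, alpha_b))
at_target = (reducible_loss(TARGET, a_a, alpha_a), reducible_loss(TARGET, a_b, alpha_b))
gap = abs(at_target[0] - at_target[1]) / at_target[0]  # relative disagreement
```

The two laws are indistinguishable at the pilot scale yet disagree by several percent of the reducible loss at the target scale, which is why fitted exponents need to be estimated from runs spanning as wide a range as possible.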
Different domains and tasks may exhibit different scaling characteristics, and scaling laws derived from one domain may not transfer reliably to others. The presence of phase transitions or capability jumps at certain scales remains incompletely understood, limiting the predictive power of simple power-law models.

===== See Also =====

  * [[log_scale_vs_normal_scale_plots|Log Scale vs Normal Scale Scaling Law Plots]]
  * [[pretraining_scaling_vs_rl_scaling|Pretraining Scaling Laws vs RL Scaling Laws]]
  * [[power_law|Power Law]]
  * [[modelweights|Model Weights]]

===== References =====