Scaling laws describe the relationship between model capacity, training compute, and performance in machine learning systems. The choice of visualization method—logarithmic or normal (linear) scale—fundamentally affects how researchers and practitioners interpret scaling dynamics and their implications for model development. Log-log plots render scaling relationships as straight lines, while normal-scale plots expose the curvature of the underlying power law, each offering distinct intuitions about the practical constraints of scaling neural networks 1).
Scaling laws in deep learning typically follow power-law relationships of the form L(C) = aC^(-b), where L represents loss or error, C denotes compute budget, and a and b are empirically determined constants 2). This relationship holds approximately across multiple dimensions: model parameters, training tokens, and computational resources.
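This power-law form is easy to sketch numerically. In the snippet below, the constants a = 10 and b = 0.07 are hypothetical values chosen only for illustration; real constants are fit empirically per model family and training setup.

```python
def power_law_loss(compute, a=10.0, b=0.07):
    """Loss under the power law L(C) = a * C**(-b).

    a and b are hypothetical constants for illustration, not
    published scaling-law values.
    """
    return a * compute ** (-b)

# Loss falls as compute grows, but each order of magnitude helps less.
for c in (1e18, 1e19, 1e20):
    print(f"C = {c:.0e}  ->  L = {power_law_loss(c):.4f}")
```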
When plotted on logarithmic scales (log-log plots), power-law functions become linear, transforming the equation into log(L) = log(a) - b·log(C). This linearity provides elegant mathematical properties and makes it easy to identify the exponent b through simple linear regression. However, this same linearization masks the true complexity of the underlying relationship when viewed in natural (linear) scale coordinates 3).
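The "simple linear regression" step can be demonstrated directly: fitting a line to log-transformed data recovers the exponent b. The constants a_true = 10 and b_true = 0.07 below are assumed illustrative values, and the synthetic data is noise-free for clarity.

```python
import numpy as np

# Synthetic losses from a known power law (illustrative constants).
a_true, b_true = 10.0, 0.07
compute = np.logspace(18, 24, 20)          # compute budgets in FLOP
loss = a_true * compute ** (-b_true)

# In log-log space the power law is linear: log L = log a - b * log C,
# so an ordinary least-squares line fit recovers the exponent.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
b_fit = -slope
a_fit = 10.0 ** intercept
print(f"fitted b = {b_fit:.4f}, fitted a = {a_fit:.2f}")
```

On real (noisy) loss measurements the same fit yields an estimate of b rather than an exact recovery.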
Log-Scale Plots: Log-log plots present scaling relationships as straight lines, creating an intuition of consistent, predictable progress. This representation emphasizes the exponent value and makes comparisons between different scaling regimes appear mathematically clean. A scaling law with exponent b = 0.07, for instance, plots as a line with slope -0.07, regardless of whether the model has grown from 1M to 1B parameters or from 100B to 1T parameters. This visual consistency can suggest that scaling improvements continue indefinitely at a stable rate.
Normal-Scale Plots: When the same power-law relationship is plotted with linear axes, the straight line gives way to a sharply bending curve. The initial improvements appear dramatic—moving from 10M to 100M parameters might show substantial quality gains—but subsequent proportional increases in compute yield progressively smaller improvements. By the time models reach trillions of parameters, each additional order-of-magnitude increase in compute produces diminishing returns that appear visually as a flattening curve 4).
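The contrast between the two views can be checked numerically: under a power law, the ratio of losses between successive compute decades is constant (the straight line on log-log axes), while the absolute loss reduction per decade shrinks (the flattening curve on linear axes). The constants a = 10 and b = 0.07 are again assumed purely for illustration.

```python
a, b = 10.0, 0.07  # hypothetical power-law constants
decades = [10.0 ** k for k in range(18, 24)]
losses = [a * c ** (-b) for c in decades]

ratios = [l2 / l1 for l1, l2 in zip(losses, losses[1:])]
drops = [l1 - l2 for l1, l2 in zip(losses, losses[1:])]

# Constant ratio per decade -> straight line on log-log axes.
print("per-decade loss ratio:", [round(r, 4) for r in ratios])
# Shrinking absolute drop per decade -> flattening curve on linear axes.
print("per-decade loss drop: ", [round(d, 4) for d in drops])
```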
The choice of scale significantly impacts decision-making in AI development. Log-scale plots may encourage sustained investment in scaling, as the linear appearance suggests that resources devoted to scaling will yield consistent, predictable improvements. This visualization naturally supports the narrative of stable, exponential progress in model capabilities.
Normal-scale plots, conversely, highlight the practical challenge that achieving marginal quality improvements becomes exponentially more expensive as models scale. A system requiring 10^20 FLOP to achieve 85% accuracy might need 10^21 FLOP to reach 87%—a ten-fold increase in computation for only a 2 percentage-point improvement. This representation more naturally encodes the diminishing returns that practitioners experience, providing clearer guidance for resource allocation decisions and highlighting the computational barriers to continued progress 5).
In reinforcement learning scaling studies, the choice between log and normal scales influences how teams interpret the relationship between training compute and policy performance. Log-log plots may suggest that doubling compute always yields a fixed percentage improvement in win rate or score, while normal-scale plots reveal that the absolute performance gain shrinks as agents approach optimal play.
For large language model development, normal-scale visualization of loss curves more accurately reflects the challenge that reducing loss from 2.0 to 1.5 bits per token requires vastly more computation than reducing it from 3.0 to 2.5 bits per token, even though both steps represent the same 0.5-bit absolute reduction. This distinction directly impacts decisions about model sizes, training durations, and continuation thresholds.
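Inverting the power law makes this cost asymmetry explicit: solving L(C) = aC^(-b) for C gives C = (a/L)^(1/b), the compute needed to reach a target loss. The sketch below uses the same hypothetical constants (a = 10, b = 0.07) to compare the compute multipliers for two equal 0.5-bit reductions.

```python
def compute_for_loss(target_loss, a=10.0, b=0.07):
    """Invert L(C) = a * C**(-b): compute needed to reach target_loss.

    a and b are hypothetical constants chosen only for illustration.
    """
    return (a / target_loss) ** (1.0 / b)

# The same 0.5-bit reduction costs far more compute at lower loss:
early = compute_for_loss(2.5) / compute_for_loss(3.0)  # 3.0 -> 2.5 bits
late = compute_for_loss(1.5) / compute_for_loss(2.0)   # 2.0 -> 1.5 bits
print(f"3.0 -> 2.5 bits needs {early:.1f}x more compute")
print(f"2.0 -> 1.5 bits needs {late:.1f}x more compute")
```

The multiplier grows as loss falls, which is exactly the flattening that normal-scale plots make visible.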
Log-log plots serve important roles in mathematical analysis and academic communication, enabling clear identification of scaling exponents and comparison across wide ranges of values. However, normal-scale plots better communicate the physical reality of computational constraints and the lived experience of practitioners observing training progress.
Comprehensive scaling law analysis benefits from presenting both visualizations, as each reveals different aspects of the underlying dynamics. The linear representation in log space clarifies mathematical relationships and facilitates parameter estimation, while the exponential appearance in normal space more intuitively conveys the escalating difficulty of marginal improvements and the diminishing returns inherent to power-law relationships.