Model Compression and Quantization

Model compression and quantization are fundamental techniques for reducing the computational and memory requirements of large language models (LLMs) while maintaining acceptable performance. These methods enable deployment on resource-constrained devices, reduce inference latency, and lower operational costs, all critical considerations for practical machine learning systems. Quantization represents model weights and activations in lower-precision numerical formats, while compression encompasses a broader set of techniques including pruning, distillation, and bit-width reduction.

Overview and Motivation

Large language models with billions or hundreds of billions of parameters require substantial memory and computational resources. Standard implementations use 32-bit floating-point (FP32) or 16-bit floating-point (FP16) representations, consuming four or two bytes per parameter respectively. A 7-billion-parameter model in FP32 therefore requires approximately 28 GB of memory for weights alone, creating barriers to deployment on consumer hardware, edge devices, and cost-constrained inference infrastructure 1).
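The arithmetic behind these figures is simply parameter count times bytes per parameter. A minimal sketch (the `weight_memory_gb` helper and the dtype table are illustrative, not from any particular library):

```python
# Bytes of storage per parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: int, dtype: str) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

# A 7B-parameter model at different precisions.
for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(7_000_000_000, dtype):.1f} GB")
# fp32: 28.0 GB, fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```

The int4 row shows where the roughly 87% size reduction of 4-bit quantization comes from.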

Quantization addresses this constraint by representing parameters using lower bit-widths—commonly 8-bit, 4-bit, or even binary representations. This approach can reduce model size by 75-95% compared to full precision, enabling deployment scenarios previously infeasible. The fundamental challenge involves maintaining model accuracy while reducing bit-width, as aggressive quantization introduces rounding errors that can degrade performance 2).

Quantization Methods and Technical Approaches

Post-Training Quantization (PTQ) represents the simplest approach, applying quantization to pre-trained models without additional training. PTQ determines quantization parameters (scale factors and zero-points) based on the distribution of activations and weights observed during a calibration phase on representative data. This method incurs minimal computational overhead but may suffer accuracy degradation, particularly with aggressive 4-bit or lower quantization schemes.
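As a concrete illustration of the calibration step, the sketch below derives a scale and zero-point from the observed minimum and maximum of calibration values for an asymmetric 8-bit scheme. The function names are mine, and production toolkits are considerably more involved (per-channel scales, percentile clipping, and so on):

```python
def calibrate(values, n_bits=8):
    """Derive scale and zero-point from observed min/max (asymmetric scheme)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    """Map a real value to an integer code, clamped to the representable range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** n_bits - 1, q))

def dequantize(q, scale, zero_point):
    """Map an integer code back to its approximate real value."""
    return (q - zero_point) * scale

vals = [-1.0, -0.2, 0.0, 0.5, 2.0]          # stand-in calibration data
s, z = calibrate(vals)
roundtrip = [dequantize(quantize(v, s, z), s, z) for v in vals]
```

For in-range values the round-trip error is bounded by half the scale, which is why calibration data that misses the true activation range leads to either clipping or wasted resolution.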

Quantization-Aware Training (QAT) incorporates quantization into the training process, simulating quantized inference during training through straight-through estimators or other gradient approximation techniques. QAT typically recovers 1-3 percentage points of accuracy relative to post-training quantization, but requires access to training data and computational resources for fine-tuning 3).
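The straight-through estimator can be sketched without a framework: the forward pass uses a quantize-dequantize round trip ("fake quantization"), and the backward pass treats rounding as the identity while zeroing gradients for clipped values. A simplified scalar version, with hypothetical function names:

```python
def fake_quant(x, scale, n_bits=8):
    """Quantize-dequantize in one step: the forward pass QAT simulates."""
    qmax = 2 ** (n_bits - 1) - 1              # symmetric signed range
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale

def ste_grad(upstream, x, scale, n_bits=8):
    """Straight-through estimator: pretend rounding has derivative 1,
    but pass zero gradient where the value was clipped."""
    qmax = 2 ** (n_bits - 1) - 1
    in_range = (-qmax - 1) * scale <= x <= qmax * scale
    return upstream if in_range else 0.0
```

Frameworks implement the same idea tensor-wise; the essential point is that the non-differentiable rounding step is bypassed in the backward pass rather than differentiated.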

4-Bit Quantization has emerged as a practical sweet spot, reducing model size to approximately one-quarter of FP32 requirements while maintaining reasonable accuracy for many applications. Methods like QLoRA combine 4-bit quantization with low-rank adaptation, enabling efficient fine-tuning of quantized models without full precision backward passes 4).

Stochastic rounding introduces probabilistic precision reduction rather than deterministic truncation. Instead of rounding to the nearest representable value, stochastic rounding selects the lower or upper quantization level with probability proportional to distance. This approach reduces systematic bias and can improve convergence properties in training scenarios.
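A minimal implementation of this rule, assuming `stochastic_round` would be applied element-wise during quantization:

```python
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    """Round to floor(x) or ceil(x), choosing the upper level with
    probability equal to the fractional part, so the result is
    unbiased in expectation: E[stochastic_round(x)] = x."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)

rng = random.Random(0)
mean = sum(stochastic_round(0.3, rng) for _ in range(100_000)) / 100_000
# mean is close to 0.3, whereas deterministic rounding would always give 0
```

Deterministic nearest rounding maps 0.3 to 0 every time, accumulating a systematic bias of 0.3 per element; the stochastic version averages out to the true value, which is the convergence benefit described above.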

RHT (Row-wise Hessian Trace) stabilization improves quantization stability by computing per-row scaling factors from second-order information. Rather than applying a uniform or channel-wise scale, RHT-based methods weight each row of the weight matrix by its estimated importance, protecting sensitive parameters from aggressive quantization while permitting coarser precision for less critical rows.
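Whatever the exact second-order criterion, the mechanical ingredient is one scale per weight-matrix row rather than one per tensor. The sketch below shows plain row-wise max-abs scaling only; the Hessian-based importance weighting described above is elided, and the function names are mine:

```python
def rowwise_quantize(weights, n_bits=4):
    """Symmetric quantization with one scale per row. A row with a large
    dynamic range gets a coarser grid, while small rows keep fine
    resolution, which a single per-tensor scale cannot provide."""
    qmax = 2 ** (n_bits - 1) - 1
    quantized, scales = [], []
    for row in weights:
        peak = max(abs(v) for v in row)
        scale = peak / qmax if peak > 0 else 1.0
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales

def rowwise_dequantize(quantized, scales):
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

A Hessian-aware variant would additionally shrink or enlarge each row's effective precision according to its sensitivity estimate, rather than treating all rows equally.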

Applications and Practical Implementation

Quantized models enable deployment across diverse scenarios: mobile devices with limited memory and compute, edge inference on embedded systems, cost-effective cloud inference through reduced GPU memory requirements, and real-time inference applications where latency constraints prevent use of full-precision models. Companies deploying LLMs at scale commonly utilize 4-bit or 8-bit quantization to reduce infrastructure costs and improve inference throughput.

Inference acceleration is the primary application, as quantized operations complete faster on specialized hardware. 8-bit matrix operations execute efficiently on modern GPUs and CPUs, while 4-bit operations benefit from custom kernels optimized for low-precision arithmetic. Quantization also reduces the bandwidth required between memory and compute units, which is often the bottleneck in large-model inference.

Fine-tuning quantized models through techniques like QLoRA enables adaptation to downstream tasks without maintaining full precision weights. This approach combines 4-bit base model quantization with learnable low-rank adapters, reducing training memory requirements to single-GPU configurations while achieving accuracy competitive with full fine-tuning.
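The memory saving follows from training only the adapter parameters while the quantized base weights stay frozen. A back-of-envelope sketch, using a hypothetical 4096x4096 projection matrix and rank 16 (both numbers illustrative):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of a low-rank adapter B @ A for one weight
    matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

full = 4096 * 4096                       # one full projection matrix
adapter = lora_param_count(4096, 4096, rank=16)
ratio = adapter / full                   # well under 1% is trainable
```

Since optimizer state (gradients, moments) is only kept for the adapter, training-time memory scales with `adapter` rather than `full`, which is what brings fine-tuning within reach of a single GPU.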

Limitations and Challenges

Accuracy degradation remains the primary challenge, particularly with aggressive quantization schemes below 8 bits. Some model architectures prove more quantization-sensitive than others, with certain attention mechanisms and activation functions showing greater degradation. Outlier activation values—extreme values in particular layers—create challenges for uniform quantization schemes, as scales must accommodate the full range rather than the typical distribution.
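The outlier problem can be demonstrated numerically: a single extreme value stretches the quantization scale so far that typical values lose almost all resolution. A small sketch on synthetic data (helper name mine):

```python
def quant_error(values, n_bits=8):
    """Mean absolute round-trip error of symmetric uniform quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    deq = [round(v / scale) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

typical = [0.01 * i for i in range(-50, 51)]   # activations in [-0.5, 0.5]
with_outlier = typical + [50.0]                # one extreme activation
# The outlier stretches the scale ~100x, so the typical values now fall
# into only a handful of quantization bins and the mean error explodes.
```

This is why mixed-precision schemes and outlier-aware methods keep a few extreme channels in higher precision rather than letting them dictate the scale for the whole tensor.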

Hardware compatibility constraints limit practical quantization benefits. While research supports 2-bit or ternary quantization theoretically, actual hardware kernels may not exist, forcing fallback to higher precision during inference. Effective quantization requires both software support (appropriate frameworks and operators) and hardware acceleration, limiting deployment flexibility.

Calibration data requirements affect post-training quantization quality. Effective calibration requires representative data reflecting the distribution of actual deployment scenarios. Insufficient or unrepresentative calibration data leads to poor quantization parameters and suboptimal runtime performance.

See Also

References