Model compression and quantization refer to a set of techniques designed to reduce the computational requirements, memory footprint, and inference latency of machine learning models, particularly large language models and neural networks. These methods enable deployment of sophisticated models on resource-constrained hardware, including edge devices and single-GPU systems, without substantial degradation in model performance. Compression and quantization have become essential practices in making advanced AI systems practical for real-world deployment scenarios.
Modern deep learning models, especially large language models (LLMs), often contain billions or trillions of parameters, making them computationally expensive and memory-intensive to deploy. A single inference pass may require gigabytes of memory and significant computational resources, limiting accessibility and increasing operational costs. Model compression addresses this challenge through various mathematical and algorithmic techniques that reduce model size while preserving functional capacity 1).
Quantization specifically refers to the process of reducing the precision of numerical representations in models—converting from higher-precision formats like FP32 (32-bit floating-point) to lower-precision formats such as INT8 (8-bit integer) or FP8 (8-bit floating-point). This reduction in numerical precision directly decreases memory requirements and can dramatically accelerate computation through optimized low-precision hardware operations 2).
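As a concrete illustration of this precision reduction, the following sketch shows asymmetric (affine) INT8 quantization of an FP32 tensor with NumPy. The function names and the single global scale/zero-point are illustrative choices, not a specific library's API; production systems typically quantize per channel or per group.

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric (affine) quantization: map an FP32 tensor onto INT8
    using one scale and one zero-point for the whole tensor."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an FP32 approximation of the original tensor."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)
q, s, z = quantize_int8(weights)
restored = dequantize(q, s, z)
# Each INT8 value occupies 1 byte instead of 4; the reconstruction error
# is bounded by roughly one quantization step (the scale).
```

The memory saving here is the 4x reduction from 32-bit to 8-bit storage; the speedup on real hardware additionally comes from integer matrix-multiply units operating directly on the INT8 values.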
Knowledge Distillation is a foundational compression approach where a smaller “student” model learns to replicate the behavior of a larger “teacher” model through a training process that minimizes the divergence between their output distributions. This technique transfers learned representations and generalizations from the teacher to the student, resulting in a compact model with performance characteristics approaching the original 3).
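The divergence being minimized can be sketched as follows: a temperature-softened KL divergence between teacher and student output distributions, with the T² factor from Hinton et al. (2015) that keeps gradient magnitudes comparable across temperatures. The function names and the choice of T are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep a similar magnitude as T varies."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student predictions
    return (T ** 2) * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
```

In practice this distillation term is usually mixed with a standard cross-entropy loss on the ground-truth labels; the soft targets carry the teacher's "dark knowledge" about relative probabilities of incorrect classes.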
Mixture of Experts (MoE) Optimization involves selective activation of model parameters during inference. Rather than using all parameters for every token, MoE architectures employ a gating mechanism that routes different inputs to specialized subsets of model parameters. Optimization of these architectures can involve pruning unused experts, consolidating expert capacity, and quantizing expert-specific parameters, reducing total computational load while maintaining model capability 4).
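The gating mechanism described above can be sketched as a top-k router: each token's gate logits select a small subset of experts, and only those experts execute. This is a minimal illustration with a per-token loop; real implementations batch tokens by expert for efficiency, and the function names here are hypothetical.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Sparse MoE forward pass: each token activates only its top-k experts;
    the remaining experts are never evaluated for that token."""
    logits = x @ gate_w                          # (n_tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]     # indices of the top-k experts
        g = np.exp(logits[t, top] - logits[t, top].max())
        g /= g.sum()                             # softmax over selected gates only
        for weight, e in zip(g, top):
            out[t] += weight * experts[e](x[t])  # weighted expert outputs
    return out
```

The compute saving is the ratio top_k / n_experts: a model can hold many experts' worth of parameters while each token pays for only a few of them, which is exactly what makes expert pruning and expert-specific quantization attractive optimization targets.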
KV Cache Quantization addresses a specific bottleneck in transformer-based language models. During autoregressive text generation, the model must store and reuse key-value pairs from previous tokens, creating a memory-intensive data structure called the KV cache. FP8 KV cache quantization reduces these tensors from FP32 or FP16 precision to 8-bit floating-point format, significantly decreasing memory overhead. Recent implementations demonstrate that FP8 KV cache combined with hybrid attention patterns can enable deployment of very large models on single-GPU systems without substantial quality loss.
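A back-of-the-envelope calculation makes the memory stakes concrete. The cache must hold keys and values for every layer, KV head, and generated token; the model configuration below (80 layers, 8 grouped-query KV heads, head dimension 128, 32k context) is hypothetical but representative of a large GQA model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Per-sequence KV cache size: keys + values (factor of 2) at every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative large-model GQA configuration (hypothetical numbers):
fp16 = kv_cache_bytes(80, 8, 128, 32_768, bytes_per_elem=2)
fp8  = kv_cache_bytes(80, 8, 128, 32_768, bytes_per_elem=1)
print(f"FP16: {fp16 / 2**30:.1f} GiB, FP8: {fp8 / 2**30:.1f} GiB")
# → FP16: 10.0 GiB, FP8: 5.0 GiB per 32k-token sequence
```

Halving the per-sequence cache either doubles the batch size that fits in a fixed memory budget or frees capacity for the model weights themselves, which is why FP8 KV caching features prominently in single-GPU deployment recipes.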
Pruning involves removing weights, neurons, or entire layers determined to be less important for model performance. Structured pruning eliminates entire components (such as attention heads), while unstructured pruning removes individual weights. Post-training pruning can achieve substantial model size reductions with minimal retraining.
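Unstructured magnitude pruning, the simplest variant, can be sketched in a few lines: rank weights by absolute value and zero out the smallest fraction. The function name and threshold strategy are illustrative; in practice pruning is usually followed by fine-tuning, and the resulting sparse tensors only save compute on hardware or kernels that exploit sparsity.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Unstructured pruning: zero the smallest-magnitude fraction of weights."""
    k = int(w.size * sparsity)           # number of weights to remove
    if k == 0:
        return w.copy()
    # Threshold = k-th smallest absolute value; everything at or below it goes.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)
```

Structured pruning differs only in granularity: instead of individual entries, whole rows, columns, or attention-head blocks are scored and removed, which shrinks the dense tensor shapes directly and needs no sparse kernels.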
Model compression techniques enable several important deployment scenarios:
* Edge Deployment: Compressed models can run on mobile devices and edge servers with limited computational capacity and power budgets, enabling on-device inference without cloud connectivity.
* Cost Reduction: Quantized models require less memory bandwidth and computation time, directly reducing operational costs for large-scale inference services through lower energy consumption and hardware requirements.
* Latency Improvement: Reduced model size and simplified arithmetic operations decrease inference latency, critical for latency-sensitive applications like real-time chatbots and interactive systems.
* Single-GPU Deployment: Hybrid approaches combining FP8 KV cache, MoE optimization, and attention pattern engineering have demonstrated viability for deploying models with tens of billions of parameters on individual consumer-grade GPUs.
Compression techniques introduce trade-offs between model size/speed and performance quality. Aggressive quantization can degrade model capabilities, particularly for complex reasoning tasks. The degree of precision loss varies by task—some applications prove robust to low-precision representations while others require higher precision in specific layers or operations.
Calibration quality significantly affects quantized model performance. Post-training quantization requires selecting appropriate scaling factors and quantization ranges, often determined through calibration on representative data samples. Mismatch between calibration distribution and deployment distribution can lead to suboptimal performance.
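A common calibration recipe, sketched below, is percentile clipping: choose the scale from a high percentile of the absolute activation values seen on calibration data, sacrificing rare outliers to give the bulk of the distribution finer resolution. The function names and the 99.9th-percentile default are illustrative choices, not a standard.

```python
import numpy as np

def calibrate_symmetric_scale(calib_activations, percentile=99.9):
    """Pick a symmetric INT8 scale from calibration samples, clipping the
    rare outliers beyond the chosen percentile of |activation|."""
    amax = np.percentile(np.abs(calib_activations), percentile)
    return amax / 127.0

def quantize_with_scale(x, scale):
    """Apply the calibrated scale; values past the clip range saturate."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)
```

The distribution-mismatch risk mentioned above shows up directly here: if deployment activations are wider than the calibration samples, more values saturate at ±127; if they are narrower, the quantization grid is coarser than necessary.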
Dynamic quantization during inference presents additional complexity, as the optimal quantization parameters may vary per token or per layer. Adaptive quantization schemes must balance model quality against computational efficiency, since recomputing scales at runtime adds overhead of its own.
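The per-token case can be sketched as follows: each token (row) gets its own symmetric scale computed on the fly from that token's activation range, so no calibration pass is needed, at the cost of a runtime max-reduction per token. The function name is illustrative.

```python
import numpy as np

def dynamic_quantize_per_token(x):
    """Symmetric per-token INT8 quantization: each row's scale is derived
    at inference time from that row's own maximum absolute value."""
    scales = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-12)                 # guard all-zero rows
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales
```

Because every row uses its full INT8 range, per-token scaling is markedly more robust to activation outliers than a single static scale, which is why it is a popular choice for quantizing transformer activations.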