FP4 Quantization refers to a 4-bit floating-point format employed in modern large language models to reduce memory requirements while limiting the loss of model quality. This quantization scheme represents a specialized approach to model compression, particularly valuable for storing mixture-of-experts (MoE) weights and managing key-value (KV) cache memory in transformer-based architectures. FP4 encoding enables significant reductions in model footprint without proportional degradation in inference quality, making it attractive for deploying large-scale language models in resource-constrained environments.
FP4 quantization uses a 4-bit floating-point representation to encode numerical values across model parameters. This format belongs to the family of sub-byte quantization techniques that compress full-precision (typically FP32 or BF16) weights into lower-bit representations 1). The 4-bit floating-point format allocates bits across sign, exponent, and mantissa components in a manner optimized for neural network weight distributions, which typically exhibit non-uniform value ranges across model layers.
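As a concrete illustration, a common FP4 layout is E2M1: one sign bit, two exponent bits, and one mantissa bit. The sketch below, which assumes that layout with an exponent bias of 1 (the function name is purely illustrative), enumerates the sixteen representable values.

```python
# Illustrative only: decodes 4-bit codes under an assumed E2M1 layout
# (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1).
def decode_e2m1(code: int) -> float:
    sign = -1.0 if (code >> 3) & 0x1 else 1.0
    exponent = (code >> 1) & 0x3
    mantissa = code & 0x1
    if exponent == 0:                          # subnormal: no implicit leading 1
        magnitude = 0.5 * mantissa             # 0.0 or 0.5
    else:                                      # normal: implicit leading 1
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude

if __name__ == "__main__":
    # Prints the full grid of magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6 (signed)
    print(sorted(decode_e2m1(c) for c in range(16)))
```

The coarseness of this grid is why scaling choices, discussed next, matter so much: only sixteen distinct values are available to cover each scaled block of weights.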
The quantization process involves mapping high-precision weights to lower-precision representations through scaling and rounding operations. FP4 formats generally reserve fewer bits for the exponent and mantissa compared to standard IEEE floating-point representations, trading precision for memory efficiency. This design reflects empirical findings that neural networks demonstrate robustness to quantization-induced approximation errors, particularly when gradual layer-wise calibration is performed during conversion 2).
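A minimal sketch of this mapping, assuming block-wise absolute-maximum scaling and round-to-nearest onto the E2M1 value grid (the helper names, block size, and layout are illustrative choices, not a reference implementation):

```python
import numpy as np

# Representable values of an assumed E2M1 FP4 format.
FP4_GRID = np.array([-6, -4, -3, -2, -1.5, -1, -0.5, 0.0,
                      0.5, 1, 1.5, 2, 3, 4, 6])

def quantize_fp4(weights: np.ndarray, block_size: int = 32):
    """Quantize a 1-D weight vector (length assumed divisible by block_size):
    scale each block so its absmax maps to 6, then round every element to the
    nearest FP4 grid value."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 6.0   # per-block scale
    scaled = blocks / np.where(scales == 0, 1.0, scales)
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)  # round to nearest
    return FP4_GRID[idx], scales

def dequantize_fp4(q_blocks: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate full-precision weights from FP4 values and scales."""
    return (q_blocks * scales).reshape(-1)

# Example: the reconstruction error stays small relative to weight magnitudes.
w = np.random.randn(128).astype(np.float32)
q, s = quantize_fp4(w)
print(np.abs(w - dequantize_fp4(q, s)).max())
```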
FP4 quantization demonstrates particular utility in mixture-of-experts (MoE) architectures, which employ multiple specialized neural network pathways (experts) with routing mechanisms that selectively activate a subset of experts for each input token. MoE models require substantial parameter storage because the total parameter count grows with the number of experts even though only a few experts are active for any given token, making compression techniques critical for practical deployment. By applying FP4 quantization to expert weights, practitioners achieve approximately 50% reduction in memory footprint without requiring changes to routing logic or activation functions 3). This compression proves especially valuable when models maintain extensive context windows or require deployment across distributed inference clusters.
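To see why this matters at MoE scale, the back-of-the-envelope calculator below compares expert-weight storage at different precisions. The model dimensions are hypothetical, and the overhead term assumes one 8-bit scale per 32-element block, a common pattern for block-scaled formats.

```python
# Back-of-the-envelope calculator for MoE expert-weight storage.
# The model dimensions below are hypothetical, chosen only for illustration.
def expert_weight_gb(n_experts, d_model, d_ff, bits_per_weight,
                     scale_bits=0, block_size=32):
    """Gigabytes needed for the up/down projections of all experts, plus
    optional per-block scale overhead for block-scaled quantized formats."""
    params = n_experts * 2 * d_model * d_ff            # up + down projections
    total_bits = params * bits_per_weight
    if scale_bits:
        total_bits += (params / block_size) * scale_bits
    return total_bits / 8 / 1e9

cfg = dict(n_experts=64, d_model=4096, d_ff=14336)
for label, bits, scale_bits in [("BF16", 16, 0), ("FP8", 8, 8), ("FP4", 4, 8)]:
    gb = expert_weight_gb(**cfg, bits_per_weight=bits, scale_bits=scale_bits)
    print(f"{label}: {gb:.1f} GB")
```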
The key-value (KV) cache—which stores computed attention keys and values across previous tokens to accelerate sequential token generation—represents a significant memory bottleneck in transformer inference. At large batch sizes or extended context windows, KV cache memory consumption can exceed model weight storage requirements. FP4 quantization applied to KV cache entries reduces this memory demand substantially while maintaining attention computation accuracy. The reduced cache size enables longer context windows, larger batch sizes, or deployment on hardware with tighter memory constraints.
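A rough sizing formula illustrates the scale of the problem. The model dimensions below are hypothetical, and the overhead term again assumes one 8-bit scale per 32-element block.

```python
# Rough KV cache sizing: keys and values stored for every layer, KV head,
# token, and batch element. Model dimensions are hypothetical.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch,
                bits_per_elem, scale_bits=0, block_size=32):
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch  # K and V
    total_bits = elems * bits_per_elem
    if scale_bits:
        total_bits += (elems / block_size) * scale_bits
    return total_bits / 8 / 1e9

cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=65_536, batch=4)
print(f"FP16 cache: {kv_cache_gb(**cfg, bits_per_elem=16):.1f} GB")
print(f"FP4 cache:  {kv_cache_gb(**cfg, bits_per_elem=4, scale_bits=8):.1f} GB")
```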
The effectiveness of FP4 quantization for the KV cache relies on the empirical observation that attention mechanisms tolerate quantization noise in key and value representations better than they tolerate comparable noise in input projection layers. Systematic studies of post-training quantization demonstrate that late-layer attention computations remain robust at sub-byte precision when quantization parameters are calibrated on representative data 4).
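A sketch of what such calibration might look like, assuming per-channel statistics collected from a handful of representative activation batches (the percentile clipping and all names are illustrative choices):

```python
import numpy as np

FP4_MAX = 6.0  # largest magnitude representable on an assumed E2M1 grid

def calibrate_scales(samples, percentile=99.9):
    """Derive one quantization scale per channel from calibration data.
    Using a high percentile instead of the raw maximum limits the influence
    of rare outliers on the chosen range."""
    stacked = np.concatenate(samples, axis=0)          # (total_tokens, channels)
    per_channel = np.percentile(np.abs(stacked), percentile, axis=0)
    return per_channel / FP4_MAX

# Calibration batches should resemble the deployment data distribution;
# random data here is only a stand-in for the example.
calib_batches = [np.random.randn(512, 1024).astype(np.float32) for _ in range(8)]
scales = calibrate_scales(calib_batches)
print(scales.shape)  # one scale per channel: (1024,)
```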
The practical benefits of FP4 quantization depend on hardware support for efficient sub-byte floating-point operations. Modern inference accelerators, together with quantization-aware processing units and optimized kernel implementations, provide native or near-native throughput for 4-bit operations. However, systems lacking such support must typically dequantize values on the fly, incurring computational overhead that partially offsets the memory savings. Quantization-aware training, where models are trained with quantization operations simulated in the forward pass, produces superior results compared to post-training quantization alone, though it requires significant computational investment 5).
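As an illustration of the last point, quantization-aware training usually inserts "fake quantization" into the forward pass and routes gradients around the non-differentiable rounding step with a straight-through estimator. The PyTorch sketch below assumes the E2M1 grid and a simple per-tensor scale; it is a minimal example, not a production training recipe.

```python
import torch

# Representable values of an assumed E2M1 FP4 format.
FP4_GRID = torch.tensor([-6, -4, -3, -2, -1.5, -1, -0.5, 0.0,
                          0.5, 1, 1.5, 2, 3, 4, 6])

def fake_quant_fp4(w: torch.Tensor) -> torch.Tensor:
    """Simulate FP4 quantization in the forward pass while letting gradients
    flow unchanged through the rounding step (straight-through estimator)."""
    scale = w.detach().abs().max() / 6.0                   # per-tensor scale
    scaled = w / scale
    idx = (scaled.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)  # nearest value
    q = FP4_GRID[idx] * scale
    # Forward pass uses the quantized values; backward pass sees the identity.
    return w + (q - w).detach()

w = torch.randn(256, 256, requires_grad=True)
loss = fake_quant_fp4(w).pow(2).sum()
loss.backward()                      # gradients reach w despite the rounding
print(w.grad.abs().mean())
```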
FP4 quantization introduces several technical challenges. First, quantization errors can compound across iterative computations or multi-step chain-of-thought reasoning sequences. Second, certain model architectures prove more amenable to 4-bit quantization than others; models with skewed or outlier-heavy weight distributions may require layer-specific bit-width adjustments. Third, quantization calibration requires representative data samples that accurately reflect deployment distributions, and miscalibration substantially degrades model performance. Additionally, not all hardware platforms provide efficient execution of 4-bit floating-point operations, potentially limiting practical deployment scenarios.