SRAM-Centric Chips for Enterprise Inference

SRAM-centric chips integrate large amounts of high-speed Static RAM (SRAM) directly with compute logic on-chip or via chiplets, minimizing data movement for low-latency AI inference. This approach contrasts with DRAM/HBM-based designs that rely on external high-bandwidth memory stacks, which introduce latency from off-chip access. 1)

Architecture

SRAM-centric designs prioritize low-latency inference over massive capacity by embedding SRAM alongside compute logic. On-chip SRAM achieves bandwidths of up to 150 TB/s, compared to 2-8 TB/s for external HBM stacks. 2)

The trade-off is capacity: SRAM is density-limited (typically 256 MB to 44 GB per chip), requiring chiplet pooling or external LPDDR for larger models.
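As a rough illustration of the capacity trade-off, the sketch below estimates how many chips' worth of SRAM would be needed to hold a model's weights entirely on-chip, which is the point at which chiplet pooling becomes necessary. The 44 GB per-chip figure is the upper end of the range cited above; the model sizes and 8-bit quantization are illustrative assumptions, not vendor specifications.

```python
import math

# Illustrative per-chip SRAM capacity (upper end of the 256 MB - 44 GB
# range cited above); real parts vary widely.
SRAM_PER_CHIP_GB = 44

def chips_needed(params_billion: float, bytes_per_param: float = 1.0) -> int:
    """Chips required to hold all weights in on-chip SRAM.

    bytes_per_param = 1.0 assumes 8-bit quantized weights (an assumption
    for illustration; precision varies by deployment).
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param ~ GB
    return math.ceil(weights_gb / SRAM_PER_CHIP_GB)

for model, size_b in [("7B", 7), ("70B", 70), ("405B", 405)]:
    print(f"{model}: {chips_needed(size_b)} chip(s) of {SRAM_PER_CHIP_GB} GB SRAM")
```

A 7B model fits on a single large-SRAM chip at 8-bit precision, while a 405B model would need roughly ten pooled chiplets, which is why larger models fall back on external LPDDR.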

SRAM vs HBM Comparison

Aspect | SRAM-Centric | DRAM/HBM-Based
Memory Bandwidth | Up to 150 TB/s (on-chip) | 2-8 TB/s (external stacks)
Capacity | Limited (256 MB - 44 GB/chip) | High (hundreds of GB with HBM4)
Latency | Ultra-low (no off-chip fetches) | Higher due to memory wall
Power/Cost | Lower for inference workloads | Power-hungry, costly HBM integration
Best For | Low-latency enterprise inference (RAG, real-time) | High-throughput training, large models
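The bandwidth gap above translates directly into a lower bound on per-token latency for memory-bound decoding, since each generated token must stream the active weights from memory once. A back-of-envelope sketch, using bandwidth figures from the comparison (150 TB/s on-chip vs. a representative 3 TB/s HBM stack); the 14 GB model footprint is an illustrative assumption:

```python
# Minimum time to stream W bytes of weights once per generated token:
#   t = bytes / bandwidth
# This ignores compute time, which is usually not the bottleneck for
# batch-1 autoregressive decoding.

def min_token_latency_us(weights_gb: float, bandwidth_tb_s: float) -> float:
    """Lower-bound per-token latency in microseconds."""
    return (weights_gb * 1e9) / (bandwidth_tb_s * 1e12) * 1e6

weights_gb = 14  # e.g. a 7B-parameter model at 16-bit precision (assumption)
for name, bw in [("on-chip SRAM (150 TB/s)", 150), ("HBM stack (3 TB/s)", 3)]:
    print(f"{name}: >= {min_token_latency_us(weights_gb, bw):.0f} us/token")
```

Under these assumptions the on-chip design's floor is tens of microseconds per token versus several milliseconds for the HBM path, which is the latency advantage the table summarizes.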

Companies

Advantages for Low-Latency Inference

On-chip SRAM eliminates the “memory wall”, the bottleneck where processors stall while waiting for data to arrive from external memory. This is critical for latency-sensitive enterprise workloads such as retrieval-augmented generation (RAG) and real-time inference.

Limitations

See Also

References