SRAM-centric chips integrate large amounts of high-speed static RAM (SRAM) directly with compute logic, on-die or via chiplets, minimizing data movement for low-latency AI inference. This contrasts with DRAM/HBM-based designs, which rely on external high-bandwidth memory stacks and pay a latency penalty on every off-chip access.
SRAM-centric designs prioritize low-latency inference over raw capacity by embedding SRAM alongside compute logic: on-chip SRAM reaches bandwidths of up to 150 TB/s, versus 2-8 TB/s for external HBM stacks.
The trade-off is capacity: SRAM is density-limited (typically 256 MB to 44 GB per chip), requiring chiplet pooling or external LPDDR for larger models.
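As a back-of-envelope illustration of this capacity trade-off, a short sketch checks whether a model's quantized weights fit in a chip's SRAM budget. The SRAM capacities are the ones quoted above; the model sizes and quantization choices are hypothetical examples.

```python
def fits_in_sram(params_billion: float, bytes_per_param: float, sram_gb: float) -> bool:
    """Return True if the quantized weights fit in the on-chip SRAM budget.

    weight size (GB) = params * bytes per param / 1e9
    """
    weight_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return weight_gb <= sram_gb

# An 8B-parameter model at INT8 (1 byte/param) needs ~8 GB of weights:
print(fits_in_sram(8, 1.0, 44))     # 44 GB SRAM-class chip -> True
print(fits_in_sram(8, 1.0, 0.256))  # 256 MB chip -> False
print(fits_in_sram(70, 2.0, 44))    # 70B at FP16 (~140 GB) -> False; needs pooling or LPDDR
```

Models that fail this check are exactly the cases where chiplet pooling or external LPDDR becomes necessary.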
| Aspect | SRAM-Centric | DRAM/HBM-Based |
|---|---|---|
| Memory Bandwidth | Up to 150 TB/s (on-chip) | 2-8 TB/s (external stacks) |
| Capacity | Limited (256 MB - 44 GB/chip) | High (hundreds of GB with HBM4) |
| Latency | Ultra-low (no off-chip fetches) | Higher due to memory wall |
| Power/Cost | Lower for inference workloads | Power-hungry, costly HBM integration |
| Best For | Low-latency enterprise inference (RAG, real-time) | High-throughput training, large models |
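To make the bandwidth gap in the table concrete: autoregressive decode is typically memory-bound, so a lower bound on per-token latency is bytes moved per token divided by memory bandwidth. The sketch below uses the bandwidth figures from the table; the 8 GB weight footprint is an illustrative assumption (an 8B-parameter model at INT8).

```python
def token_latency_ms(weight_bytes: float, bandwidth_tb_s: float) -> float:
    """Memory-bound lower bound on per-token decode latency, in milliseconds.

    Assumes all weights are streamed once per generated token.
    """
    return weight_bytes / (bandwidth_tb_s * 1e12) * 1e3

weights = 8e9  # ~8 GB of INT8 weights, illustrative
print(f"SRAM @ 150 TB/s: {token_latency_ms(weights, 150):.3f} ms")  # ~0.053 ms
print(f"HBM  @   3 TB/s: {token_latency_ms(weights, 3):.3f} ms")    # ~2.667 ms
```

On these assumptions the on-chip design is roughly 50x faster per token, which is the latency advantage the table summarizes.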
On-chip SRAM eliminates the “memory wall” — the bottleneck where processors wait for data from external memory. This is critical for: