====== SRAM-Centric Chips for Enterprise Inference ======

**SRAM-centric chips** integrate large amounts of high-speed static RAM (SRAM) directly with compute logic, either on-die or via chiplets, minimizing data movement for low-latency AI inference. This approach contrasts with DRAM/HBM-based designs, which rely on external high-bandwidth memory stacks and incur latency on every off-chip access. ((Source: [[https://thedataexchange.media/sid-sheth-d-matrix/|The Data Exchange — d-Matrix]]))

===== Architecture =====

SRAM-centric designs prioritize **low-latency inference** over raw capacity by embedding SRAM alongside compute logic. On-chip SRAM achieves bandwidths of up to 150 TB/s, compared to 2-8 TB/s for external HBM stacks. ((Source: [[https://www.viksnewsletter.com/p/d-matrix-in-memory-compute|Vik's Newsletter — d-Matrix In-Memory Compute]])) The trade-off is capacity: SRAM is density-limited (typically 256 MB to 44 GB per chip), so larger models require chiplet pooling or external LPDDR.

===== SRAM vs HBM Comparison =====

^ Aspect ^ SRAM-Centric ^ DRAM/HBM-Based ^
| Memory Bandwidth | Up to 150 TB/s (on-chip) | 2-8 TB/s (external stacks) |
| Capacity | Limited (256 MB - 44 GB per chip) | High (hundreds of GB with HBM4) |
| Latency | Ultra-low (no off-chip fetches) | Higher due to the memory wall |
| Power/Cost | Lower for inference workloads | Power-hungry, costly HBM integration |
| Best For | Low-latency enterprise inference (RAG, real-time) | High-throughput training, large models |

===== Companies =====

  * **d-Matrix** — Leads with DIMC (Digital In-Memory Compute), weaving SRAM into the compute logic.
    * The Corsair platform pairs four chiplets holding 1 GB of total SRAM with off-chip LPDDR5, with claimed inference 10x faster and 3x cheaper than GPUs for enterprise workloads. ((Source: [[https://www.viksnewsletter.com/p/d-matrix-in-memory-compute|Vik's Newsletter — d-Matrix In-Memory Compute]]))
  * **Cerebras** — Early pioneer of SRAM-heavy wafer-scale engines storing full models in on-chip memory; later added external memory to keep pace with growing LLM sizes. ((Source: [[https://thedataexchange.media/sid-sheth-d-matrix/|The Data Exchange — d-Matrix]]))
  * **Groq** — Pioneered the SRAM-based LPU for inference, supplemented with external memory for scale. ((Source: [[https://thedataexchange.media/sid-sheth-d-matrix/|The Data Exchange — d-Matrix]]))
  * **SambaNova** — SN40/SN50 chips with massive on-chip SRAM (up to 44 GB in some configurations) and 1 TB/s bandwidth, targeting enterprise RAG with a reconfigurable dataflow architecture. ((Source: [[https://intuitionlabs.ai/articles/llm-inference-hardware-enterprise-guide|Intuition Labs — LLM Inference Hardware Guide]]))
  * **Marvell** — Advancing dense SRAM IP at TSMC 2 nm with die-to-die links for custom AI silicon. ((Source: [[https://www.servethehome.com/marvell-shows-dense-sram-custom-hbm-and-cxl-with-arm-compute-at-hot-chips-2025/|ServeTheHome — Marvell Hot Chips 2025]]))

===== Advantages for Low-Latency Inference =====

On-chip SRAM eliminates the "memory wall" — the bottleneck where processors stall waiting for data from external memory.
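The effect of the bandwidth gap on decode throughput can be sketched with a back-of-the-envelope roofline estimate: in memory-bound decode (batch size 1), each generated token streams the model weights through the compute units once, so tokens per second are capped at bandwidth divided by model size in bytes. The model size, quantization, and bandwidth figures below are illustrative assumptions, not vendor measurements:

```python
# Rough, illustrative upper bound on memory-bound decode throughput.
# During LLM decode, generating one token requires reading all model
# weights, so tokens/s <= memory_bandwidth / bytes_per_token.

def max_decode_tokens_per_sec(model_params: float, bytes_per_param: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/s when decode is memory-bandwidth bound."""
    bytes_per_token = model_params * bytes_per_param  # weights read per token
    return bandwidth_bytes_per_sec / bytes_per_token

# Hypothetical 7B-parameter model quantized to 1 byte per weight (int8).
params = 7e9
hbm = max_decode_tokens_per_sec(params, 1.0, 5e12)     # ~5 TB/s HBM stack
sram = max_decode_tokens_per_sec(params, 1.0, 150e12)  # ~150 TB/s on-chip SRAM

print(f"HBM-bound decode ceiling:  ~{hbm:,.0f} tokens/s")
print(f"SRAM-bound decode ceiling: ~{sram:,.0f} tokens/s")
```

Under these assumptions the ceiling scales linearly with bandwidth, so a 30x bandwidth advantage translates into a 30x higher decode ceiling; real systems fall below these bounds due to compute limits, KV-cache traffic, and batching.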
Eliminating this bottleneck is critical for:

  * **LLM decode phase** — The latency-critical token-generation step, where SRAM's low-latency access directly improves tokens per second
  * **Real-time applications** — RAG-based enterprise assistants, financial trading, autonomous systems
  * **Cost efficiency** — SRAM integration avoids expensive HBM packaging

==== Limitations ====

  * SRAM density scaling has stalled at advanced nodes, so models beyond roughly 20B parameters require chiplet pooling
  * LPDDR supplementation introduces latency penalties for larger contexts
  * Less flexible than GPU-based systems for diverse workload mixes ((Source: [[https://www.tomshardware.com/tech-industry/nvidia-ceo-jensen-huang-makes-the-case-against-optimizing-ai-hardware-too-narrowly-at-ces|Tom's Hardware — Jensen Huang on AI Hardware]]))

===== See Also =====

  * [[hbm4e_memory|HBM4E Memory]]
  * [[analog_ai_chip|Analog AI Chips]]
  * [[ai_native_chiplet|AI-Native Chiplet Architecture]]

===== References =====