====== SRAM-Centric Chips for Enterprise Inference ======

**SRAM-centric chips** integrate large amounts of high-speed static RAM (SRAM) directly with compute logic, either on-die or via chiplets, minimizing data movement for low-latency AI inference. This approach contrasts with DRAM/HBM-based designs, which rely on external high-bandwidth memory stacks and incur latency on every off-chip access. ((Source: [[https://thedataexchange.media/sid-sheth-d-matrix/|The Data Exchange — d-Matrix]]))

===== Architecture =====

SRAM-centric designs prioritize **low-latency inference** over raw capacity by embedding SRAM alongside compute logic. On-chip SRAM achieves bandwidths of up to 150 TB/s, compared to 2-8 TB/s for external HBM stacks. ((Source: [[https://www.viksnewsletter.com/p/d-matrix-in-memory-compute|Vik's Newsletter — d-Matrix In-Memory Compute]])) The trade-off is capacity: SRAM is density-limited (typically 256 MB to 44 GB per chip), so larger models require chiplet pooling or external LPDDR.

===== SRAM vs HBM Comparison =====

^ Aspect ^ SRAM-Centric ^ DRAM/HBM-Based ^
| Memory Bandwidth | Up to 150 TB/s (on-chip) | 2-8 TB/s (external stacks) |
| Capacity | Limited (256 MB - 44 GB per chip) | High (hundreds of GB with HBM4) |
| Latency | Ultra-low (no off-chip fetches) | Higher due to the memory wall |
| Power/Cost | Lower for inference workloads | Power-hungry, costly HBM integration |
| Best For | Low-latency enterprise inference (RAG, real-time) | High-throughput training, large models |

===== Companies =====

  * **d-Matrix** — Leads with DIMC (Digital In-Memory Compute), weaving SRAM into the compute logic.
    * The Corsair platform pairs four chiplets holding 1 GB of total SRAM with off-chip LPDDR5, with claimed inference 10x faster and 3x cheaper than GPUs for enterprise workloads. ((Source: [[https://www.viksnewsletter.com/p/d-matrix-in-memory-compute|Vik's Newsletter — d-Matrix In-Memory Compute]]))
  * **Cerebras** — Early pioneer of SRAM-heavy wafer-scale engines storing full models in on-chip memory; later added external memory to keep pace with growing LLM sizes. ((Source: [[https://thedataexchange.media/sid-sheth-d-matrix/|The Data Exchange — d-Matrix]]))
  * **Groq** — Pioneered the SRAM-based LPU for inference, supplemented with external memory for scale. ((Source: [[https://thedataexchange.media/sid-sheth-d-matrix/|The Data Exchange — d-Matrix]]))
  * **SambaNova** — SN40/SN50 chips with massive on-chip SRAM (up to 44 GB in some configurations) and 1 TB/s bandwidth, targeting enterprise RAG with a reconfigurable dataflow architecture. ((Source: [[https://intuitionlabs.ai/articles/llm-inference-hardware-enterprise-guide|Intuition Labs — LLM Inference Hardware Guide]]))
  * **Marvell** — Advancing dense SRAM IP at TSMC 2 nm with die-to-die links for custom AI silicon. ((Source: [[https://www.servethehome.com/marvell-shows-dense-sram-custom-hbm-and-cxl-with-arm-compute-at-hot-chips-2025/|ServeTheHome — Marvell Hot Chips 2025]]))

===== Advantages for Low-Latency Inference =====

On-chip SRAM eliminates the "memory wall" — the bottleneck where processors stall waiting for data from external memory.
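The effect of the bandwidth gap on decode throughput can be sketched with a back-of-the-envelope roofline estimate: in memory-bound decode (batch size 1), each generated token streams the model weights through the compute units once, so tokens per second are capped at bandwidth divided by model size in bytes. The model size, quantization, and bandwidth figures below are illustrative assumptions, not vendor measurements:

```python
# Rough, illustrative upper bound on memory-bound decode throughput.
# During LLM decode, generating one token requires reading all model
# weights, so tokens/s <= memory_bandwidth / bytes_per_token.

def max_decode_tokens_per_sec(model_params: float, bytes_per_param: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/s when decode is memory-bandwidth bound."""
    bytes_per_token = model_params * bytes_per_param  # weights read per token
    return bandwidth_bytes_per_sec / bytes_per_token

# Hypothetical 7B-parameter model quantized to 1 byte per weight (int8).
params = 7e9
hbm = max_decode_tokens_per_sec(params, 1.0, 5e12)     # ~5 TB/s HBM stack
sram = max_decode_tokens_per_sec(params, 1.0, 150e12)  # ~150 TB/s on-chip SRAM

print(f"HBM-bound decode ceiling:  ~{hbm:,.0f} tokens/s")
print(f"SRAM-bound decode ceiling: ~{sram:,.0f} tokens/s")
```

Under these assumptions the ceiling scales linearly with bandwidth, so a 30x bandwidth advantage translates into a 30x higher decode ceiling; real systems fall below these bounds due to compute limits, KV-cache traffic, and batching.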
Eliminating this bottleneck is critical for:

  * **LLM decode phase** — The latency-critical token-generation step, where SRAM's low-latency access directly improves tokens per second
  * **Real-time applications** — RAG-based enterprise assistants, financial trading, autonomous systems
  * **Cost efficiency** — SRAM integration avoids expensive HBM packaging

==== Limitations ====

  * SRAM density scaling has stalled at advanced nodes, so models beyond roughly 20B parameters require chiplet pooling
  * LPDDR supplementation introduces latency penalties for larger contexts
  * Less flexible than GPU-based systems for diverse workload mixes ((Source: [[https://www.tomshardware.com/tech-industry/nvidia-ceo-jensen-huang-makes-the-case-against-optimizing-ai-hardware-too-narrowly-at-ces|Tom's Hardware — Jensen Huang on AI Hardware]]))

===== See Also =====

  * [[hbm4e_memory|HBM4E Memory]]
  * [[analog_ai_chip|Analog AI Chips]]
  * [[ai_native_chiplet|AI-Native Chiplet Architecture]]

===== References =====