SRAM-Centric Chips for Enterprise Inference

SRAM-centric chips integrate large amounts of high-speed Static RAM (SRAM) directly with compute logic on-chip or via chiplets, minimizing data movement for low-latency AI inference. This approach contrasts with DRAM/HBM-based designs that rely on external high-bandwidth memory stacks, which introduce latency from off-chip access. ¹⁾

Architecture

SRAM-centric designs prioritize low-latency inference over massive capacity by embedding SRAM alongside compute logic. On-chip SRAM achieves bandwidths of up to 150 TB/s, compared to 2-8 TB/s for external HBM stacks. ²⁾

The trade-off is capacity: SRAM is density-limited (typically 256 MB to 44 GB per chip), requiring chiplet pooling or external LPDDR for larger models.

SRAM vs HBM Comparison

Aspect	SRAM-Centric	DRAM/HBM-Based
Memory Bandwidth	Up to 150 TB/s (on-chip)	2-8 TB/s (external stacks)
Capacity	Limited (256 MB - 44 GB/chip)	High (hundreds of GB with HBM4)
Latency	Ultra-low (no off-chip fetches)	Higher due to memory wall
Power/Cost	Lower for inference workloads	Power-hungry, costly HBM integration
Best For	Low-latency enterprise inference (RAG, real-time)	High-throughput training, large models

Companies

d-Matrix — Leads with DIMC (Digital In-Memory Compute) using SRAM-woven logic. The Corsair platform uses 4 chiplets with 1 GB total SRAM plus LPDDR5 off-chip memory, delivering 10x faster and 3x cheaper inference than GPUs for enterprise workloads ³⁾
Cerebras — Early pioneer of SRAM-heavy wafer-scale engines with on-chip memory for full model storage; later added external memory for growing LLM sizes ⁴⁾
Groq — Pioneered SRAM-based LPU for inference but supplemented with external memory for scale ⁵⁾
SambaNova — SN40/SN50 chips with massive on-chip SRAM (up to 44 GB in some configurations) and 1 TB/s bandwidth for enterprise RAG with reconfigurable dataflow ⁶⁾
Marvell — Advancing dense SRAM IP at TSMC 2nm with die-to-die links for custom AI ⁷⁾

Advantages for Low-Latency Inference

On-chip SRAM eliminates the “memory wall” — the bottleneck where processors wait for data from external memory. This is critical for:

LLM decode phase — The latency-critical token generation step where SRAM's low-latency access directly improves tokens-per-second
Real-time applications — RAG-based enterprise assistants, financial trading, autonomous systems
Cost efficiency — SRAM integration avoids expensive HBM packaging

Limitations

SRAM density scaling has stalled at advanced nodes, requiring chiplet pooling for models beyond ~20B parameters
LPDDR supplementation introduces latency penalties for larger contexts
Less flexible than GPU-based systems for diverse workload mixes ⁸⁾

References

¹⁾ , ⁴⁾ , ⁵⁾

Source: The Data Exchange — d-Matrix

²⁾ , ³⁾

Source: Vik's Newsletter — d-Matrix In-Memory Compute

⁶⁾

Source: Intuition Labs — LLM Inference Hardware Guide

⁷⁾

Source: ServeTheHome — Marvell Hot Chips 2025

⁸⁾

Source: Tom's Hardware — Jensen Huang on AI Hardware

AI Agent Knowledge Base

Sidebar

Table of Contents

SRAM-Centric Chips for Enterprise Inference

Architecture

SRAM vs HBM Comparison

Companies

Advantages for Low-Latency Inference

Limitations

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

SRAM-Centric Chips for Enterprise Inference

Architecture

SRAM vs HBM Comparison

Companies

Advantages for Low-Latency Inference

Limitations

See Also

References

Page Tools