Memory Systems in AI and Machine Learning Workloads

AI and machine learning workloads impose memory demands that differ fundamentally from those of conventional enterprise software — in scale, access pattern, and latency sensitivity. This page covers the classification of memory subsystems used across training and inference pipelines, the mechanisms by which those subsystems constrain or enable model performance, the scenarios where specific configurations are deployed, and the architectural decision boundaries that govern hardware selection. The memory systems landscape for AI workloads spans DRAM, HBM, flash, and distributed fabric technologies, each occupying a distinct performance and cost tier.


Definition and scope

Memory systems in AI and machine learning workloads encompass every layer of the storage hierarchy that holds model parameters, activations, gradients, and dataset batches during computation. The scope extends from on-chip SRAM in tensor processing units (TPUs) through high-bandwidth memory (HBM) stacked on GPU dies, to host DRAM, NVMe-backed virtual memory, and distributed memory fabrics spanning multi-node clusters.

The IEEE Computer Society and JEDEC Solid State Technology Association publish the primary standards governing memory interfaces used in AI accelerators. JEDEC's HBM3 specification (JESD238A) defines the electrical and protocol requirements for the HBM stacks used in accelerators such as the Nvidia H100 and AMD MI300X. The specification supports per-pin data rates of up to 6.4 Gb/s, roughly 819 GB/s per stack, bandwidth that matters because large language model (LLM) inference at small batch sizes is typically memory-bandwidth-bound rather than compute-bound.

NIST's memory taxonomy, referenced in NIST SP 800-193 (Platform Firmware Resiliency Guidelines), distinguishes volatile from non-volatile memory at the hardware boundary, a distinction with direct consequences for checkpoint persistence and failure recovery in training runs.

Understanding how memory hierarchy structures interact with AI accelerator pipelines is a prerequisite for evaluating the capacity and throughput claims vendors publish in product datasheets.


How it works

AI training workloads cycle through four memory-intensive phases: data ingestion, forward pass, backward pass (gradient computation), and optimizer state update. Each phase places distinct pressure on different memory tiers.

Phase-by-phase memory demand:

  1. Data ingestion — input batches are staged in host DRAM and transferred via PCIe or NVLink to device memory. A PCIe 5.0 x16 slot provides roughly 128 GB/s of bidirectional bandwidth (about 64 GB/s per direction); NVLink 4.0 provides 900 GB/s of aggregate bandwidth per GPU.
  2. Forward pass — activations accumulate in HBM. A 70-billion-parameter transformer requires approximately 140 GB of HBM just to hold its parameters in BF16 precision, exceeding the 80 GB HBM3 capacity of a single H100 SXM5.
  3. Backward pass — gradients roughly double peak HBM occupancy before optimizer states are applied, necessitating techniques such as gradient (activation) checkpointing, which discards intermediate activations and recomputes them during the backward pass, trading recomputation cycles for memory headroom.
  4. Optimizer state update — the Adam optimizer maintains two additional FP32 tensors per parameter (first and second moments), multiplying memory demand by approximately 4× relative to the parameter footprint alone; a rough accounting sketch follows this list.
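
The per-phase figures can be reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative and assumes one common mixed-precision recipe (BF16 weights and gradients plus an FP32 master copy and FP32 Adam moments); under pure FP32 training, gradients and optimizer state together roughly quadruple the footprint relative to the parameters alone, which is where the 4× figure above comes from.

    GB = 1e9  # decimal gigabytes, matching the convention in vendor datasheets

    def training_footprint_gb(num_params: float) -> dict:
        """Approximate per-replica HBM demand for mixed-precision training:
        BF16 weights/gradients, FP32 master weights, FP32 Adam moments.
        Activations are omitted (they scale with batch size and sequence length)."""
        params = num_params * 2       # BF16 weights held for the forward pass
        grads = num_params * 2        # BF16 gradients from the backward pass
        master = num_params * 4       # FP32 master copy used by the optimizer
        moments = num_params * 8      # FP32 first and second Adam moments
        return {
            "parameters": params / GB,
            "gradients": grads / GB,
            "optimizer state": (master + moments) / GB,
            "total": (params + grads + master + moments) / GB,
        }

    for name, gb in training_footprint_gb(70e9).items():
        print(f"{name:>15}: {gb:7.0f} GB")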

Memory bandwidth and latency characteristics at each tier directly determine whether the accelerator is compute-bound or memory-bound during any given phase. The roofline model, a standard analytical framework documented in research published through Lawrence Berkeley National Laboratory's Computational Research Division, plots attainable performance against arithmetic intensity (operations per byte of memory traffic) to identify the binding constraint.
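
As a concrete illustration of roofline analysis, the sketch below classifies a single matrix multiplication as compute-bound or memory-bound. The peak throughput and bandwidth figures are assumed values for a hypothetical HBM-class accelerator, not vendor specifications.

    def roofline_bound(flops: float, bytes_moved: float,
                       peak_flops: float, peak_bw: float) -> str:
        """Classify a kernel by comparing its arithmetic intensity
        (FLOPs per byte of memory traffic) with the device balance point."""
        intensity = flops / bytes_moved          # FLOP per byte
        ridge_point = peak_flops / peak_bw       # device balance point
        attainable = min(peak_flops, intensity * peak_bw)
        kind = "compute-bound" if intensity >= ridge_point else "memory-bound"
        return f"{kind}: {intensity:.0f} FLOP/B, attainable {attainable / 1e12:.0f} TFLOP/s"

    # A square BF16 GEMM: C = A @ B with M = N = K = 4096
    M = N = K = 4096
    flops = 2 * M * N * K                         # one multiply-add per element pair
    bytes_moved = 2 * (M * K + K * N + M * N)     # read A and B, write C (2 bytes/value)
    print(roofline_bound(flops, bytes_moved,
                         peak_flops=1.0e15,       # assumed ~1 PFLOP/s BF16 peak
                         peak_bw=3.35e12))        # assumed ~3.35 TB/s HBM bandwidth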


Common scenarios

Three deployment scenarios account for the majority of AI memory system designs:

Large-scale distributed training uses tensor parallelism and pipeline parallelism across dozens to thousands of accelerators. Memory is distributed across nodes via RDMA-capable interconnects (InfiniBand HDR at 200 Gb/s or NDR at 400 Gb/s). Each node's HBM holds a shard of the model; distributed memory systems coordinate parameter synchronization through collective communication libraries such as NCCL.
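
A minimal sketch of how tensor and pipeline parallelism divide the parameter shard held in each accelerator's HBM; the parallelism degrees are illustrative assumptions, and production systems typically also shard gradients and optimizer state (for example with ZeRO-style partitioning).

    def shard_size_gb(num_params: float, bytes_per_param: int,
                      tensor_parallel: int, pipeline_parallel: int) -> float:
        """Parameter bytes resident in one GPU's HBM when layers are split
        across pipeline stages and each layer is split across tensor ranks."""
        return num_params * bytes_per_param / (tensor_parallel * pipeline_parallel) / 1e9

    # 70B parameters in BF16 across 32 GPUs (8-way tensor x 4-way pipeline parallel)
    print(f"{shard_size_gb(70e9, 2, tensor_parallel=8, pipeline_parallel=4):.1f} GB per GPU")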

Single-node inference serving concentrates memory pressure on KV-cache management. For a context window of 128,000 tokens with a 70-billion-parameter model in FP16, the KV-cache alone can consume tens to hundreds of gigabytes of HBM per concurrent request, depending on layer count and attention-head configuration, making memory-management techniques such as paged attention (described in the vLLM research from UC Berkeley, 2023) critical to throughput.
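
The KV-cache figure depends strongly on the attention configuration, which is why it is quoted as a range above. The sketch below makes the arithmetic explicit; the layer count, head counts, and head dimension are assumed values for illustration rather than the configuration of any particular published model.

    def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                    head_dim: int, bytes_per_value: int = 2) -> float:
        """Bytes of keys plus values cached across all layers for one sequence."""
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
        return tokens * per_token / 1e9

    # 128k-token context, 80 layers, head_dim 128, FP16 values
    print(f"full multi-head (64 KV heads): {kv_cache_gb(128_000, 80, 64, 128):.0f} GB")
    print(f"grouped-query   ( 8 KV heads): {kv_cache_gb(128_000, 80, 8, 128):.0f} GB")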

Edge inference on embedded accelerators operates under hard power and thermal envelopes of 5–25W. These deployments rely on quantized models (INT8 or INT4) stored in LPDDR5 or on-chip SRAM, with model footprints compressed from tens of gigabytes to under 4 GB. Memory systems in embedded computing govern the design constraints for this class of deployment.
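
The footprint arithmetic behind that compression is simple; the 7-billion-parameter count below is an illustrative assumption for an edge-deployable model.

    def model_footprint_gb(num_params: float, bits_per_param: int) -> float:
        """Raw parameter storage, ignoring per-block quantization metadata."""
        return num_params * bits_per_param / 8 / 1e9

    for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        print(f"{label}: {model_footprint_gb(7e9, bits):.2f} GB")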


Decision boundaries

Hardware and software architects encounter four primary decision boundaries when specifying memory systems for AI workloads:

HBM versus GDDR7 — HBM3 delivers higher bandwidth per watt and far higher aggregate bandwidth (roughly 3.35 TB/s across the five active stacks of an H100 SXM5, and over 5 TB/s across the eight stacks of an MI300X) but at 4–6× higher unit cost than GDDR7. GDDR7, standardized under JEDEC JESD239, targets 32 Gb/s per pin, making it viable for inference-focused deployments where bandwidth requirements are moderate and cost-per-unit matters.

On-device versus offloaded storage — When model parameters exceed HBM capacity, virtual memory systems backed by NVMe can extend addressable memory at a latency cost of 50–200 µs per page fault, versus 100–300 ns for HBM access. This trade-off determines whether a model can run on a single node or requires multi-node partitioning.
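
A simple expected-latency model, using the latency ranges quoted above, shows why even a small page-fault rate dominates average access time when parameters spill to NVMe; the hit rates are illustrative assumptions.

    def average_access_ns(hbm_hit_rate: float,
                          hbm_latency_ns: float = 300.0,
                          nvme_fault_latency_ns: float = 100_000.0) -> float:
        """Expected latency per access for a given fraction served from HBM."""
        return (hbm_hit_rate * hbm_latency_ns
                + (1.0 - hbm_hit_rate) * nvme_fault_latency_ns)

    for hit_rate in (1.0, 0.999, 0.99, 0.9):
        print(f"HBM hit rate {hit_rate:.3f}: {average_access_ns(hit_rate):>9,.0f} ns average")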

Volatile versus persistent memory — Persistent memory systems using byte-addressable PMEM (notably Intel Optane DC) allowed checkpoint intervals to shrink from minutes to seconds. With Optane discontinued as of 2022 (per Intel's product discontinuation notice), the field has shifted toward fast NVMe checkpoint-to-storage pipelines as an alternative.

Precision and quantization — Reducing parameter precision from FP32 to FP16 halves HBM occupancy; INT8 quantization reduces it to one-quarter. Memory optimization strategies based on precision reduction are now a standard part of production deployment pipelines and are benchmarked under accuracy constraints in suites such as MLCommons' MLPerf.


References