GPU Memory Architecture: VRAM, GDDR, and Compute Workloads

GPU memory architecture defines how graphics processors store, access, and move data across dedicated on-chip and off-chip memory subsystems. This page covers the structural composition of VRAM, the GDDR and HBM memory standards that dominate discrete GPU designs, and the performance tradeoffs that govern compute-intensive workloads in gaming, machine learning training and inference, and scientific simulation. Understanding where GPU memory fits within the broader memory hierarchy is essential for professionals specifying hardware for throughput-sensitive applications.


Definition and scope

Video RAM (VRAM) refers to the dedicated memory pool attached directly to the GPU, operating independently of system DRAM. Unlike general-purpose RAM, which is optimized for the CPU's latency-sensitive access patterns, VRAM is engineered for massive parallel bandwidth: the ability to feed thousands of shader cores simultaneously with texture data, framebuffer contents, activation tensors, or working matrices.

The two dominant VRAM interface families are GDDR (Graphics Double Data Rate) and HBM (High Bandwidth Memory). The JEDEC Solid State Technology Association maintains the open standards in both families: GDDR6 is published as JESD250 (JEDEC), although GDDR6X is a proprietary Micron-developed derivative outside the JEDEC process, while HBM2E is governed under JESD235 and HBM3 under JESD238 (JEDEC HBM).

GDDR memory connects to the GPU through a wide parallel bus — 256-bit, 320-bit, or 384-bit interfaces being common in discrete consumer and professional cards — whereas HBM stacks DRAM dies vertically on a silicon interposer adjacent to the GPU, enabling interface widths of 1,024 bits per stack.


How it works

VRAM operates through a memory controller integrated into the GPU die. The controller arbitrates read and write requests from thousands of concurrent shader threads, uses coalescing logic to merge adjacent memory accesses into single transactions, and manages row activation in DRAM arrays to minimize latency penalties.
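As a concrete illustration of why coalescing matters, the following CUDA sketch contrasts a coalesced copy, where consecutive threads read consecutive 4-byte words and the controller can merge a warp's loads into a few wide transactions, with a strided copy that scatters the same warp across many memory segments. The kernel names and the wrap-around stride pattern are illustrative, not drawn from any vendor sample.

    #include <cuda_runtime.h>

    // Coalesced: thread i touches word i, so one 32-thread warp covers a
    // single contiguous 128-byte span that the memory controller can
    // service in one wide transaction.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: adjacent threads touch words `stride` elements apart, so
    // the same warp scatters across many separate segments and the
    // coalescing logic cannot merge its loads.
    __global__ void copy_strided(const float* in, float* out,
                                 int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        long long j = (long long)i * stride % n;  // wrap to stay in bounds
        if (i < n) out[i] = in[j];
    }

Profiling the two kernels over the same element count typically shows the strided version sustaining a small fraction of the coalesced version's throughput, which is the coalescing effect described above.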

The bandwidth advantage of GDDR6X relative to GDDR6 derives from PAM4 (Pulse Amplitude Modulation, 4-level) signaling, which encodes 2 bits per transmitted symbol rather than 1, doubling data throughput per pin without raising the symbol rate. NVIDIA's GeForce RTX 4090 implements 24 GB of GDDR6X across a 384-bit bus, yielding a published memory bandwidth of 1,008 GB/s (NVIDIA).

HBM architecture eliminates the wide PCB trace bus by using through-silicon vias (TSVs) to stack DRAM dies vertically, up to 12-high in HBM2E and 16-high in HBM3. An HBM3 stack delivers up to 819 GB/s (JEDEC JESD238), making HBM the dominant choice for data center accelerators where die area and thermal density constrain traditional GDDR routing.
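Both headline bandwidth figures can be reproduced from interface width and per-pin data rate. The sketch below is a back-of-envelope check, not vendor code; the per-pin rates (21 Gb/s for GDDR6X, a figure that already reflects PAM4's 2 bits per symbol, and 6.4 Gb/s for HBM3) are the published rates for the parts cited above.

    #include <cstdio>

    // Theoretical DRAM bandwidth: bus width (bits) x per-pin data rate
    // (Gb/s), divided by 8 to convert bits to bytes.
    double bandwidth_gbs(int bus_bits, double gbps_per_pin) {
        return bus_bits * gbps_per_pin / 8.0;
    }

    int main() {
        // RTX 4090: 384-bit bus, 21 Gb/s per pin (PAM4 symbols at
        // 10.5 GBaud carrying 2 bits each).
        printf("GDDR6X: %.0f GB/s\n", bandwidth_gbs(384, 21.0));     // 1008
        // One HBM3 stack: 1,024-bit interface at 6.4 Gb/s per pin.
        printf("HBM3 stack: %.1f GB/s\n", bandwidth_gbs(1024, 6.4)); // 819.2
        return 0;
    }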

The cache layer sitting between shader cores and VRAM plays a critical role: larger last-level caches reduce how often requests reach DRAM at all. AMD's RDNA 3 architecture, for instance, implements up to 96 MB of Infinity Cache (a last-level cache beyond the L2) on select SKUs to reduce effective bandwidth demands at narrower memory bus widths.
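A first-order model shows why a large last-level cache stretches effective bandwidth: only misses consume DRAM cycles, so the request bandwidth the shader cores observe scales roughly as DRAM bandwidth divided by the miss rate. The board figure and hit rates below are illustrative assumptions, not AMD specifications.

    #include <cstdio>
    #include <initializer_list>

    // Crude model: if a fraction hit_rate of requests is absorbed by the
    // last-level cache, DRAM sees only the misses, so observed bandwidth
    // is amplified by 1 / (1 - hit_rate). Ignores cache bandwidth limits.
    double effective_gbs(double dram_gbs, double hit_rate) {
        return dram_gbs / (1.0 - hit_rate);
    }

    int main() {
        double dram = 960.0;  // hypothetical 384-bit GDDR6 board, GB/s
        for (double h : {0.0, 0.5, 0.75}) {
            printf("hit rate %.0f%% -> %.0f GB/s effective\n",
                   h * 100.0, effective_gbs(dram, h));
        }
        return 0;
    }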


Common scenarios

GPU memory architecture choices manifest differently across three primary deployment contexts:

  1. Real-time rendering and gaming — Framebuffer storage, texture atlases, and shadow maps consume VRAM in proportion to resolution and asset fidelity. At 4K with high-quality textures, scenes can exceed 12 GB of VRAM allocation. The memory systems for gaming reference covers platform-specific constraints in detail.

  2. Machine learning training — Transformer model weights, gradient tensors, and optimizer states must all reside in VRAM simultaneously during backward passes. A 70-billion-parameter language model stored in 16-bit precision requires approximately 140 GB for its weights alone, before gradients and optimizer state are counted (see the sketch after this list), exceeding the capacity of any single consumer GPU and driving multi-GPU NVLink or HBM-based data center solutions.

  3. Scientific and HPC simulation — Computational fluid dynamics, molecular dynamics, and climate modeling workloads sustain near-peak memory bandwidth for extended periods. AMD's Instinct MI300X, which carries 192 GB of HBM3, targets this segment (AMD).
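The training-capacity arithmetic from scenario 2 can be made explicit. The sketch below uses a commonly cited mixed-precision layout (16-bit weights and gradients plus 32-bit Adam optimizer state); the per-parameter byte counts are rule-of-thumb assumptions and exclude activation memory.

    #include <cstdio>

    // Rule-of-thumb training footprint per parameter, mixed precision:
    //   2 B weights + 2 B gradients + 12 B Adam state (fp32 master copy,
    //   momentum, variance). Activations are workload-dependent and
    //   excluded here.
    double training_gb(double params, double bytes_per_param) {
        return params * bytes_per_param / 1e9;
    }

    int main() {
        double p = 70e9;  // 70-billion-parameter model
        printf("weights only (fp16): %.0f GB\n", training_gb(p, 2.0));  // 140
        printf("with grads + Adam:   %.0f GB\n", training_gb(p, 16.0)); // 1120
        return 0;
    }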

Professionals evaluating these scenarios should also consult the memory systems for high-performance computing reference for bandwidth and latency benchmarking methodology.


Decision boundaries

Selecting between GDDR and HBM involves tradeoffs across five measurable dimensions:

Dimension                          GDDR6 / GDDR6X    HBM2E / HBM3
Bandwidth per device               384–1,008 GB/s    1,638+ GB/s (2× stacks)
Capacity ceiling (single GPU)      24 GB (GDDR6X)    192 GB (HBM3, MI300X)
Manufacturing cost                 Lower             Higher (interposer required)
Board form factor                  Standard PCIe     Requires 2.5D packaging
Power efficiency (GB/s per watt)   Moderate          High

GDDR6X is appropriate where cost-per-unit and standard board dimensions are constraints and bandwidth requirements fall below 1 TB/s. HBM is the correct choice when the required capacity exceeds 48 GB, when sustained bandwidth above 1.6 TB/s is required, or when thermal density on the board prohibits wide GDDR bus routing.
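Expressed as code, the boundaries above reduce to a few threshold checks. The function below simply restates the table and the preceding paragraph; the thresholds are this page's decision points, not industry-standard cutoffs.

    #include <cstdio>

    // Restates the decision boundaries above as explicit thresholds;
    // returns true when HBM is indicated for the workload.
    bool prefer_hbm(double capacity_gb, double bandwidth_tbs,
                    bool dense_thermal_board) {
        return capacity_gb > 48.0 || bandwidth_tbs > 1.6 ||
               dense_thermal_board;
    }

    int main() {
        // Example: 80 GB working set, 2 TB/s sustained, standard board.
        printf("%s\n", prefer_hbm(80.0, 2.0, false) ? "HBM" : "GDDR");
        return 0;
    }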

Memory bandwidth and latency profiling is the standard diagnostic step before specifying a GPU memory tier for a new workload. Workloads that are compute-bound rather than memory-bound derive limited benefit from HBM's bandwidth premium. The memory bottlenecks and solutions reference documents the diagnostic methods used to classify workload sensitivity.
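One standard way to perform that classification is a roofline-style comparison of a workload's arithmetic intensity (FLOPs per byte of DRAM traffic) against the machine balance (peak FLOP/s divided by peak bandwidth); workloads below the balance point are memory-bound and are the ones that benefit from HBM. The accelerator figures and per-kernel intensities below are illustrative assumptions.

    #include <cstdio>

    // Roofline-style classification: a kernel whose arithmetic intensity
    // falls below the machine balance (peak FLOP/s / peak bytes/s) is
    // limited by memory bandwidth, not compute.
    const char* classify(double flops_per_byte,
                         double peak_tflops, double peak_tbs) {
        double balance = peak_tflops / peak_tbs;  // FLOPs per byte
        return flops_per_byte < balance ? "memory-bound" : "compute-bound";
    }

    int main() {
        // Hypothetical accelerator: 100 TFLOP/s, 2 TB/s -> balance = 50.
        printf("SpMV (~0.25 F/B):      %s\n", classify(0.25, 100.0, 2.0));
        printf("Large GEMM (~200 F/B): %s\n", classify(200.0, 100.0, 2.0));
        return 0;
    }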

The full landscape of compute and storage memory types, including non-GPU volatile subsystems, is catalogued at the Memory Systems Authority.


References