High Bandwidth Memory (HBM): Architecture and Industry Use

High Bandwidth Memory (HBM) is a specialized DRAM architecture designed to deliver extreme memory bandwidth by stacking multiple memory dies vertically and connecting them to a logic die or processor through a silicon interposer. This page covers HBM's physical structure, generational standards, operational role in compute-intensive workloads, and the technical boundaries that determine when HBM is the appropriate memory choice versus competing alternatives. HBM has become a foundational technology in AI accelerators, high-performance computing (HPC) platforms, and advanced graphics processors, where memory bandwidth and latency constraints directly limit system throughput.


Definition and scope

HBM is defined by JEDEC — the standards body responsible for semiconductor memory specifications — in the JESD235 standard series for HBM, HBM2, and HBM2E, and in JESD238 for HBM3; together these govern the electrical, physical, and interface requirements for each HBM generation. The core architecture stacks between 4 and 16 DRAM dies vertically, interconnected by thousands of Through-Silicon Vias (TSVs), and mounted alongside the host processor on a 2.5D silicon interposer. This interposer-based integration is what distinguishes HBM from conventional DRAM packages mounted on a PCB.

The scope of HBM as a product category spans four commercially relevant generations (a worked per-stack bandwidth calculation follows the list):

  1. HBM (Gen 1) — ratified under JESD235; delivered up to 128 GB/s per stack at 1 Gbps per pin.
  2. HBM2 — ratified under JESD235A; doubled per-pin bandwidth to 2 Gbps, with stacks reaching up to 8 GB capacity.
  3. HBM2E — an extension within the JESD235 series; pushed per-stack bandwidth to approximately 460 GB/s (roughly 3.6 Gbps per pin) and capacity to 16 GB per stack.
  4. HBM3 — ratified under JESD238; reaches 819 GB/s per stack at 6.4 Gbps per pin, with the HBM3E variant targeting over 1.2 TB/s per stack in leading implementations.
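
As a sanity check on these figures, per-stack bandwidth follows directly from the common 1,024-bit interface width and the per-pin data rate. The short sketch below recomputes the nominal numbers; the per-pin rates used for HBM2E and HBM3E are representative shipping parts, not values fixed by the standards.

    # Per-stack bandwidth = interface width (bits) x per-pin rate (Gbit/s) / 8.
    # HBM2E and HBM3E per-pin rates are representative vendor figures (assumptions),
    # not JEDEC-mandated values.
    INTERFACE_BITS = 1024  # per stack, for all generations listed here

    per_pin_gbps = {
        "HBM (Gen 1)": 1.0,
        "HBM2": 2.0,
        "HBM2E": 3.6,
        "HBM3": 6.4,
        "HBM3E": 9.6,
    }

    for gen, rate in per_pin_gbps.items():
        print(f"{gen:12s} ~{INTERFACE_BITS * rate / 8:7.1f} GB/s per stack")
    # -> 128.0, 256.0, 460.8, 819.2, 1228.8 GB/s respectively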

JEDEC publishes the full JESD235 and JESD238 documentation through its official standards portal.

The reference landscape for memory systems standards and specifications places HBM within a broader ecosystem of JEDEC-governed interfaces that also includes DDR5, LPDDR5, and GDDR7, each targeting distinct bandwidth-capacity-power trade-off profiles.


How it works

HBM achieves its bandwidth advantage through three structural mechanisms operating simultaneously.

TSV interconnects and wide I/O: Each DRAM die in a stack contains thousands of TSVs — vertical copper-filled vias etched through the silicon — that carry data between layers. An HBM3 stack exposes a 1,024-bit-wide memory bus, compared with the 32-bit subchannels (64 data bits per module) used by DDR5. This bus width, not raw clock speed, is the primary source of the bandwidth gain.
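
To make the bus-width point concrete, the sketch below compares an HBM3 stack against a single DDR5 subchannel at the same assumed per-pin rate of 6.4 Gbps (DDR5-6400); the figures are illustrative, not vendor specifications.

    # Bandwidth scales with interface width when per-pin rates are comparable.
    # Assumes 6.4 Gbit/s per pin for both interfaces (DDR5-6400; HBM3 nominal).
    def bandwidth_gb_per_s(width_bits: int, pin_rate_gbps: float) -> float:
        return width_bits * pin_rate_gbps / 8  # GB/s

    ddr5_subchannel = bandwidth_gb_per_s(32, 6.4)    # ~25.6 GB/s
    hbm3_stack = bandwidth_gb_per_s(1024, 6.4)       # ~819.2 GB/s
    print(f"DDR5 subchannel: {ddr5_subchannel:.1f} GB/s")
    print(f"HBM3 stack:      {hbm3_stack:.1f} GB/s "
          f"({hbm3_stack / ddr5_subchannel:.0f}x, same per-pin rate)")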

Silicon interposer integration: The host processor (GPU, ASIC, or HPC chip) and HBM stacks are co-packaged on a passive silicon interposer, which provides dense, short-trace connections between the processor's memory controller and the HBM interface. Trace lengths of a few millimeters, rather than the centimeters typical of PCB routing, relax signal-integrity constraints and allow the wide bus to operate at lower I/O voltages than PCB-routed DRAM, cutting power consumption per bit transferred.

Stacked die architecture: A base logic die handles row/column addressing, refresh logic, and ECC circuitry, while stacked DRAM core dies provide raw storage capacity. The base die interfaces with the interposer through microbumps spaced at pitches as fine as 55 micrometers in current-generation packages.

The combined effect is that a single HBM3E stack can deliver over 1.2 TB/s of aggregate bandwidth at a power envelope measured in watts — a bandwidth density that conventional GDDR or DDR implementations cannot match at equivalent die area. In the framing of Memory hierarchy explained, HBM serves as an on-package last-level memory tier, sitting between on-chip SRAM caches and off-package memory and storage.
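
A rough feel for that power envelope comes from multiplying sustained bandwidth by the energy spent per bit moved. The sketch below assumes roughly 3.5 pJ/bit for HBM and about 7 pJ/bit for a GDDR-class interface; both are ballpark figures from public literature, not values in any JEDEC specification.

    # Rough I/O power estimate: sustained bandwidth x energy per bit moved.
    # Energy-per-bit values below are assumed ballpark figures, not specified ones.
    def io_power_watts(bandwidth_tb_per_s: float, energy_pj_per_bit: float) -> float:
        bits_per_s = bandwidth_tb_per_s * 1e12 * 8
        return bits_per_s * energy_pj_per_bit * 1e-12  # joules per second = watts

    print(f"HBM3E stack @ 1.2 TB/s, 3.5 pJ/bit:    ~{io_power_watts(1.2, 3.5):.0f} W")  # ~34 W
    print(f"Same bandwidth @ 7 pJ/bit (GDDR-class): ~{io_power_watts(1.2, 7.0):.0f} W")  # ~67 W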


Common scenarios

HBM appears in production deployments across five identifiable workload categories (per-stack figures for the accelerator examples are back-calculated in the sketch after the list):

  1. AI accelerators — NVIDIA's H100 GPU uses 80 GB of HBM3 across 5 stacks, delivering 3.35 TB/s aggregate bandwidth (NVIDIA H100 Tensor Core GPU Architecture whitepaper). AMD's MI300X uses 192 GB of HBM3 across 8 stacks.
  2. HPC and scientific simulation — Supercomputers including Frontier (Oak Ridge National Laboratory) deploy AMD Instinct MI250X accelerators with HBM2E, where memory bandwidth is the binding constraint in fluid dynamics and molecular modeling workloads.
  3. Network switching ASICs — Broadcom's Jericho family of deep-buffer switch chips integrates HBM for packet buffering, where sustained bandwidth demands exceed what DDR offers at acceptable latency.
  4. Graphics processing — AMD's Vega-based professional visualization cards, such as the Radeon Pro VII, use HBM2 for framebuffer and compute contexts where GDDR bandwidth ceilings create rendering bottlenecks.
  5. Custom and government ASICs — Intel's Ponte Vecchio (Xe HPC) integrates HBM2E stacks, and programs under the US Department of Energy's Exascale Computing Project specified HBM as a minimum capability requirement for qualifying accelerators.
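
Dividing the published aggregate figures by the stack count recovers the per-stack numbers and ties them back to the generation specifications above. The sketch uses only the figures quoted in item 1; the implied per-pin rate is a derived estimate, not a vendor-published value.

    # Back-calculating per-stack figures from the accelerator numbers quoted above.
    h100 = {"capacity_gb": 80, "stacks": 5, "aggregate_tb_per_s": 3.35}
    mi300x = {"capacity_gb": 192, "stacks": 8}

    per_stack_gb_s = h100["aggregate_tb_per_s"] * 1000 / h100["stacks"]   # ~670 GB/s
    implied_pin_gbps = per_stack_gb_s * 8 / 1024                          # ~5.2 Gbps
    print(f"H100:   {h100['capacity_gb'] / h100['stacks']:.0f} GB and ~{per_stack_gb_s:.0f} GB/s per stack "
          f"(~{implied_pin_gbps:.1f} Gbps per pin, below the 6.4 Gbps HBM3 ceiling)")
    print(f"MI300X: {mi300x['capacity_gb'] / mi300x['stacks']:.0f} GB per stack")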

Memory systems for high-performance computing covers the interplay between HBM capacity, interconnect topology, and node-level memory pressure in cluster environments.


Decision boundaries

HBM is not the default choice for general-purpose applications. The selection criteria that favor HBM over DDR5 or GDDR7 rest on a set of technical and economic thresholds, the first of which is whether the workload is actually bandwidth-bound.

Workloads that do not saturate memory bandwidth — transactional databases, web servers, general-purpose CPUs — receive no measurable benefit from HBM and incur cost and integration complexity without return. Memory bottlenecks and solutions provides the diagnostic framework for identifying whether bandwidth or latency is the actual limiting factor before HBM selection is warranted.
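
One common heuristic for that diagnosis is a roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte of memory traffic) against the machine balance (peak compute divided by peak memory bandwidth). The sketch below illustrates the idea with round, assumed peak figures; it is not the framework described in Memory bottlenecks and solutions.

    # Roofline-style check: a kernel whose arithmetic intensity falls below the
    # machine balance is bandwidth-bound and stands to benefit from HBM; above it,
    # faster memory buys little. Peak figures here are illustrative round numbers.
    def is_bandwidth_bound(flops_per_byte: float,
                           peak_tflops: float,
                           peak_bandwidth_tb_per_s: float) -> bool:
        machine_balance = peak_tflops / peak_bandwidth_tb_per_s  # FLOPs per byte
        return flops_per_byte < machine_balance

    # ~1000 TFLOP/s of compute over ~3.35 TB/s of HBM gives a balance of ~300 FLOPs/byte.
    print(is_bandwidth_bound(1.0, peak_tflops=1000, peak_bandwidth_tb_per_s=3.35))    # True
    print(is_bandwidth_bound(500.0, peak_tflops=1000, peak_bandwidth_tb_per_s=3.35))  # False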

The complete reference index for memory technology classification is available at Memory Systems Authority.


References