Shared Memory Systems: Concepts and Use Cases

Shared memory systems are a class of computing architectures in which two or more processors or processes access a common memory address space. This page covers the defining characteristics of shared memory, how access coordination operates, the practical scenarios where shared memory is the dominant design choice, and the boundaries that determine when alternative architectures are more appropriate. The subject spans hardware-level multiprocessor designs, operating system primitives, and distributed middleware layers.

Definition and scope

A shared memory system is any architecture where multiple execution units — processors, processor cores, or software processes — read and write to overlapping physical or virtual address ranges without explicit message passing. IEEE Std 1003.1 (the POSIX standard), maintained jointly by the IEEE and The Open Group, defines shared memory objects as named memory regions that can be mapped into the address space of cooperating processes, distinguishing them from both private heap allocations and message queues.

The scope of shared memory spans three distinct layers:

  1. Hardware shared memory (UMA and NUMA): Uniform Memory Access (UMA) architectures give every processor socket equal-latency access to all installed DRAM. Non-Uniform Memory Access (NUMA) architectures, standard in multi-socket servers with 2 to 8 sockets, assign each socket a local memory bank and impose higher latency — typically 2× to 3× — on remote bank accesses. Intel's documentation for Xeon Scalable platforms and AMD's EPYC architecture whitepapers both describe NUMA topology configurations in production hardware.

  2. OS-level shared memory: Operating system primitives — POSIX shm_open, System V shmget, and Windows Named Shared Memory — create regions accessible across process boundaries within a single host. These are catalogued in the Linux man-pages project and the Microsoft Windows API documentation.

  3. Distributed shared memory (DSM): Software or hardware middleware presents a shared address space across physically separate nodes in a cluster, despite the absence of a direct physical memory bus. DSM systems face coherence latency penalties that hardware-level UMA does not.
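The OS-level primitives in layer 2 can be sketched with Python's multiprocessing.shared_memory module, which wraps POSIX shm_open/mmap on Linux and macOS. The payload contents below are illustrative; the region name is auto-generated by the library.

```python
from multiprocessing import shared_memory

# Create a named shared memory region (backed by POSIX shm_open on
# Linux/macOS). The auto-generated name is how cooperating processes
# would locate and map the same region.
region = shared_memory.SharedMemory(create=True, size=64)
try:
    # Writer side: place bytes into the mapped buffer.
    payload = b"hello from writer"
    region.buf[:len(payload)] = payload

    # Reader side: a second handle attaches by name, as another
    # process on the same host would.
    reader = shared_memory.SharedMemory(name=region.name)
    received = bytes(reader.buf[:len(payload)])
    reader.close()
finally:
    region.close()
    region.unlink()  # remove the named region from the system
```

Both handles map the same physical pages, so the reader sees the writer's bytes without any copy through a kernel message channel.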

The boundary between shared memory and distributed memory systems is architecturally significant: distributed memory passes explicit messages (MPI model), while shared memory exposes a unified address space.

How it works

Access coordination in shared memory systems depends on two foundational mechanisms: cache coherence protocols and synchronization primitives.

Cache coherence ensures that when one processor modifies a cache line, other processors holding stale copies are either invalidated or updated. The two dominant protocol families are:

  1. Invalidation-based protocols (MESI and its variants, such as MOESI and MESIF): a write invalidates every other cached copy of the line, forcing subsequent readers to fetch the new value.

  2. Update-based protocols: a write broadcasts the new value to every cache holding the line, keeping copies current at the cost of additional interconnect traffic.

The JEDEC Solid State Technology Association publishes memory interface standards — including DDR5 and LPDDR5 specifications — that define the electrical and timing constraints within which coherence protocols must operate.
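The invalidation behavior described above can be illustrated with a toy two-cache model. This is a deliberately simplified MSI-style state machine (write-back to memory is omitted), not a faithful implementation of any hardware protocol.

```python
# Toy model of invalidation-based coherence: each cache tracks a line's
# state as 'M' (modified), 'S' (shared), or 'I' (invalid). A write by
# one cache invalidates every other copy of the line.

class ToyCache:
    def __init__(self):
        self.state = "I"
        self.value = None

def write(caches, writer, value):
    """Writer gains exclusive (M) ownership; all peers are invalidated."""
    for i, c in enumerate(caches):
        if i == writer:
            c.state, c.value = "M", value
        elif c.state != "I":
            c.state, c.value = "I", None

def read(caches, reader, memory_value):
    """A read hits locally or fetches the line, demoting any modified
    peer copy to shared (write-back to memory omitted for brevity)."""
    c = caches[reader]
    if c.state == "I":
        # Fetch the most recent value: from an 'M' peer if one exists.
        owner = next((p for p in caches if p.state == "M"), None)
        c.value = owner.value if owner else memory_value
        if owner:
            owner.state = "S"
        c.state = "S"
    return c.value

caches = [ToyCache(), ToyCache()]
write(caches, 0, 42)    # core 0 writes: M in cache 0, I in cache 1
v = read(caches, 1, 0)  # core 1 reads: fetches 42, both lines now S
```

After the read, both caches hold the line in the shared state, which is the invariant a real invalidation protocol maintains before permitting concurrent readers.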

Synchronization primitives prevent race conditions when concurrent writes target the same address. These include mutexes, semaphores, read-write locks, and atomic compare-and-swap (CAS) operations. The correctness of concurrent access depends on memory consistency models: Total Store Order (TSO), used by x86, permits certain reorderings that a sequentially consistent model (as defined in the C++11 standard, ISO/IEC 14882:2011) prohibits.
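The CAS retry pattern mentioned above can be sketched in Python. Python exposes no user-level CAS instruction, so this emulation uses a lock to stand in for the hardware's atomicity guarantee; the AtomicInt class and its method names are illustrative, not a standard API.

```python
import threading

class AtomicInt:
    """Emulated atomic integer; a lock substitutes for hardware CAS."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Set to `new` iff the current value equals `expected`;
        return True on success. These are standard CAS semantics."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def load(self):
        with self._lock:
            return self._value

def increment(counter):
    """Classic CAS retry loop: re-read and retry until the swap wins."""
    while True:
        current = counter.load()
        if counter.compare_and_swap(current, current + 1):
            return

counter = AtomicInt()
threads = [
    threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Four threads each perform 1,000 CAS-based increments; because a swap only succeeds when no concurrent update intervened, no increment is lost and the counter ends at 4,000.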

For a structured look at latency trade-offs within the broader memory hierarchy, see Memory Hierarchy Explained and Memory Bandwidth and Latency.

Common scenarios

Shared memory is the dominant architecture in four categories of production deployment:

  1. Symmetric multiprocessor (SMP) servers: Database engines, web application servers, and scientific computing jobs that run on a single multi-core host rely on shared DRAM. PostgreSQL's shared buffer pool, for example, is a POSIX shared memory region accessed by all backend worker processes concurrently.

  2. High-performance computing (HPC) intra-node communication: Within a single compute node on an HPC cluster, MPI implementations switch to shared memory transport (bypassing the network) when both communicating ranks reside on the same host, reducing latency below 1 microsecond. This pattern is documented in the Open MPI project technical documentation.

  3. Real-time and embedded systems: Shared memory between a real-time kernel and a general-purpose OS partition (as in PREEMPT-RT Linux or VxWorks dual-partition configurations) enables deterministic inter-partition data exchange without network stack overhead. The AUTOSAR standard for automotive software architectures specifies shared memory communication patterns between software partitions on an ECU.

  4. In-process parallel workloads: Multi-threaded applications — video encoders, machine learning inference engines, compilers — share working data through process-private shared memory (the heap and thread stacks), coordinated through language-level atomics and locks.
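Scenario 4 can be sketched with a thread pool writing into one heap-allocated buffer. Each worker fills a disjoint slice, so the writes need no lock while every thread still observes the same shared address space; the worker count and chunk size here are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# All threads share one heap-allocated buffer; each writes a disjoint
# slice, so no lock is needed for the writes themselves.
N_WORKERS, CHUNK = 4, 8
buffer = bytearray(N_WORKERS * CHUNK)  # shared working data on the heap

def fill(worker_id):
    # Write this worker's ID into its own slice of the shared buffer.
    start = worker_id * CHUNK
    for i in range(start, start + CHUNK):
        buffer[i] = worker_id  # immediately visible to other threads

with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    list(pool.map(fill, range(N_WORKERS)))
```

Partitioning the buffer by worker is a common way to get shared memory parallelism without per-write synchronization; locks or atomics are only needed once slices overlap.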

For workload-specific design patterns, Memory Systems for High-Performance Computing and In-Memory Computing extend these scenarios further.

Decision boundaries

Shared memory is appropriate when execution units are co-located and latency below 100 nanoseconds is operationally necessary. The threshold shifts when:

  1. The working set exceeds the memory capacity of a single host, forcing data to be partitioned across nodes.

  2. Fault isolation matters: a misbehaving process can corrupt a shared region, whereas message-passing components fail independently.

  3. Execution units are physically distributed, making explicit message passing (the MPI model) or DSM middleware the practical choice despite its coherence latency penalty.

The foundational reference for memory system classification across all these dimensions is the Memory Systems Authority index, which maps the full taxonomy of memory architectures covered across this reference network.
