Manage Your Memory Address Space with OpenSHMEM*

Distributed Communication Blog 2

 

David Ozog

Middleware Software Engineer, Intel Corp.

LinkedIn

Larry Stewart

Middleware Software Engineer, Intel Corp.

LinkedIn

Why Shared Memory?

There are two primary ways to exploit parallelism on distributed-memory systems:

  1. Message Passing
  2. Shared Memory

Most hardware configurations that support large-scale distributed computing do not have physical shared memory. Instead, shared memory is a software abstraction layer that maps a single shared address space onto the disjoint memories of many separate computers. As a model of computation, however, shared memory is a very familiar concept to programmers. Building on this idea, a number of programming models have evolved to support applications written in a shared-memory style that can run on hardware spanning hundreds or thousands of computers.

In this blog, we will talk about one such programming model:

Partitioned Global Address Space to the Rescue

The parallel programming model we will discuss in this blog is the Partitioned Global Address Space (PGAS). PGAS hides the fact that the underlying hardware consists of separate memories, which makes distributed programming feel more like shared memory programming.

The memories distributed across the nodes of the system are partitioned into a globally visible portion and a local portion. The abstraction makes clear which memory is local and which is remote. The basic idea is to create objects collectively in the globally visible, shared partition across the system. A program running on each node then gets the objects it needs from the part of shared memory closest to it, does its computation locally, and puts the results back into the global, shared memory.

A number of programming environments directly support PGAS programming. Unified Parallel C, for example, is similar to C, and Titanium is similar to Java. PGAS can, however, also be added to classic languages by means of a subroutine library. That is what SHMEM is (SHMEM stands for SHared MEMory): a subroutine library for PGAS programming. You can use it from C, C++, Fortran, or any language for which SHMEM bindings exist.

Let us talk a bit more about OpenSHMEM, an open-source specification and implementation for SHMEM. OpenSHMEM is an effort to create a standardized API for parallel programming in a Partitioned Global Address Space.

What Is PGAS Anyway?

A PGAS program almost always executes through the SPMD execution model, where SPMD stands for “Single Program, Multiple Data.”  Imagine N copies of the same executable, started at the same time on multiple computers.  Initialization code puts all these copies in touch with one another and assigns each copy a unique processing element (PE) number. This is the same idea as a rank in the Message Passing Interface (MPI) standard.

 


Figure 1: Organization of Workload into Multiple Processing Elements (PE)

 

The initialization code then sets up a global memory space, partitioned so that each PE owns a specified piece of it.  Figure 1 shows a typical case, in which some global 3D data array is broken up into six pieces, one hosted by each PE.

A PE can make local references to its own portion of the global space but must make “Put” and “Get” library calls to write and read remote portions of the global space.  The "Put" and "Get" routines implement a one-sided communication protocol.  This is critical as the model maps onto a wide range of systems following different models of how concurrent tasks are scheduled (including the very different models on a CPU or on a GPU).

All PEs execute the same program. Each PE controls its own execution path with if-then blocks that test its PE number, or by indexing data with respect to its PE number (see the sketch following Figure 2).

To demonstrate this point, a very simple OpenSHMEM program is shown in Figure 2. 

Each copy of the program simply prints its own ID.

The general outline of routines in OpenSHMEM is:

Initialization and Query

  • shmem_init() sets up symmetric data regions
  • shmem_n_pes() queries the total number of executing PEs
  • shmem_my_pe() queries the PE number (or ID) of “this” process

Memory Management

  • “Symmetric” memory is remotely accessible
  • shmem_{malloc|calloc|realloc}

Remote Memory Access (RMA):

  • Puts / Gets / Put-with-signal

Atomic Memory Operations (AMO):

  • increment, compare-swap, fetch, bitwise, etc.

Collective Operations:

  • Barrier, Broadcast, reduction, all-to-all, etc.

Memory Ordering:

  • OpenSHMEM operations are unordered by default. The user calls fence to order them and quiet to complete them.

 

 

#include <shmem.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    shmem_init();
    int npes = shmem_n_pes();
    int me = shmem_my_pe();
    printf("Hello, World! I am PE %d of %d\n", me, npes);
    shmem_finalize();
    return 0;
}

Figure 2: A “Hello, World” OpenSHMEM Program
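
To make this SPMD control flow concrete, here is a minimal sketch that builds on the program above. The array name data, the problem size N, and the cyclic work distribution are illustrative choices, not part of the OpenSHMEM API:

#include <shmem.h>
#include <stdio.h>

#define N 1024  /* illustrative problem size */

/* Global/static arrays live in the remotely accessible symmetric region. */
static double data[N];

int main(void)
{
    shmem_init();
    int npes = shmem_n_pes();
    int me = shmem_my_pe();

    /* Cyclic distribution: each PE touches only the indices it owns. */
    for (int i = me; i < N; i += npes)
        data[i] = 2.0 * i;

    /* Only PE 0 executes this branch. */
    if (me == 0)
        printf("PE 0 of %d finished its slice\n", npes);

    shmem_finalize();
    return 0;
}

With Sandia* OpenSHMEM, such a program is typically compiled with the oshcc wrapper and launched with oshrun, though build and launch details vary by implementation.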

 

What Memory Can Be Accessed Remotely?

We’ve already talked about Initialization and hinted at Remote Memory Access. 

In a way, the most unusual aspect of OpenSHMEM is that not all memory is remotely accessible.  The part of each program’s memory that is remotely accessible is called the Symmetric Region.

This region itself has two parts to it: global/static variables and the symmetric heap. 

The rest of data memory, such as thread stacks and objects allocated on the general heap, is not remotely accessible.

This design choice is grounded in pragmatism. To enable fast remote access, the underlying memory must be registered with the OpenSHMEM runtime, and the runtime must know the remote address of every symmetric object.

Global Variables

Global variables (and static variables) have a fixed size and location, so they are easy to register and are remotely accessible through the OpenSHMEM APIs. Each copy of the program can refer to the remote value of a global variable by using the remote processing element (PE) number and the local address of the same global variable.
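
Here is a minimal sketch of that idea; the variable name flag and the values used are purely illustrative. PE 0 writes into the copy of a global variable that lives on the last PE, and a barrier makes the update visible before it is read:

#include <shmem.h>
#include <stdio.h>

/* A global variable: every PE has its own copy at the same local address. */
static int flag = 0;

int main(void)
{
    shmem_init();
    int npes = shmem_n_pes();
    int me = shmem_my_pe();

    /* PE 0 writes 42 into the copy of 'flag' that lives on the last PE. */
    if (me == 0)
        shmem_int_p(&flag, 42, npes - 1);

    /* The barrier completes the put and synchronizes all PEs. */
    shmem_barrier_all();

    if (me == npes - 1)
        printf("PE %d sees flag = %d\n", me, flag);

    shmem_finalize();
    return 0;
}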

Symmetric Heap

The symmetric heap is a little more involved.  During initialization, each PE allocates and registers its own local instance of the symmetric heap.  These regions have the same size on all PEs but may have different virtual addresses. 

The OpenSHMEM runtime system will keep track of the symmetric heap base address on each PE, so the user doesn’t need to worry about that. 

The next trick in OpenSHMEM’s implementation is that the allocation of objects within the symmetric heap is a collective operation. This means that every PE allocates the same size objects in the same order, so that the offset of such an object within the symmetric heap is the same on every PE.  An access to an object in the symmetric heap of a remote PE is done using the remote PE number and the local address of the same object. 

 


Figure 3: Block Diagram of Heap Allocation across Multiple PEs

The code in Figure 4 illustrates how remote objects in the symmetric heap can be allocated, addressed, and deallocated:

 


Figure 4: Remote Object Access using SHMEM
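
As a minimal sketch in the same spirit (the buffer name buf and the neighbor-exchange pattern are illustrative), each PE collectively allocates a symmetric array, writes its PE number into its right-hand neighbor's copy of that array, waits for all communication to finish, and then collectively frees the array:

#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int npes = shmem_n_pes();
    int me = shmem_my_pe();

    /* Collective allocation: every PE allocates the same size, in the same
       order, so 'buf' names the corresponding object on every PE. */
    long *buf = shmem_malloc(npes * sizeof(long));

    /* Write my PE number into slot 'me' of my right-hand neighbor's buffer. */
    long val = me;
    shmem_long_put(&buf[me], &val, 1, (me + 1) % npes);

    shmem_quiet();        /* complete my outstanding puts */
    shmem_barrier_all();  /* ensure everyone's writes have arrived */

    printf("PE %d received %ld from its left neighbor\n",
           me, buf[(me + npes - 1) % npes]);

    shmem_free(buf);      /* symmetric deallocation is also collective */
    shmem_finalize();
    return 0;
}

Note the memory-ordering calls mentioned in the outline above: shmem_quiet completes this PE's outstanding puts, and shmem_barrier_all additionally synchronizes all PEs.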

 

As most programmers taking advantage of thread parallelism know, if you have multiple threads modifying memory, you must use locks or waits to keep them out of each other’s way. Alternatively, you can use atomic operations to ensure that results are correct even if multiple threads are doing things simultaneously. 

The same issues arise with OpenSHMEM. The only difference is that the threads are running on different computers.

For that reason, OpenSHMEM supports remote atomic operations and a number of synchronization mechanisms.
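
For instance, in the following minimal sketch (assuming OpenSHMEM 1.4-style atomic naming; the counter variable is illustrative), every PE atomically increments a single counter that lives on PE 0, with no locks required:

#include <shmem.h>
#include <stdio.h>

/* A symmetric counter; the instance on PE 0 is the one everybody updates. */
static long counter = 0;

int main(void)
{
    shmem_init();
    int npes = shmem_n_pes();
    int me = shmem_my_pe();

    /* Atomic at the target, so concurrent increments from all PEs are safe. */
    shmem_long_atomic_inc(&counter, 0);

    shmem_barrier_all();

    if (me == 0)
        printf("counter = %ld (expected %d)\n", counter, npes);

    shmem_finalize();
    return 0;
}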

In distributed parallel programming, synchronization across PEs can be enforced through a number of very common execution flow patterns, such as OpenSHMEM’s barrier, in which all PEs wait at a certain point in the code until every PE has finished its previous work. This is similar to a thread barrier (e.g., std::barrier) in same-CPU parallelism, except that the OpenSHMEM barrier must complete all pending communication across all PEs before continuing with the program.  

OpenSHMEM additionally provides collective operations, in which a collection of PEs contribute to a single aggregate operation. For example, one PE can broadcast its data to all other PEs. Alternatively, every PE can contribute its data to a reduction operation, for example to calculate a statistical average quantity across all PEs. OpenSHMEM supports many collective operations that are beyond the scope of this discussion, but more information can be found in the OpenSHMEM v1.5 specification Section 9.9.
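
As an example, this minimal reduction sketch (assuming an implementation of the OpenSHMEM 1.5 team-based collectives; the variable names are illustrative) sums one value contributed by each PE across the world team:

#include <shmem.h>
#include <stdio.h>

/* Symmetric source and destination for the reduction. */
static long my_value, total;

int main(void)
{
    shmem_init();
    int npes = shmem_n_pes();
    int me = shmem_my_pe();

    my_value = me + 1;  /* each PE contributes me + 1 */

    /* Collective sum across all PEs; every PE receives the result. */
    shmem_long_sum_reduce(SHMEM_TEAM_WORLD, &total, &my_value, 1);

    if (me == 0)
        printf("sum over %d PEs = %ld\n", npes, total);

    shmem_finalize();
    return 0;
}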

Next Steps

Consider OpenSHMEM for your next distributed computing project that needs partitioned shared memory management. Give it a try today! A good place to start is by following the “Getting Started Guide” for Sandia* OpenSHMEM.

Find out more about how Intel supports open industry standards for distributed computing with the Intel® oneAPI HPC Toolkit.

Intel actively contributes to the OpenSHMEM project and enables OpenSHMEM Code Analysis with Intel® VTune™ Profiler. We will leave that discussion for another blog.

 

Enable your program to scale across multiple compute nodes and be easily portable to a multitude of hardware and cluster configurations!


 

About the Author
David Ozog is a researcher and software engineer specializing in high performance computing (HPC) at Intel Corporation. David received a PhD in computer science at the University of Oregon where he was supported by the Department of Energy's Computational Science Graduate Fellowship. At Intel, he researches and develops technologies to accelerate partitioned global address space communication, focusing on the OpenSHMEM programming model and future computing/networking architectures. He is active in defining new interfaces for the OpenSHMEM specification and is a lead developer of the Sandia OpenSHMEM software library.