
Introducing the New Intel® SHMEM


Distributed Communications Blog: Introducing Intel® SHMEM

Intel’s OpenSHMEM Team

by Md Wasi-Ur Rahman, Lawrence Stewart, and David Ozog
Senior Software Development Engineers, Intel Corporation

Introduction

This post introduces Intel SHMEM, which extends the OpenSHMEM* programming model to support Intel® Data Center GPUs using the SYCL cross-platform C++ programming environment.

We are excited to announce the first open-source release of Intel® SHMEM v1.0.0.

Graphics Processing Units (GPUs) are now widely used in High Performance Computing (HPC) and Artificial Intelligence (AI) applications.  A GPU such as the Intel® Data Center GPU Max 1550 provides thousands of parallel threads and enormous memory bandwidth, offering a great opportunity to offload computation from the CPU.

If you have a system with multiple GPUs installed, you can write a program that uses them all, either by treating them as independent devices or by using high-performance communication libraries such as the Intel® MPI Library, the oneAPI Collective Communications Library (oneCCL), or now, Intel® SHMEM.

The Intel MPI Library and oneCCL are host-centric: they are called from host CPU code, but they can work with data in both host and device memory.

Intel SHMEM can do that too, and it adds the ability to call communication functions from SYCL kernels running on the GPU itself.  This permits fine-grained overlap between computation and communication and, in many cases, avoids the overhead of making multiple kernel calls.

Other accelerator vendors offer similar capabilities optimized for their respective hardware platforms (e.g., NVSHMEM* from NVIDIA* and ROC_SHMEM from AMD*).

This post briefly describes Intel SHMEM and its use, but first we’ll take a short detour into OpenSHMEM, GPUs, and SYCL.

OpenSHMEM

OpenSHMEM is a library interface standard that enables Single Program Multiple Data (SPMD) style programming of distributed memory systems.  In the SPMD model, you write a single program, and many copies of it execute across a supercomputer or cluster of computers.  The copies are distinguished by a unique processing element (PE) identifier, 0 through N-1, where N is the total number of copies.  OpenSHMEM provides several useful operations for efficient distributed programming:

  • remote memory access (RMA)
  • atomic memory operations (AMOs)
  • synchronization
  • collective operations

OpenSHMEM programming was described in a previous post on this blog series on distributed computing titled “Manage Your Memory Space with OpenSHMEM.”

OpenSHMEM has a comprehensive specification document, at version 1.5 at the time of this writing.  Several implementations of the OpenSHMEM library are open source and freely available to download.  We recommend Sandia* OpenSHMEM, an open implementation that supports various transport mechanisms, as well as OSHMPI, which implements OpenSHMEM on top of the Message Passing Interface (MPI).

Another blog post, entitled “Introduction to Distributed Communication: A Walk through the History of Distributed and Parallel Computing,” provides more background on general distributed and parallel computing.

To summarize, OpenSHMEM provides a model of computation in which copies of the same executable are launched on multiple nodes, and each copy knows its identity as a worker in the overall program via its PE number.  So, how do PEs transfer data to and from each other?  This is where special regions of the memory space within each PE, called symmetric memory, come into play.  Symmetric memory in OpenSHMEM is remotely accessible, so other PEs can read, write, and perform other operations on this memory.

OpenSHMEM has two forms of symmetric memory:

  • the data segment, which supports static and global variables
  • the symmetric heap, a dynamically allocated region that supports objects allocated at runtime via routines like shmem_malloc()

Every PE has a symmetric heap of the same size, though the heaps may begin at different addresses.  Objects are collectively allocated in each heap so that the set of objects has the same name and size on every PE.  This lets OpenSHMEM refer to a remote object by its local address plus the ID of the remote PE, which makes distributed programming feel very similar to shared memory programming.

OpenSHMEM provides the following operations:

  • Remote memory access: put, get, put-with-signal
  • Remote atomic: store, fetch, swap, fetch-and-add, and bitwise atomics (and, or, xor)
  • Synchronization: barrier, sync, wait, etc.
  • Collective: broadcast, alltoall, reduce, and collect (like MPI AllGather)
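
To make the symmetric heap and PE addressing concrete, here is a minimal host-side sketch using the standard OpenSHMEM C API (also callable from C++).  Each PE collectively allocates one element on the symmetric heap and then puts its own PE number into its right-hand neighbor’s copy; build and launch it with your implementation’s wrappers (for example, oshcc and oshrun in Sandia OpenSHMEM).

#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();
    int me = shmem_my_pe();               /* my PE number, 0 .. N-1 */
    int npes = shmem_n_pes();             /* total number of PEs    */

    /* Collective allocation on the symmetric heap: same object, same size, on every PE. */
    long *dest = (long *) shmem_malloc(sizeof(long));
    *dest = -1;

    long src = me;
    int right = (me + 1) % npes;

    shmem_barrier_all();                  /* every PE has initialized dest            */
    shmem_long_put(dest, &src, 1, right); /* remote write: local address + target PE  */
    shmem_barrier_all();                  /* all puts have completed                  */

    printf("PE %d of %d received %ld from its left neighbor\n", me, npes, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}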

GPU Programming with SYCL

SYCL is the evolutionary descendant of OpenCL. It provides a cross-platform, open C++ programming environment for applications wishing to harness the power of parallel computing on GPUs.  SYCL and its benefits are described in more detail in the following blog post and corresponding webinar: “Add Multiplatform Parallelism to C++ Workloads with SYCL.”

There are many resources for learning to program with SYCL, including the book:

J. Reinders et al., Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL, 2nd Edition (2023), Apress, ISBN 978-1-4842-9691-2

and the “Data Parallel C++: the oneAPI Implementation of SYCL*” resources landing page.

The general idea is that a program running on the CPU can launch parallel “kernels” on the GPU to do massively parallel computations.  SYCL also provides facilities, known as Unified Shared Memory (USM), for allocating memory and manually or automatically copying data between the host and GPU memory systems.

To create and initialize an array in GPU memory, one might write:

 

double *data = sycl::malloc_device<double>(10000, queue);
queue.parallel_for(sycl::range<1>(10000), [=](sycl::id<1> idx) {
    data[idx] = 0.0;  // initialize each element (any per-element expression works here)
}).wait();

 

This little program snippet is automatically compiled into GPU code and, when executed, might run on thousands of GPU threads, each simultaneously initializing a single array element.  For a problem this small there is little point to all that parallelism, but for large problems it can lead to enormous speedups.

Several interesting and practical examples of exploiting SYCL kernels on GPUs are highlighted on Intel’s blogs.

By leveraging SYCL, the oneAPI DPC++/C++ compiler environment enables direct programming of the Intel Data Center GPU Max Series, Intel’s lineup of GPU products targeting HPC and data centers.

Intel SHMEM

Intel SHMEM is a C++ software library that enables applications to use OpenSHMEM communication APIs with device kernels implemented in SYCL. It supports Intel HPC/AI-focused GPUs, beginning with the Intel® Data Center GPU Max Series.

The first open-source release of Intel® SHMEM v1.0.0 is now available!

Intel SHMEM implements a Partitioned Global Address Space (PGAS) programming model. It includes a subset of the host-initiated operations in the current OpenSHMEM standard, plus new device-initiated operations callable directly from GPU kernels. Below is a summarized list of features available in Intel SHMEM v1.0.0.

  • A complete specification detailing the programming model, supported API, example programs, build and run instructions, etc.  
  • Device and host API support for OpenSHMEM 1.5 compliant point-to-point Remote Memory Access (RMA), Atomic Memory Operations, Signaling, Memory Ordering, and Synchronization Operations.  
  • Device and host API support for OpenSHMEM collective operations.  
  • Device API support for SYCL work-group and sub-group level extensions of Remote Memory Access, Signaling, Collective, Memory Ordering, and Synchronization Operations.  
  • Support for C++ template function routines replacing the C11 Generic selection routines from the OpenSHMEM specification.
  • GPU RDMA (Remote Direct Memory Access) support when configured with Sandia OpenSHMEM with suitable Libfabric providers for high-performance networking services.  
  • Choice of device memory (default) or Unified Shared Memory for the SHMEM Symmetric Heap.   

In Intel SHMEM, OpenSHMEM APIs may be invoked from both host code and GPU code. OpenSHMEM functions called from GPU threads are available in both regular and “work group” extension versions.  The regular functions, such as ishmem_put, may be called independently from as many GPU threads as are available.  In contrast, the “work_group” variants, such as ishmemx_put_work_group, are called collectively by all the GPU threads in a SYCL work group, which cooperate to implement the functionality.
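
As a minimal sketch of what a device-initiated call looks like (a simplified example of ours, not one from the Intel SHMEM distribution), consider a kernel in which every GPU thread writes one element of a symmetric buffer on the next PE.  The routines ishmem_float_p and ishmem_barrier_all appear elsewhere in this post; the initialization, allocation, and query calls below follow the OpenSHMEM-style ishmem_* naming described in the specification, and the queue setup is simplified (see the README for how to select the device Intel SHMEM expects).

#include <ishmem.h>
#include <sycl/sycl.hpp>

int main() {
    ishmem_init();
    int me = ishmem_my_pe();
    int npes = ishmem_n_pes();
    int next = (me + 1) % npes;

    constexpr size_t N = 1024;
    // Symmetric allocation; by default the symmetric heap lives in device memory.
    float *buf = (float *) ishmem_malloc(N * sizeof(float));

    sycl::queue q;   // simplified: see the README for proper device and queue selection
    q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> idx) {
        size_t i = idx[0];
        // Device-initiated remote write: element i of the next PE's buf.
        ishmem_float_p(buf + i, (float) me, next);
    }).wait();

    ishmem_barrier_all();   // all PEs wait until every put has completed

    ishmem_free(buf);
    ishmem_finalize();
    return 0;
}

The same transfer could instead be issued cooperatively by an entire work group with the ishmemx_put_work_group variant, which lets all of the group’s threads participate in moving the data.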

The full source code of Intel SHMEM v1.0.0 can be accessed through the GitHub repo.

  • Instructions for how to build, install, and use Intel SHMEM are in README.md.
  • Please refer to our detailed guide on writing programs utilizing the Intel SHMEM APIs.
  • We also provide example codes demonstrating the usage of various Intel SHMEM APIs.

Using Intel SHMEM

Consider a program performing a distributed 2D stencil operation. In a stencil, each cell of a matrix is replaced, on each iteration, by a computation on itself and the surrounding cells.  (John Conway’s Game of Life from 1970 is a classic example.)

To do such a computation in a distributed way, each PE takes responsibility for a region of the matrix, such as a band or block. PEs must coordinate when working on the edges of their region because those areas are affected by neighboring PEs.

In the following code, a large matrix is distributed by row across a set of PEs. Each PE calculates updated values for its own share of the matrix and communicates data to PEs responsible for adjacent regions of the matrix for use in the next time step.  This is a common pattern in distributed computation. The region shared by multiple PEs is called the “halo.”

Let us look at part of the Jacobi method example application source code packaged with the Intel SHMEM distribution on GitHub.

Using Intel SHMEM, one way to accomplish these updates in SYCL is as follows:

 

h.parallel_for(
  sycl::nd_range<2>{global_range, local_range}, sum_norm,
                    [=](sycl::nd_item<2> it, auto &sumr) {
    size_t iy = it.get_global_id(1) + iy_start;
    size_t ix = it.get_global_id(0) + 1;
    if (iy < iy_end && ix < (nx - 1)) {
      const real new_val = static_cast<real>(0.25) *
        (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
         a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
 
      a_new[iy * nx + ix] = new_val;
 
      // apply boundary conditions
      if (iy_start == iy) {
        ishmem_float_p((real *) (a_new + top_iy * nx + ix), new_val,     
                        top_pe);
      }
      if (iy_end - 1 == iy) {
        ishmem_float_p((real *) (a_new + bottom_iy * nx + ix), 
                       new_val, bottom_pe);
      }
    }
  });

 

In this code, the updated value for a location is the average of the four neighboring values.  If the location is on the region's boundary, the update is also sent to the neighboring PE.

After each time step, all PEs synchronize using ishmem_barrier_all() to ensure that all the neighbor updates are done before a PE uses a value written by another PE.

This is fairly efficient code in a configuration where all the PEs are collocated on the same compute node, because each call to ishmem_float_p resolves to a single store instruction that writes the data directly into the neighboring PE’s GPU memory over Xe Link.

For configurations in which the PEs are not collocated, remote updates may take much longer than the computation for a single cell, so it is useful to batch the communication and run it in the background using the non-blocking work-group API. In this case, a SYCL work group computing a row on the edge of its region can use the Intel SHMEM routine ishmemx_put_work_group_nbi to send the whole row to the neighboring PE, as sketched below.
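
A hypothetical sketch of that pattern, paralleling the snippet above, is shown below.  It assumes a local range that maps each work group to a single matrix row (so the edge-row test is uniform across the group) and reuses the same variables (a_new, nx, iy_start, iy_end, top_iy, bottom_iy, top_pe, bottom_pe); the exact signature of ishmemx_put_work_group_nbi should be checked against the Intel SHMEM specification.

h.parallel_for(
  sycl::nd_range<2>{global_range, local_range},
  [=](sycl::nd_item<2> it) {
    size_t iy = it.get_group(1) + iy_start;  // assumed mapping: one work group per row
    auto grp = it.get_group();

    // ... stencil updates for this row into a_new, as in the snippet above ...

    sycl::group_barrier(grp);  // make sure the whole row is written before sending it

    // Work-group-collective, non-blocking put of the row's interior columns.
    if (iy == iy_start) {
      ishmemx_put_work_group_nbi(a_new + top_iy * nx + 1,
                                 a_new + iy * nx + 1, nx - 2, top_pe, grp);
    }
    if (iy == iy_end - 1) {
      ishmemx_put_work_group_nbi(a_new + bottom_iy * nx + 1,
                                 a_new + iy * nx + 1, nx - 2, bottom_pe, grp);
    }
  });

As in OpenSHMEM, the barrier between time steps also completes these outstanding non-blocking puts before any PE reads its halo rows.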

Start with Intel SHMEM Today

Get started today: use the Intel MPI Library, oneCCL, and Intel SHMEM to enable distributed HPC and AI applications on the wide range of GPUs and accelerator devices that support SYCL.

We look forward to your contributions and questions in the open-source Intel SHMEM GitHub project. If you would like to contribute to the project or raise any issues, please follow the guidelines provided in CONTRIBUTING.md.

For any other questions, please send an email to ishmem-discuss@intel.com.  

Additional Resources

Intel SHMEM Resources

 

               

 

About the Author
I've been working on HPC, especially communications, since 2004.