
Developer Ecosystem Experience: Implementing Stencil Computations in SYCL

Rob_Mueller-Albrecht

At oneAPI DevSummit 2023 a few weeks ago, István Zoltan Reguly shared his experiences implementing mesh stencil loop computations using SYCL*. He is an Associate Professor leading the High-Performance Computing Lab at Pázmány Péter Catholic University’s Faculty of Information Technology.

In his talk, he reported on some of his findings as a very active contributor to the Intel® Software Innovator Program. He focuses on multidimensional parallel_for computations with or without nd_range, and different ways to avoid race conditions in unstructured meshes.

Let us find out how SYCL and other offload and distributed compute approaches interact with these types of compute problems.

In addition to SYCL, he also looks at different GPUs and alternative parallelization approaches such as OpenMP* and CUDA*, as well as the use of hipSYCL* (OpenSYCL*).

This blog gives a brief synopsis of some of the key findings that were presented.

Watch it Now:

You can watch the full video recording and download the accompanying presentation slides on the oneAPI DevSummit Landing Page located on the oneAPI Initiative Hub.

Motivation

The key motivation for getting hands-on, deep-dive experience with software programming abstraction frameworks is to understand the underpinnings, the plumbing, that makes frameworks like SYCL work. This plumbing is what allows SYCL to be used in the context of even higher-level abstractions.

The ultimate goal is to make life easier for software developers who intend to write truly portable code. In other words, he is looking to bridge the gap between high-level abstraction and lower-level implementation.

SYCL is one such implementation sharing István’s goal.

SYCL’s central purpose is to improve programmer productivity by letting developers write a single, reasonably high-level C++ implementation and then target multiple different architectures with it.

 


Figure 1: SYCL and its Open Standards Support for Many Hardware Platforms.

Achieving Code and Performance Portability

Of course, portability across architectures is a moving target and always evolving. It requires active community engagement from developers and researchers like István to make it happen.

In addition to solving the question of code portability across architectures, an accelerator offload framework like SYCL also needs to target performance portability. An abstraction layer inevitably carries some resource-management overhead; this overhead should be kept as small as possible without sacrificing ease of use.

Generally, that is easy to achieve for workloads that require only a single large data transfer from the host CPU to the target GPU, followed by computation on the GPU and a transfer of the results back to the host.

The challenge comes in when the computation model looks a little bit more complex, and thus more coordination is required. That is where architectural specifics, memory management, and thread or task scheduling can have a big impact. The more complicated your application, the more you need to play with how exactly you are going to express parallelism and exploit hardware features.

A parallel programming abstraction layer like SYCL needs to balance overhead against detailed architectural awareness in the device-specific runtime, all while ensuring efficient device queue dispatch, memory management, and stringent execution and data-access reliability controls.

In his investigation, István uses structured and unstructured mesh stencil loop implementations of varying complexity, based on common scientific reference implementations such as

  • CloverLeaf 2D/3D Models
  • OpenSBLI* Finite-Difference Equation Solvers
  • Acoustic Solvers
  • MG-CFD Multigrid Computational Fluid Dynamics Modelling

and others to approximate different levels of complexity.

Beyond CUDA, SYCL, hipSYCL, and OpenMP*, he also goes one step further and looks at distributed computing scenarios by adding MPI* to the mix.

Not surprisingly, the key finding is that the following aspects of workload execution in a distributed environment are critical, regardless of the architecture combination:

  • Efficiently managing data and keeping it local to the execution unit as much as possible
  • Managing the risk of race conditions, especially if the application uses unstructured mesh data
  • Access to atomics and advanced shared parallel memory management methods

Some parameters, the runtime libraries underpinning SYCL can guess reliably for the hardware you are running on. In István Reguly's study, the work-group size the runtime selected for multi-dimensional parallel_for loops was consistently a good one. So instead of fine-tuning this value for each target, a developer can leave it to the runtime, and more often than not it will make a very reasonable choice.

This short summary cannot do full justice to the research and testing done at Pázmány Péter Catholic University’s Faculty of Information Technology. Please have a closer look at the presentation to see the full depth and detailed findings.

One Common Code Base for All

The conclusion of the presentation is that a lot depends on the lower-level backend implementations and hardware-specific runtimes, regardless of whether they sit beneath OpenCL or SYCL. That said, the abstraction SYCL provides maps down nicely to GPUs as well as to the vectorization and accelerator units of the CPU.

SYCL can deliver on performance and match vendor-specific programming abstractions.

You can indeed run a single set of source code on all these different types of architectures.

Watch it Now and Try for Yourself

A lot of detailed analysis went into the findings presented here.

Please watch the full video recording to get the next level of detail on all the findings discussed in this oneAPI DevSummit presentation.

 

Join us on this journey: download and try oneAPI, the Intel® oneAPI DPC++/C++ Compiler with SYCL support, and the Intel® oneAPI DPC++ Library (oneDPL) yourself.


Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Results may vary.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. 
*Other names and brands may be claimed as the property of others. 

 

About the Author
Rob enables developers to streamline programming efforts across multiarchitecture compute devices for high-performance applications, taking advantage of Intel's family of development tools. He has 20+ years of experience in technical consulting, software architecture, and platform engineering, working in IoT, edge, embedded software, and hardware developer enablement.