
Add Multiplatform Parallelism to C++ Workloads with SYCL

Easy CUDA-to-SYCL migration with SYCLomatic and the Intel® DPC++ Compatibility Tool

 

Authors: Nikita Sanjay Shiledarbaxi, Rob Mueller-Albrecht

SYCL is an open, standard, C++-based framework for parallel programming. It lets you exploit parallelism across multiple hardware architectures and gives you the flexibility to accelerate your C++ application on hardware from multiple vendors. This contrasts with proprietary frameworks such as CUDA* that tie you to the capabilities of vendor-specific hardware.

Adding multiplatform parallelism to your C++ code base lets you use the processing power of multiple cores across heterogeneous accelerators. Our recent webinar, "CUDA* to SYCL: Add Multiplatform Parallelism to Workloads," covered how to add cross-architecture parallelism to your C++ application. It revolved around:

  • Benefits of adopting SYCL
  • Three simple SYCL kernel concepts: parallel_for, nd_range and hierarchical parallel kernels
  • CUDA to SYCL code migration using SYCLomatic or Intel® DPC++ Compatibility Tool

This blog will go over some of the highlights of the webinar. See the complete recording here.


Benefits of Adopting SYCL

Before we get into the CUDA-to-SYCL migration details, you may wonder, "Why SYCL?" Here are some of the key reasons SYCL plays an important role in democratizing highly parallel offload compute and in opening its benefits to a diverse, customizable range of platform configurations.

Hardware Portability and Vendor Neutrality

Hardware portability is a key advantage of SYCL. Unlike the CUDA programming model, which is restricted to NVIDIA GPUs, SYCL is designed for parallel computing across heterogeneous accelerators such as CPUs, GPUs, and FPGAs from different manufacturers. Writing common code that targets multiple hardware platforms makes your application versatile and future-proof.
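
As a minimal sketch of this portability (assuming a SYCL 2020 compiler such as Intel® oneAPI DPC++/C++; the devices listed depend entirely on your system), the same few lines of code discover and target whatever accelerators are present:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // List every device the SYCL runtime can see: CPUs, GPUs, FPGAs, ...
  for (const auto &dev : sycl::device::get_devices())
    std::cout << dev.get_info<sycl::info::device::name>() << "\n";

  // A queue bound to the default device; with the appropriate backends
  // installed this could be an Intel, NVIDIA, or AMD accelerator, with
  // no change to the code.
  sycl::queue q{sycl::default_selector_v};
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
}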

Open Standards and Interoperability

SYCL is based on open standards developed by the Khronos Group. Its compatibility with a range of other industry-standard frameworks (depicted in Fig.1 below) makes it easy to integrate SYCL code with your existing code and libraries. SYCL thus fits into an ecosystem built for choice: implement parallel computing once and reuse the code assets across platforms.


Fig.1: A growing ecosystem of SYCL

Now that you know the major advantages of adopting SYCL, read further to explore how to add parallelism to your workload.

Two Ways to Add SYCL-Based Data Parallelism

 

1. Addition of SYCL Kernels

The simplest way to add parallelism to your C++ workload is to add a SYCL kernel[1] based on the SYCL kernel APIs. A SYCL kernel creates multiple instances of a single operation that run simultaneously, resulting in parallel execution. SYCL uses the concept of a 'queue', which resembles a CUDA stream: each queue maps to a specific device, so kernels submitted to a queue are executed by that target device.
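
As a minimal sketch of this queue-to-device mapping (the selector choice and buffer size here are illustrative):

#include <sycl/sycl.hpp>

int main() {
  // A queue is bound to one device; everything submitted to it runs there.
  sycl::queue q{sycl::gpu_selector_v};   // or cpu_selector_v / default_selector_v

  constexpr size_t N = 1024;
  int *data = sycl::malloc_shared<int>(N, q);

  // parallel_for creates N instances of the lambda (one per work item);
  // the queue executes them, in parallel, on its device.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
    data[i] = 2 * static_cast<int>(i[0]);
  }).wait();

  sycl::free(data, q);
}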

Kernels are available in different dimensions (1D, 2D, or 3D) and with support for different features. The webinar illustrates how a simple 2D matrix-multiplication example can be transformed for more efficient parallel offload computation using the three types of kernels listed below (a short code sketch appears after the list):

  1. An easy-to-implement basic parallel kernel suitable for algorithms relying on very straightforward data parallelism. (Watch the webinar from [0:09:37])
  2. The nd_range kernel lets you control how the work items handed over to the target device are organized and managed. (Watch the webinar from [0:11:11])
  3. A hierarchical parallel kernel that provides a more structured top-down approach to express parallel loops compared to the nd_range kernel. (Watch the webinar from [0:13:00])

Various other kernels are available, too. Each can be suitable for different algorithms. You can choose an appropriate kernel or a combination of kernels based on your application requirements.
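
As a rough sketch of the first two kernel styles applied to the 2D matrix-multiplication example (matrix and work-group sizes are illustrative, N is assumed to be a multiple of the work-group size, and a, b, c are assumed to be USM device or shared allocations):

#include <sycl/sycl.hpp>

// C = A x B for N x N row-major matrices.
void matmul_basic(sycl::queue &q, const float *a, const float *b, float *c, size_t N) {
  // 1) Basic parallel kernel: one work item per output element.
  q.parallel_for(sycl::range<2>{N, N}, [=](sycl::id<2> idx) {
    const size_t row = idx[0], col = idx[1];
    float sum = 0.f;
    for (size_t k = 0; k < N; ++k)
      sum += a[row * N + k] * b[k * N + col];
    c[row * N + col] = sum;
  }).wait();
}

void matmul_nd_range(sycl::queue &q, const float *a, const float *b, float *c, size_t N) {
  // 2) nd_range kernel: the same work, but explicitly grouped into
  //    16 x 16 work-groups so you control how work items are organized.
  constexpr size_t WG = 16;                    // work-group size (illustrative)
  sycl::nd_range<2> ndr{{N, N}, {WG, WG}};     // {global range, local range}
  q.parallel_for(ndr, [=](sycl::nd_item<2> item) {
    const size_t row = item.get_global_id(0), col = item.get_global_id(1);
    float sum = 0.f;
    for (size_t k = 0; k < N; ++k)
      sum += a[row * N + k] * b[k * N + col];
    c[row * N + col] = sum;
  }).wait();
}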

Check out the article: How to Extend C++ Applications to your GPU with SYCL

 

2. CUDA to SYCL Migration

Another way to shift to a SYCL-based parallel programming paradigm is to migrate your existing CUDA code to SYCL. The easiest way to do so is with the automated migration tool, which comes in two flavors:

  1. SYCLomatic: an open-source migration tool
  2. Intel® oneAPI DPC++ Compatibility Tool: an Intel® product version of SYCLomatic

Read further to dive deeper into the migration tools.

DPC++ is oneAPI's implementation of SYCL. Check it out here.

 

Note: Both tools are functionally similar, with minor differences in the rollout of routine updates and bug fixes.

CUDA to SYCL Migration Workflow

The complete migration process spans five steps, as shown in Fig. 2. It starts with some preparatory work: identifying the CUDA source files. Next comes an optional step in which you create a JSON-formatted compilation database using the intercept-build tool+. The migration tool then automatically migrates most of the CUDA files (typically 90%-95% [3] of the CUDA code) to oneAPI's implementation of SYCL, while non-CUDA files are kept unchanged. The migrated output may require some plumbing work for functional correctness.


 Fig.2: Step-by-step CUDA to SYCL migration procedure

Check out the webinar from [0:18:20] to understand the code migration steps in detail.
+ The migration tool provides an intercept-build script that keeps track of all the build-process details. It writes the compilation options, macro definitions, and include paths to a JSON-formatted compilation database. The database captures the exact build settings and helps the migration tool understand dependencies. The intercept-build utility is Clang-based and works with make and cmake build environments. Watch the webinar from [0:21:29] to learn more about this optional step in the migration process.
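
A minimal command-line sketch of these steps for a make-based project (paths and option values are illustrative; dpct is the Intel DPC++ Compatibility Tool driver, and SYCLomatic ships an equivalent c2s command):

# Optional: record the build and produce compile_commands.json
intercept-build make

# Migrate the CUDA sources; non-CUDA files are copied through unchanged
dpct -p=compile_commands.json --in-root=. --out-root=migrated

# Review the generated .dp.cpp files and the DPCT warnings, then build
# the migrated code with a SYCL-enabled compiler
icpx -fsycl migrated/*.dp.cpp -o app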

 

Code Migration Example

Fig.3 below shows a simple GPU-offload vector-add code snippet in both its original CUDA form and its SYCLomatic-migrated form.


Fig.3: Example of Vector-Add code routine migrated from CUDA to SYCL
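
As a hand-simplified sketch of the idea in Fig.3 (not the exact code from the webinar; real SYCLomatic output typically also pulls in its dpct helper headers), a CUDA vector-add kernel and its SYCL counterpart look roughly like this:

// Original CUDA kernel (simplified)
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}
// ... launched with: vector_add<<<num_blocks, block_size>>>(a, b, c, n);

// SYCL counterpart (simplified); a, b, c are USM allocations
#include <sycl/sycl.hpp>

void vector_add(sycl::queue &q, const float *a, const float *b, float *c, int n) {
  q.parallel_for(sycl::range<1>{static_cast<size_t>(n)}, [=](sycl::id<1> i) {
    c[i] = a[i] + b[i];
  }).wait();
}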

Watch the webinar from [0:27:25] to understand the code migration of the Vector-Add routine.
Check out the GitHub repo for the vector-add code sample.

 

The webinar from [0:44:48] briefly demonstrates migrating a CUDA code sample of an N-body [2] simulation to SYCL. The complete project is available on the Codeplay GitHub repo and the oneAPI GitHub repo.

Migration of CUDA Library Calls

The migration tools also migrate calls to CUDA Toolkit libraries such as cuBLAS, cuRAND, cuFFT, and CUB. The Intel® oneAPI Math Kernel Library (oneMKL) provides a SYCL API that makes migrating CUDA math library calls to oneMKL function calls easy and transparent.
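
For example, a cuBLAS SGEMM call maps onto the oneMKL SYCL BLAS API roughly as follows (a sketch with column-major N x N matrices in USM memory; argument handling in real migrated code may differ):

#include <cstdint>
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>

// C = alpha * A * B + beta * C -- rough equivalent of
// cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, ...).
void sgemm(sycl::queue &q, const float *a, const float *b, float *c,
           std::int64_t n, float alpha, float beta) {
  using oneapi::mkl::transpose;
  oneapi::mkl::blas::column_major::gemm(q, transpose::nontrans, transpose::nontrans,
                                        n, n, n, alpha, a, n, b, n, beta, c, n)
      .wait();
}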

NOTE: The list of CUDA library calls that the migration tools can automatically migrate is expanding rapidly.

Visit the APINames<lib>.inc files at https://github.com/oneapi-src/SYCLomatic/tree/SYCLomatic/clang/lib/DPCT to check the currently supported libraries.

Learn more about the oneMKL SYCL API through the oneAPI GitHub repo and the oneMKL Developer Reference Guide.
Check out the oneMKL documentation and our recent oneMKL blog.

 

Inspect the Migration Output

The output file of the migration tool contains some verbose hints and comments for you to review and hand-tune the migrated code. Some warnings and diagnostic codes (also called diagnostic reference numbers) inform you about aspects such as hardware-dependent API logic in the migrated code and recommendations to validate workgroup sizes for different architectures. The diagnostic reference webpage helps you interpret each diagnostic code and guides you on how to solve the stated issue.
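
For instance, the tool may leave a comment like the one below above a kernel launch (illustrative; the exact IDs and wording are documented on the diagnostic reference page, and the variable and function names here are hypothetical):

/*
DPCT1049: The work-group size passed to the SYCL kernel may exceed the limit.
To get the device limit, query info::device::max_work_group_size.
Adjust the work-group size if needed.
*/
q.parallel_for(
    sycl::nd_range<1>(sycl::range<1>(num_blocks * block_size),
                      sycl::range<1>(block_size)),
    [=](sycl::nd_item<1> item) { vector_add(a, b, c, n, item); });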

What's Next?

Get started with the SYCLomatic tool and the Intel® oneAPI DPC++ Compatibility Tool today! Easily migrate your CUDA code to SYCL and take advantage of accelerated parallel computing across multi-vendor hardware architectures. Below are some additional resources for CUDA-to-SYCL migration.

We also encourage you to explore the other AI, HPC, and rendering tools in Intel's oneAPI-powered software portfolio.

Get The Software

The Intel DPC++ Compatibility Tool is included as part of the Intel oneAPI Base Toolkit. The SYCLomatic project is available on GitHub.

Acknowledgment

We would like to thank Chandan Damannagari for his contribution to this blog.

[1] A kernel is an abstraction for expressing parallelism and leveraging the hardware resources of the target device.

[2] N-body is a popular algorithm for simulating the motion and interaction of celestial bodies in a fictitious galaxy.

[3] Intel estimates as of March 2023. Based on measurements on a set of 85 HPC benchmarks and samples, with examples like Rodinia, SHOC, PENNANT. Results may vary.

 

About the Author
Technical Software Product Marketing Engineer, Intel