Harnessing the Power of Heterogeneous Computing with SYCL and oneAPI

Anshul_Gupta
Employee

Introduction

Migara Amarasinghe, a PhD candidate in Electrical Engineering at Florida State University, has had a lifelong fascination with how computers work. From building personal computers in his childhood to conducting advanced research in High-Performance Computing (HPC) and performance analysis of AI algorithms in heterogeneous computing environments, his journey has spanned a wide range of experiences. In addition to his doctoral research with his advisor, Dr. Simon Foo, Migara currently works as a Research Assistant in AI and a Teaching Assistant in microprocessor-based system design at the FAMU-FSU College of Engineering. Under the supervision of Dr. Shonda Bernadin, he also dedicates time to mentoring community engineering programs and fostering diversity and inclusion within the engineering field. Outside of his academic pursuits, Migara pursues his passion for music as a guitarist/vocalist in a local band in Tallahassee, Florida.

Student Ambassador Experience

Migara’s involvement in the Intel Student Ambassador program has been a unique and enjoyable journey. The opportunity to network with fellow student ambassadors, establish connections with Intel employees, and participate in diverse virtual workshops/hackathons offered him enriching experiences. In addition, under Dr. Bernadin’s guidance and with Intel Corporation’s support, Migara led a two-day Intel workshop at the College of Engineering. He hosted a comprehensive session on “Introduction to oneAPI and High-Performance Computing with oneAPI,” which involved facilitating hands-on activities and quizzes, thereby encouraging active student participation.

CUDA to SYCL

This project revolved around a central theme: benchmarking and analyzing performance across various hardware configurations. Migara continues to gather insights from this ongoing project, which he intends to incorporate into a chapter of his dissertation dedicated to oneAPI. His study used SYCLomatic, an open-source tool for migrating NVIDIA CUDA code to C++ SYCL, which enables the use of accelerators from multiple vendors such as Intel, NVIDIA, and AMD.

The project followed a sequence of steps. It began with installing the SYCLomatic tool on a CUDA development machine and migrating a CUDA program that performs Hermitian matrix multiplication, which also served to illustrate the tool's various command-line options. Migara then compared the migrated SYCL code with its original CUDA counterpart (which uses the cuBLAS library and an external utility header file) to understand the efficiency of the transition. Certain parts of the migrated code, including the external header file, required manual intervention and modification; the GitHub repository shows these manual changes with comments. The final steps were compiling and executing the migrated C++ SYCL code on multiple consumer-level Intel and AMD CPUs and Intel and NVIDIA GPUs, as well as on an Intel NUC minicomputer.

After confirming that the migrated code's output matched the original CUDA output, Migara began a comprehensive performance benchmarking process. He tested the migrated code across different hardware combinations and operating systems, including Linux and Windows 10 and 11, to analyze it under a wide range of conditions. He also used the Intel Developer Cloud to measure and analyze performance on its cloud-based hardware platforms, which further strengthened the study.

Figure 1: Workflow of CUDA to SYCL migration using the SYCLomatic tool.
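
The output check described above (confirming that the migrated code produces the same result as the original) can be sketched in plain C++. This is a minimal illustration, not the code from the project's repository: the function names hemm_reference, max_abs_diff, and outputs_match, the row-major layout, and the 1e-4 tolerance are all assumptions made for the example. It computes C = A * B on the CPU for a Hermitian A, reading only the lower triangle of A, and compares two result buffers element by element.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// Reference C = A * B on the CPU. A is an n x n Hermitian matrix and B is
// n x n, both stored row-major. Only the lower triangle of A is read; the
// upper triangle is reconstructed as the conjugate of the lower one.
std::vector<cplx> hemm_reference(const std::vector<cplx>& a,
                                 const std::vector<cplx>& b,
                                 std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<cplx> c(n * n, cplx{0.0f, 0.0f});
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            // Element (i, k) of A, taken from the lower triangle.
            const cplx a_ik =
                (k <= i) ? a[i * n + k] : std::conj(a[k * n + i]);
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
    return c;
}

// Largest element-wise absolute difference between two result buffers.
float max_abs_diff(const std::vector<cplx>& x, const std::vector<cplx>& y) {
    assert(x.size() == y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        worst = std::max(worst, std::abs(x[i] - y[i]));
    }
    return worst;
}

// True when two outputs agree within a tolerance (the default is an
// assumption for this sketch, not a value taken from the project).
bool outputs_match(const std::vector<cplx>& x, const std::vector<cplx>& y,
                   float tol = 1e-4f) {
    return x.size() == y.size() && max_abs_diff(x, y) <= tol;
}
```

In practice, the two buffers passed to outputs_match would be the results copied back from the CUDA run and from the SYCL run; a tolerance is used rather than exact equality because different devices and libraries may round floating-point sums differently.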

Challenge

The main challenge that motivated Migara was the inefficiency and incompatibility issues that arise in heterogeneous computing environments. With hardware accelerators becoming increasingly diverse, ranging from CPUs to GPUs and even FPGAs, optimizing performance and productivity across this diverse hardware is a critical issue. Particularly, the transition from vendor-specific programming models like NVIDIA CUDA to a more universal standard like C++ SYCL could be fraught with performance and efficiency bottlenecks.

Migara’s objective was to take on these challenges head-on by using SYCLomatic and the Intel® oneAPI Base Toolkit to migrate NVIDIA CUDA code to C++ SYCL and to benchmark its performance across different hardware configurations and operating systems. His motivation emerged from the belief that effective use of heterogeneous computing environments is crucial for the future of High-Performance Computing, and that optimizing code for these environments can significantly improve the overall performance and efficiency of these systems.

Why it Matters

The significance of benchmarking and performance analysis cannot be overstated in our increasingly digital world. With the large number of hardware configurations and the explosion of high-performance computing demands, understanding how applications perform across these different environments is vital. It allows developers and researchers to optimize their programs, ensuring they can run as efficiently as possible on a variety of setups. Additionally, it offers insights into the capabilities of different hardware accelerators, thus aiding decisions on system design and allocation of computing resources. In a nutshell, projects like Migara's are paving the way for more efficient, robust, and diverse high-performance computing environments.

Solution/Results

A portion of the project focused on evaluating the performance of an application that performs Hermitian matrix multiplication (HEMM), a matrix product in which one of the two matrices is Hermitian. Because the output of the original CUDA code matched that of the migrated SYCL code, Migara went on to run benchmarks under two distinct scenarios. In the first scenario, the matrix size was set to 512x512 and tested with varying iteration counts: 1, 10, 500, and 10,000. See Figure 2.

Figure 2: 512x512 Matrix multiplication (HEMM) benchmark.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
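
The iteration sweep used in these benchmarks (1, 10, 500, and 10,000 repetitions of the same multiplication) can be sketched as a small timing harness. This is an illustrative sketch only: the names BenchmarkResult, time_iterations_ms, and run_benchmark are hypothetical and the project's actual measurement code may differ. It uses std::chrono::steady_clock to time a caller-supplied callable.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// One row of a benchmark table: how many repetitions, and how long they took.
struct BenchmarkResult {
    std::size_t iterations;
    double total_ms;
};

// Run `work` `iterations` times and return the elapsed wall-clock time in
// milliseconds. For device code, `work` should block until the device has
// finished (for example by waiting on the SYCL queue); otherwise only the
// submission time is measured.
double time_iterations_ms(const std::function<void()>& work,
                          std::size_t iterations) {
    assert(static_cast<bool>(work));
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Time `work` once per entry in `iteration_counts`. The default counts are
// the ones quoted in the text.
std::vector<BenchmarkResult> run_benchmark(
    const std::function<void()>& work,
    const std::vector<std::size_t>& iteration_counts = {1, 10, 500, 10000}) {
    std::vector<BenchmarkResult> results;
    for (std::size_t n : iteration_counts) {
        results.push_back({n, time_iterations_ms(work, n)});
    }
    return results;
}
```

Timing the whole loop rather than a single call is what makes fixed one-time costs visible: at 1 or 10 iterations they make up most of the total, while at 10,000 iterations they are spread over many calls.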

In the second scenario (see Figure 3), the matrix size was reduced to 10x10 with the same iteration counts. The programs were executed in different environments, such as Windows 11 and Linux via WSL2. Testing used an assortment of hardware, including an NVIDIA RTX 3090, NVIDIA TITAN Xp, AMD Ryzen 9 5900X, and a 13th Gen Intel Core i9-13900K. Among the CPUs, the Intel Core i9-13900K showed the best performance; among the GPUs, the NVIDIA RTX 3090 did.

These runs provided meaningful insights into the performance characteristics of different hardware and software setups. For instance, with the matrix size set to 512x512, SYCL implementations running without a GPU (on the Intel Core i9-13900K and AMD Ryzen 9 5900X workstations and the Intel NUC 12 Pro) were significantly faster than the CUDA version when the iteration count was low (1 and 10). They showed a roughly 73% to 97% improvement; put differently, SYCL execution on the Intel Core i9-13900K was approximately 39 times faster than the CUDA execution. GPU performance at low iteration counts was underwhelming, which is assumed to be due to kernel launch overhead. As the iteration count increased, the SYCL versions without GPU support became slower than the CUDA version with GPU support.
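
The two ways of stating the result (a percentage improvement and an "N times faster" factor) are related by simple arithmetic. The sketch below assumes that "improvement" means the percentage reduction in execution time relative to the CUDA run; the post does not define the term, and the function names are hypothetical. Under that assumption, a 39x speedup corresponds to about a 97.4% reduction, and a 73% reduction corresponds to about a 3.7x speedup.

```cpp
#include <cassert>

// Percentage of execution time saved when going from baseline to candidate.
double percent_reduction(double baseline_ms, double candidate_ms) {
    assert(baseline_ms > 0.0);
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms;
}

// How many times faster the candidate is than the baseline.
double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}

// Convert an "N times faster" factor into a percentage time reduction.
double reduction_from_speedup(double factor) {
    assert(factor > 0.0);
    return 100.0 * (1.0 - 1.0 / factor);
}

// Convert a percentage time reduction (below 100) into a speedup factor.
double speedup_from_reduction(double percent) {
    assert(percent < 100.0);
    return 100.0 / (100.0 - percent);
}
```

Because the percentage saturates toward 100% as the speedup grows, large factors are easier to compare as ratios: the gap between 97% and 99% is the gap between roughly 33x and 100x.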

Figure 3: 10x10 Matrix multiplication (HEMM) benchmark.

Migara also used the Intel Developer Cloud to execute the ported SYCL program across a range of hardware setups. The performance on the Intel® Data Center GPU Max 1100 with 4th Gen Intel® Xeon® processors - 1100 series (4x) was notably superior to that of the consumer-grade configurations tested locally. Moreover, when the previously mentioned consumer-grade hardware configurations were benchmarked on different operating systems, both the CUDA and SYCL programs ran faster on Linux than on Windows 11.

Visit GitHub to see other GPU-enabled SYCL implementations.

Figure 4: 512x512 Matrix multiplication (HEMM) benchmark on Intel Developer Cloud

Performance profiling and bottleneck identification were made possible with the Intel VTune Profiler, while Intel Advisor helped to identify high-impact optimization opportunities in the design. These tools collectively facilitated a productive project experience.

Figure 5: Intel VTune Profiler displaying Hotspots.

Figure 6: Intel VTune Profiler displaying CPU utilization.

Figure 7: Intel Advisor Survey & Roofline.

Conclusion

Migara’s project on performance analysis and benchmarking is not only broadening his understanding of different hardware configurations but also strengthening a sub-area of his PhD dissertation. The project employed a range of tools that contributed significantly to the experiments. The oneAPI toolkit and HPC toolkit were pivotal, providing a unified programming model and high-performance resources, respectively, for efficient work across diverse hardware architectures. SYCLomatic was used to migrate NVIDIA CUDA code to C++ SYCL, streamlining the code porting process and enabling interoperability with a range of hardware accelerators. In the pre-migration phase, the NVIDIA CUDA Toolkit was used for code development, optimization, and debugging.

The project is ongoing, and several enhancements identified through analysis with Intel VTune Profiler and Intel Advisor are yet to be implemented. Migara intends to conduct a fine-grained analysis of the performance results, comparing specific CUDA and GPU-enabled SYCL codes to identify significant performance shifts after migration. Expanding the range of tested hardware, and benchmarking on new instances as they are introduced to the Intel Developer Cloud, can add robustness to the findings and better simulate real-world applications. He also plans to put SYCLomatic to the test with larger, more complex CUDA codes to understand how the tool handles extensive applications.

Want more?

Please see the upcoming GitHub repository for more information on this project and future projects, and for tutorials on how to configure and run the environments.

https://github.com/myasaswin/CUDAtoSYCL

Get The Software

The Intel DPC++ Compatibility Tool is included as part of the Intel oneAPI Base Toolkit. The SYCLomatic project is available on GitHub.

Intel® oneAPI Base Toolkit

Develop performant, data-centric applications across Intel® CPUs, GPUs, and FPGAs with this foundational toolkit.

Thank you to Ugonna Chikezie for contributions to this blog.
