Unlocking the Next 35 Years of Software for HPC and AI

MaxTerry · ‎05-29-2024

At this year’s ISC High Performance 2024 in Hamburg, Andrew Richards, Founder and CEO of Codeplay, an Intel company, delivered a Special Session on Unlocking the Next 35 Years of Software for HPC and AI.

Just as ASCI Red disrupted supercomputing in 1997 by delivering a massively parallel system based on easily obtained “off-the-shelf” technology, the future will expand openness beyond CPUs to new hardware accelerators spanning multiple architectures and vendors, made accessible through entirely open software platforms.

Many HPC and AI workloads run best when deployed across a mix of architectures – CPUs, GPUs, and other accelerators. Yet the challenge remains that developers wanting to utilize the power of the highest-performing processors are often forced to compromise on openness. Different architectures have typically required unique languages, tools and libraries, but developers don’t want to have to program in different ways for different devices, which adds complexity and limits code reuse. This makes it difficult to take advantage of multiarchitecture systems and adopt new architectures, and inefficient to maintain code and optimize application performance. For example, developers using Nvidia GPUs have been locked into the CUDA stack.

oneAPI is a solution to these challenges: an open, standards-based, multiarchitecture programming model that provides true freedom of choice across accelerators while maximizing developer productivity. oneAPI gives developers a way to run their code on Nvidia and other vendor GPUs and accelerators – including Intel hardware.

The vision is to provide an open ecosystem of accelerator hardware and software, tested and validated, to make it easy for developers to choose the right technology for the job and bring new ideas to market faster. “It’s easier to integrate technologies when they’re open,” noted Richards. “Standards fit together. It’s harder with closed systems.”

Richards first became involved in the development of SYCL as a programming model for open portability. Intel helped take the movement a step further with the introduction of oneAPI and Intel toolkits – originally announced at Supercomputing 2019 – building on the SYCL programming model with the addition of libraries and implementations.

Now, this vision is becoming reality as the oneAPI specification has transitioned to fully independent open governance under the Linux Foundation, in the form of the Unified Acceleration (UXL) Foundation. Founding Members include Arm, Fujitsu, Google Cloud, Imagination Tech, Intel, Qualcomm, Samsung, and VMware.

Rod Burns, Chair of the UXL Foundation Steering Committee, joined Richards on stage to provide an update, noting that since the UXL Foundation was announced in September 2023, it has added eight contributing members, including RISC-V processor company Codasip and Mercedes-Benz, with another ten organizations in process.

Further evidence of the importance of open standards in software for AI and accelerated computing was provided by representatives of Argonne National Laboratory, Karlsruhe Institute of Technology, and Samsung.

Performance Portability

Kalyan Kumaran, Director of Technology and Senior Computer Scientist for HPC at Argonne National Laboratory, noted that while the Aurora supercomputer is based on Intel CPUs and GPUs, Argonne has different systems that use processors from many different vendors. He explained how all applications on the Aurora supercomputer use open standards and portable programming models to run applications that are also supported by other hardware. Approximately half leverage SYCL either directly or through higher-level programming models such as Kokkos, OCCA, and RAJA. PyTorch, TensorFlow, and Python are accelerated using oneAPI.

“Our software stack, whether it’s Nvidia or Intel GPUs, for the AI space, is very similar. So it’s very easy for our users to run their applications in any of the systems we have.”

For example, Argonne’s Hardware/Hybrid Accelerated Cosmology Code (HACC) simulates the complicated emergence of structure in the universe across cosmological history. Kumaran noted that HACC has been developed on CUDA, HIP, and SYCL 2020. He pointed out that “the SYCL developed code not only runs on Aurora but it can also run on the Nvidia GPUs and the AMD GPUs” and that “the native programming models supported on these platforms, be it CUDA or HIP, the SYCL version runs as fast.”

Developer Experience

oneAPI enables developers to write high-performance software on anything from a laptop to a desktop PC up to a supercomputer. Tobias Ribizel, Research Software Engineer and PhD candidate at Karlsruhe Institute of Technology, spoke to the developer experience with oneAPI. He demonstrated the Ginkgo software package for sparse linear algebra running on the integrated GPU of a laptop processor. His team started working with Intel on oneAPI and SYCL in 2020 and developed a separate backend that relies on SYCL to run on the latest Intel GPUs. He emphasized how the code was “compiled on a completely different server, with a completely different GPU, and is now running live on the … integrated GPU inside this laptop.”

Richards noted that this is an interesting example of GPU-accelerated computing on low-power devices, an area that is relatively unexplored, and that it points to the benefits of being able to develop software and libraries on different systems, from very small embedded devices to laptops to supercomputers. In fact, UXL Foundation members are innovating on automotive and other embedded processors.

“One of the things that we've done in the UXL Foundation is we've documented everything very carefully and we've built it in a way that can run on different hardware systems” – that includes multi-vendor CPUs and GPUs, Arm, RISC-V, and other processors.

Even if you are designing your own hardware, “We make it very easy for you to bring your new hardware to the UXL Foundation,” said Richards. One way is to take advantage of oneAPI using Codeplay’s oneAPI Construction Kit, which is open source and includes tools, documentation, and tutorials.


Bongjun Kum, Staff Researcher at the Samsung Advanced Institute of Technology, has brought SYCL, oneAPI, OpenMP, and other standards to innovative hardware solutions for processing in memory (PIM) and processing near memory (PNM) to alleviate memory bandwidth bottlenecks. He spoke specifically about his experience developing extensions to SYCL and mentioned that it was “quite easy because we decided to design processing in memory and processing near memory APIs aligned to SYCL group functions” with small changes.

Richards noted that this was an innovation that could be proposed to the SYCL standard and the UXL Foundation, “because when we designed SYCL, we weren't thinking about processing in memory,” which would make these SYCL extensions, with their special features for memory, valuable additions to the standards.

Check out the full session and visit uxlfoundation.org to learn more and explore how you can contribute your innovations!