HPC Workload Convergence Paves the Way for AI in the Exascale Era

Trish_Damkroger · 05-15-2020

Today, we are witnessing a significant shift in how high performance computing (HPC) and artificial intelligence (AI) workloads converge, creating a powerful synergy between the two. The convergence is happening now in fields such as climate research and health and life sciences. Examples include:

Climate modeling and simulation, where AI is being integrated into HPC workflows to accelerate weather pattern detection

Cancer research and drug discovery efforts at leading research facilities that perform molecular dynamics simulations combined with unsupervised machine learning to inform the next logical series of simulations

Supervised machine learning models that predict how tumors undergoing analysis respond to drugs, based on the drugs' properties

CERN, the European Organization for Nuclear Research, which has demonstrated that models based on generative adversarial networks (GANs) can run faster than traditional first-principles models

With so many use cases, organizations today need the flexibility to choose an architecture that meets their unique needs.

While the benefits of convergence are clear, it will take time to overcome the challenges. The HPC community has traditionally focused on compute-centric workflows where distributed memory algorithms, double-precision floating-point math, and large-scale data storage take precedence. In contrast, the AI community builds upon the analysis and interpretation of a multi-sourced data deluge. It focuses on data-centric workflows where reduced-precision math takes center stage, along with a range of applications from image classification, translation, and recommendation engines to autonomous driving.
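To make the precision contrast concrete, here is a minimal sketch that is not part of the original post. It uses only Python's standard library to round values to IEEE 754 half precision (one common reduced-precision format) and compares a running sum against ordinary double precision. The helper names `to_half` and `accumulate` are hypothetical, chosen for illustration only.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float (IEEE 754 double) to the nearest half-precision value."""
    # The "e" format code packs a 16-bit IEEE 754 binary16 float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(step: float, count: int, use_half: bool) -> float:
    """Add `step` to a running total `count` times.

    When use_half is True, the increment and every intermediate total are
    rounded to half precision, mimicking a reduced-precision accumulator.
    """
    increment = to_half(step) if use_half else step
    total = 0.0
    for _ in range(count):
        total = total + increment
        if use_half:
            total = to_half(total)
    return total


double_sum = accumulate(0.1, 10_000, use_half=False)
half_sum = accumulate(0.1, 10_000, use_half=True)

print(f"double precision total: {double_sum:.6f}")
print(f"half precision total:   {half_sum:.6f}")
```

In double precision the total lands very close to 1000. In half precision the total stops growing once it is large enough that adding 0.1 rounds back to the same value. That behavior is one reason simulation codes have traditionally insisted on double precision, while many AI workloads tolerate reduced precision (often with a wider accumulator) in exchange for speed and memory savings.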

These differences could lead to two ecosystems expanding separately rather than one. Instead, we need to bring them together to tackle new and more complex challenges. It’s not just HPC that benefits from this convergence, though. As AI models continue to grow in size and in their need for compute power, training workloads will need to rely more heavily on the technologies underlying HPC, such as MPI and high-performance fabrics. Over time, the convergence will tie the two ecosystems together to their mutual benefit.
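One illustration of why large-scale training leans on HPC communication technology is the allreduce collective, which MPI provides as MPI_Allreduce and which data-parallel training commonly uses to sum gradients across workers. The post itself does not name a specific collective, so treat the following as an assumption-laden sketch: a pure-Python, single-process simulation of the well-known ring allreduce pattern, with a hypothetical function name `ring_allreduce`. It does not call MPI or touch a real fabric.

```python
from typing import List


def ring_allreduce(vectors: List[List[float]]) -> List[List[float]]:
    """Simulate a ring allreduce (sum) over one vector per worker.

    Every worker ends up with the element-wise sum of all input vectors.
    For simplicity, the vector length must be divisible by the worker count.
    """
    n = len(vectors)
    length = len(vectors[0])
    assert all(len(v) == length for v in vectors), "vectors must match in length"
    assert length % n == 0, "length must be divisible by the number of workers"
    size = length // n

    # chunks[r][c] is worker r's local copy of chunk c.
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in vectors]

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so sends are "simultaneous".
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n][:]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, all-gather: pass the finished chunks around the ring so that
    # every worker receives every fully summed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n][:]) for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]


# Three simulated workers, each holding a local "gradient" of six values.
grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
]
reduced = ring_allreduce(grads)
print(reduced[0])
```

In the ring pattern each worker only ever exchanges one chunk with its neighbor per step, so per-worker traffic stays roughly constant as workers are added. The pattern therefore rewards low-latency, high-bandwidth interconnects, which is the kind of infrastructure HPC systems already provide.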

Intel recognizes that a critical element of accelerating this convergence is building an infrastructure composed of technologies that deliver high performance across a broad range of workloads, including AI. The Intel® Xeon® Scalable processor does just that. Innovations such as Intel® AVX-512 and Intel® DL Boost technology offer instructions that help software make optimal use of the hardware. Currently, over 20 million developers worldwide code for the Intel Architecture.

Intel also recognizes that, given the evolving and nascent nature of AI algorithms and the demand for varying power and performance curves, there is a need for other compute architectures such as FPGAs, GPUs, and ASICs. This need is further driving Intel’s investments in these areas.

Empowering convergence

Exascale systems, being designed today for delivery soon, will enable innovation and discovery across a range of fields, while further accelerating the convergence of HPC and AI. However, I think we are still in the early stages of a fully HPC-AI converged world, where smart cities are commonplace and autonomous driving grows in adoption. These changes will require a paradigm shift in system design.

Some of the critical areas where I believe this will take place are:

A strong focus on workflow optimization will help organizations get the most from their infrastructure. Dynamic reconfigurability and composability will facilitate a broad set of evolving workloads powered by diverse compute architectures. oneAPI will offer developers additional resources to reduce the complexities associated with application development and optimization in heterogeneous environments.

An expanded memory and storage hierarchy will provide access to a larger memory pool to reduce I/O bottlenecks without trade-offs between performance, capacity, and capability. Distributed Asynchronous Object Storage (DAOS), the open source Exascale-capable storage stack, aids in this advancement with the low latency, high bandwidth, and I/O throughput needed to speed converged workflows.

Simplified software abstraction with hardware-agnostic programming models will help facilitate application portability and maintainability. Developers also gain greater freedom in coding, knowing their applications can run effectively across different architectures.

A new intelligent system software paradigm will manage all aspects of the infrastructure, with orchestration driven by AI and machine learning. This includes discovery and management of resources, dynamic application-aware provisioning, and real-time manageability via industry-standard APIs.

Bringing the convergence at Exascale to reality

We are excited to partner with Argonne National Laboratory (Argonne) to enable it to deploy one of the first US Exascale systems, named “Aurora.” Aurora is anticipated to deliver an exaflop of performance, along with an amazing performance increase on traditional HPC, AI, and high performance data analytics applications in comparison with Argonne’s existing system.[1]

Additionally, Argonne expects that the mix of applications running on Aurora will change over the next three to five years, with machine learning applications making up as much as 40% of the supercomputing jobs run on Aurora. Leading research and academic programs are engaged to harness the capabilities of Aurora and enable the software ecosystem, all while driving an unprecedented level of innovation across HPC, machine learning, and data analytics. The foundation of Aurora will be a future Intel® Xeon® Scalable processor codenamed “Sapphire Rapids.” In addition, Aurora will include a GPU based on the Intel® Xe architecture, codenamed “Ponte Vecchio” and optimized for HPC and AI; Intel® Optane™ persistent memory technology; and oneAPI, the unified programming model based on open standards.

In the coming weeks, I will be writing two more blog posts that walk through some of the other innovations we are driving across software, heterogeneous computing, and system-level design. These advancements, coupled with oneAPI, will accelerate the convergence of HPC and AI. Thanks for reading!

Notices & Disclaimers
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
All product plans and roadmaps are subject to change without notice.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

[1] https://aurora.alcf.anl.gov/