
Maximize Possibilities with Exascale Computing

Rick_Johnson

The Aurora exascale supercomputer at Argonne National Laboratory, built in partnership with Intel and HPE with the support of the U.S. Department of Energy (DOE), will deliver unprecedented capability for researchers across the sciences. At SC22, Jeff McVeigh, corporate vice president, interim general manager of the Accelerated Computing Systems and Graphics Group, and general manager of the Super Compute Group, spoke with Rick Stevens, Associate Laboratory Director at Argonne and Computer Science Professor at the University of Chicago. Rick described how the new system will accelerate the convergence of High Performance Computing (HPC), Artificial Intelligence (AI), and High Performance Data Analytics (HPDA), and use new hardware, software, and storage technologies to maximize discovery.

 

Rick Stevens and Jeff McVeigh

Jeff: Who is Aurora designed for?

Rick: We’re building Aurora because we need more compute power for science and applications. We’ll be in testing mode for a while, but the first users will come from the science community at Argonne, with applications ranging from drug design and machine learning to materials science and climate models.

The second group of users comes from the Exascale Computing Project, where more than 25 science applications have been built out over the last six or seven years. We want to use this machine to do things like make very high-resolution climate models, so we can understand the impacts of climate change at the neighborhood level. That way businesses and governments can plan for those impacts. We need safer and more energy-dense batteries for electric cars, and we need them to be made out of materials that are easy to acquire. Aurora will help us build powerful models for this kind of materials design.

Jeff: There’s a lot of talk about the convergence of traditional HPC and AI codes. How will this apply to Aurora?

Rick: There are multiple ways in which AI and simulation will be coupled on Aurora. We’ll be running a lot of pure AI applications as well as ones that have traditional modeling and simulation coupled with AI components. There will also be workflows that have AI looping over and controlling applications in many different configurations.

One powerful class of AI-enabled supercomputing simulations is what we call AI surrogate models. With these, you take the computational kernel inside a simulation and train a deep learning model that essentially replaces that function. You can then use that model in inference mode to, say, screen millions or hundreds of millions of materials compositions to identify promising starting points for analysis. That will be much faster than using the original code.
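
To make that workflow concrete, here is a minimal, hypothetical sketch of the surrogate-model idea in PyTorch. It is not Aurora code: the expensive_kernel function, the 8-dimensional "composition" features, and the network size are illustrative assumptions standing in for a real simulation and its inputs.

```python
# Hedged sketch of the surrogate-model workflow described above (not Aurora code).
import torch
import torch.nn as nn

def expensive_kernel(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for the costly simulation kernel the surrogate will replace.
    return (x ** 2).sum(dim=1, keepdim=True) + torch.sin(x).sum(dim=1, keepdim=True)

# 1) Generate training data by running the expensive kernel on a modest sample.
train_x = torch.rand(10_000, 8)
train_y = expensive_kernel(train_x)

# 2) Train a small network to approximate the kernel (the "surrogate").
surrogate = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, 1))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(surrogate(train_x), train_y)
    loss.backward()
    opt.step()

# 3) Use the surrogate in inference mode to screen a huge candidate pool cheaply,
#    then hand only the most promising candidates back to the full simulation.
with torch.no_grad():
    candidates = torch.rand(1_000_000, 8)
    scores = surrogate(candidates).squeeze(1)
    shortlist = candidates[scores.topk(100).indices]
```

The surrogate only needs to be accurate enough to rank candidates; the shortlist it produces still goes back through the original, full-fidelity code.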

While Aurora is going to be an excellent simulation platform, it will also be great for doing AI. It has a very powerful I/O system and very large memory. With these capabilities, it will be perfect for training large language models and large vision models.

Jeff: What were the key features of the Intel® Data Center GPU Max Series and the Intel® Xeon® Processor Max Series that made them important to use in Aurora?

Rick: The Intel Data Center GPU Max Series as an engine is very well balanced in terms of the capability it has for traditional double precision workloads, as well as very efficient matrix capabilities for machine learning. It’s a great system building block, especially when it’s paired up with the Intel Xeon Processor Max Series. And with high bandwidth memory packaged alongside these processors, it makes for an amazing general purpose computing platform. We’re really looking forward to having our many thousands of nodes installed and up and running.

Jeff: What about debugging?

Rick: We have Aurora’s test and development system at Argonne that we affectionately call Sunspot, with 128 nodes. That will be the first opportunity for most users in terms of getting code support and some early performance data. We’re going to use that to build a rich environment that will be deployed across the whole system. Another big milestone is going to be getting the last Aurora node board installed.

Jeff: Can you talk about usability, software maturity, and things that we still need to work on?

Rick: The primary programming model for this machine is the Intel® oneAPI software stack. It’s a very rich software stack that has been out for a couple of years, so it’s getting pretty mature. It lets people develop across the whole collection of resources. That’s the primary target we want people to develop towards.

In the Exascale Computing Project, we have quite a range of system software, libraries, packages, tools, and models that are also part of the Aurora software stack and sit on top of this infrastructure. There are lots of different ways to get onto the machine in terms of porting applications.
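
As one illustrative on-ramp, the sketch below shows how a Python AI workload might target an Intel GPU through libraries built on top of the oneAPI stack. It assumes Intel® Extension for PyTorch* is installed and exposes an "xpu" device; treat it as a hedged example of the general idea, not a prescription for how Aurora applications are written.

```python
# Hedged sketch: running a PyTorch model on an Intel GPU via the oneAPI-based
# Intel Extension for PyTorch (assumed installed). Device names and availability
# depend on the local software installation.
import torch
import intel_extension_for_pytorch as ipex  # registers oneAPI-backed optimizations

model = torch.nn.Linear(1024, 1024).eval()
data = torch.rand(64, 1024)

if hasattr(torch, "xpu") and torch.xpu.is_available():  # Intel GPU visible to the runtime
    model = model.to("xpu")
    data = data.to("xpu")
    model = ipex.optimize(model)  # apply inference-time operator optimizations

with torch.no_grad():
    out = model(data)
print(out.shape)
```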

Jeff: How do you envision the DAOS environment helping deliver the value of the work?

Rick: It’s important to have a very powerful I/O system that is performance-matched to the compute capability and primary memory capacity of Aurora, which is quite large. So early on we decided to go with DAOS, for a few reasons. One is that it has a mode where it’s not POSIX, and without using POSIX you get a lot more performance. DAOS allows us to break through that file system performance barrier.

We hope that we’re going to get about an order of magnitude more performance on DAOS than with a regular file system. That’s important for checkpoint restart and for huge data sets for machine learning, and it’s also important for I/O normalizing for simulations. We have a very large solid state storage system that’s behind the data servers, which supports hundreds and hundreds of petabytes of fast storage.
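
As a rough illustration of why that order-of-magnitude gain matters for checkpoint restart, here is a back-of-envelope sketch. The state size and bandwidth figures are assumptions chosen for the arithmetic, not Aurora or DAOS specifications.

```python
# Back-of-envelope sketch (not DAOS code): how checkpoint time scales with I/O bandwidth.
checkpoint_bytes = 100e12   # assume ~100 TB of application state to write out
baseline_bw = 1e12          # assume ~1 TB/s from a conventional parallel file system
daos_bw = 10 * baseline_bw  # the hoped-for ~10x improvement mentioned above

for label, bw in [("baseline file system", baseline_bw), ("DAOS (~10x hoped for)", daos_bw)]:
    print(f"{label}: {checkpoint_bytes / bw / 60:.1f} minutes per checkpoint")
```

The less time a job spends writing checkpoints, the more of its allocation goes to actual simulation or training.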

Jeff: What are you most excited about?

Rick: The main thing we’re excited about is the fact that Aurora is on the floor. It’s getting closer and closer to being a compute resource for the community. There are more nodes installed every day.

It’s also a massive machine, and honestly quite photogenic, with colored coolant tubes and wires connecting the switches and the 80,000 endpoints. Everybody who comes and sees it is just kind of blown away.

Note: This interview has been edited for length and clarity. The public video with the full interview can be found here.

 

Notices and Disclaimers

 

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary. 

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.