
Hands-On with Intel at Exascale

Rick_Johnson
Employee

Posted on behalf of Aaron Dubrow

Summer workshops organized by Argonne and Intel let researchers fine-tune codes for Aurora.

Argonne National Laboratory and the Intel Center of Excellence (CoE) at Argonne have been hosting workshops for researchers and developers for over six years. With the Aurora supercomputer fully installed, edging closer to deployment, and available for hands-on science, along with the test-and-development Sunspot system, the workshops have taken on new significance.

In June and July, Argonne, the Intel CoE, and developers led a series of workshops and ‘dungeons’ for researchers from the U.S. Department of Energy’s Exascale Computing Project (ECP) and Argonne’s Aurora Early Science Program (ESP) communities. For many, these represented the first opportunity to evaluate their scientific computing codes on the fully deployed nodes of Intel® Max Series CPUs and GPUs.

“We’re doing everything we can to ensure researchers can submit jobs when we give them access to Aurora,” said Chris Knight, Argonne computational scientist, ECP Lead for Applications Integration on Aurora, and co-organizer of the Aurora CoE workshop. “I take it as a very positive sign that the software stack and the system were available, functioning, and usable by these teams with all their diverse applications.”

The Aurora supercomputer at Argonne National Laboratory

 

Mega-Hackathon for Code Readiness

Over the course of a week, nearly 50 teams worked directly with Argonne and Intel experts to iron out the kinks in their workflows and optimize their code performance on Aurora nodes.

“This was a great example of teams coming together,” Knight said. “It was a giant hackathon with all these teams computing on Sunspot and identifying what needs to be done. I was pleased to see the excitement around the system.”

Knight was part of the team testing the LAMMPS community code – a flexible molecular dynamics simulator used by thousands of research teams across materials science, biology, and physics.

Over the past several months, working with performance experts at Argonne and Intel, his team has dramatically improved the performance of LAMMPS on the Intel Max Series GPU. It now runs at 92% of the performance of an Nvidia H100 PCIe card and is getting faster by the week.1

“LAMMPS performance is still a work-in-progress on the individual GPU level,” Knight acknowledged, “but having the code run full scale on Aurora – I think that's something that’s very doable, and there will be some teams that will need to do that.”

Enabling Hybrid AI and Classical Workloads

Professors Aiichiro Nakano and Ken-Ichi Nomura, both from the University of Southern California, are two other researchers who benefited from the workshop. They are developing AI-driven quantum dynamics codes, RXMD-NN and QXMD, which they use for a range of problems, from understanding extremely large-scale and long-time dynamics to identifying new materials that can be used for next-generation quantum or spintronic computer systems.

Their physics-informed neural networks rely on both classical methods – to train the models with expensive quantum simulations – and AI – to replace aspects of the code with surrogate NN models, and to extend the models via reinforcement learning to time scales that are inaccessible with conventional approaches.
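The surrogate idea can be sketched in a deliberately simple form. This is not the actual RXMD-NN/QXMD method: an analytic Morse-like potential stands in for the expensive quantum calculation, and random tanh features with a least-squares readout stand in for full neural-network training. All names and shapes here are illustrative assumptions.

```python
# Hypothetical sketch of a neural-network surrogate: an expensive
# reference calculation generates training data, and a cheap learned
# model reproduces it so the expensive call can be replaced inside a
# dynamics loop. Not the actual RXMD-NN/QXMD code.
import numpy as np

rng = np.random.default_rng(42)

def reference_energy(x):
    """Stand-in for an expensive quantum calculation: a Morse-like potential."""
    return (1.0 - np.exp(-(x - 1.0))) ** 2

# "Expensive" training data, as the quantum code would produce.
X = rng.uniform(0.5, 2.5, size=(200, 1))
y = reference_energy(X)

# Surrogate model: random tanh features plus a closed-form linear readout
# (a minimal, deterministic stand-in for neural-network training).
W = rng.normal(0.0, 2.0, size=(1, 32))
b = rng.uniform(-3.0, 3.0, size=32)

def features(x):
    h = np.tanh(x @ W + b)
    return np.hstack([h, np.ones((len(x), 1))])  # append a bias column

coef, *_ = np.linalg.lstsq(features(X), y, rcond=None)

def surrogate_energy(x):
    """Cheap replacement for reference_energy inside a dynamics loop."""
    return features(x) @ coef

# Check the surrogate against the expensive reference on held-out points.
X_test = np.linspace(0.6, 2.4, 50).reshape(-1, 1)
mse = float(np.mean((surrogate_energy(X_test) - reference_energy(X_test)) ** 2))
print(f"held-out MSE: {mse:.2e}")
```

In a production code the payoff comes from calling the trained surrogate millions of times inside the dynamics loop at a tiny fraction of the quantum calculation's cost.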

“This was the first workshop where we really worked on the Intel GPU architecture line with a bunch of Intel scientists and engineers helping us,” Aiichiro Nakano said. “So that was really productive and useful.”

Their methods require two separate software stacks – one C++-based, the other Python-based. Unifying them for fast performance is a challenge they worked to overcome in the workshop.
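One common pattern for bridging stacks like these – not necessarily the one their team uses – is a foreign-function binding that lets the Python side call into a compiled shared library. In this minimal sketch the system C math library stands in for a compiled simulation kernel; the library name and function are just stand-ins.

```python
# Illustrative only: a ctypes binding as one way to couple a compiled
# C/C++ kernel with a Python stack. The system math library stands in
# for a compiled simulation library here.
import ctypes
import ctypes.util

libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path) if libm_path else ctypes.CDLL(None)

# Declare the C signature so ctypes marshals doubles correctly.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

val = libm.sqrt(2.0)  # a "compiled-side" result, now usable from Python
print(val)
```

Real couplings at this scale typically use generated bindings (pybind11, Cython, or similar) and shared memory rather than raw ctypes, but the boundary being crossed is the same.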

“We want to run at exascale. When Aurora comes online, we’ll be one of the early users,” Nakano said. “Although the entire code is not fully working, we solved so many problems and had great success, and we really appreciate having access to the latest development environment.”

‘Dungeons’ take on machine learning at scale

In July, Denis Boyda, a postdoctoral researcher at the NSF AI Institute for Artificial Intelligence and Fundamental Interactions, led a ‘dungeon’ focused on solving problems in lattice QCD – a method for calculating the strong nuclear force using HPC – and on incorporating machine learning on Aurora.

“We are collaborating closely with both the Intel and ALCF teams to facilitate the integration of machine learning [ML] into scientific simulations on the Aurora platform,” Boyda said. “Running training for large-scale ML models on Aurora's exascale computing power opens new possibilities and transformative applications in particle physics.”

Though large-scale Aurora simulations still require further development and improvement, Boyda says, “Intel's hardware and software are in a commendable state for conducting ML-based simulations. We hold an optimistic outlook on their preparedness for the Aurora platform.”

Additional dungeons in July focused on PHASTA ML, which integrates machine learning with a leading compressible fluid modeling code, and Connectomics ML, an ambitious effort to bring AI into neuroscience to expand the ability to simulate neurons in the brain.

‘But what is the problem that you really want to solve?’

The researchers participating in the workshops are among the most experienced computational scientists in the world. But still, something is different about the Aurora opportunity, according to Knight.

“We work with many teams to help them be successful and reach their goals. In many cases, we get to have that conversation with them along the lines of: ‘You solved the science problem that you had set out to solve, but what is the problem that you really want to solve? What were the approximations that you didn't want to make? What is the physics that you didn't get to include, or the time scales that you couldn't reach? And how can we help you get there with a big machine like Aurora?’”

The road to exascale computing has required patience and fortitude. Michael D’Mello, who leads the Intel CoE and helps prepare the community for the system, has been working on the Aurora project for over a decade – much of that time before physical hardware was available.

“In 2019, we organized a class on OpenCL and taught concepts for how to program on a GPU, long before we had a Ponte Vecchio GPU to work on,” D’Mello said. “We took SYCL [a heterogeneous programming model based on open standards and C++] to the Aurora community before people at Intel had even heard of it. And at SC19, we had more to say about Aurora than people wanted to hear, even though there was no Ponte Vecchio and Intel was just announcing oneAPI. All this happened pre-release, and it’s what was needed to develop this new paradigm and prepare the community.

“It’s incredibly gratifying now to have the system hardware in place and researchers using it to solve their most challenging problems.”

Read about Aurora HPC- and AI-enabled science, participate in training and events, and stay tuned to Intel and Argonne news for more milestones on the path to exascale.

--

1 Visit the International Supercomputing Conference (ISC’23) page on intel.com/performanceindex for workloads and configurations. Results may vary. 

 

Header image: 

Simulation of light control of ferroelectric topological structures using Neural Network Quantum Molecular Dynamics (NNQMD).

Credit: Thomas Linker, Stanford, Collaboratory for Advanced Computing and Simulations, University of Southern California; Ken-ichi Nomura, Collaboratory for Advanced Computing and Simulations, University of Southern California

 

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

 

 

 
