Posted on behalf of Tim Mattson, Intel Senior Principal Engineer
New programmers, the young minds who have entered high-performance computing (HPC) over the past decade, are changing the field of HPC and scientific computing, and with it HPC hardware and software design. This trend became apparent in 2016, when the ACM Gordon Bell Prize competition included a startling finalist: PyFR, an application written in Python. The inclusion of a Python application in this prestigious HPC competition demonstrated that the Python ecosystem (which is rapidly evolving and contains native code generation tools plus native C/C++ and Fortran libraries) had matured to the point where its productivity and performance make it an attractive choice for HPC applications. An accomplished team of young programmers proved the point by modeling unsteady turbulent flows, at scale, on the world’s fastest supercomputers using the Python language and ecosystem.[1] Fast forward from that 2016 accomplishment, and Python is now considered a top language in computational science and engineering.[2]
Many programmers of the current generation never step outside the Python ecosystem to create their HPC applications. They are a preview of what the programmer of the future will look like.
The Python software ecosystem now provides high-performance access to both CPU and GPU capabilities and to most major numerical libraries, including those that are exascale-capable; examples include PETSc, SLATE, Ginkgo, heFFTe, SUNDIALS, hypre, and more. Similarly, Python is the mainstay of most data scientists: nearly all machine-learning applications are now written inside Jupyter notebooks using the popular PyTorch, JAX, and TensorFlow packages as well as classic data science libraries such as NumPy, SciPy, pandas, scikit-learn, and XGBoost. These packages enable the convergence of HPC and AI, which many (perhaps most) young programmers achieve through popular Python packages and numerical libraries. As demonstrated in the 2016 ACM Gordon Bell competition, and repeatedly since, Python is no longer a barrier to high performance. Meanwhile, interfaces to storage libraries such as HDF5 and, of course, MPI give access to the high-performance storage and network capabilities of a modern HPC system.
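The reason these libraries deliver native-class performance is that a single Python call dispatches into compiled C/C++ or Fortran code. As a small illustrative sketch (not from the article, and with sizes chosen arbitrarily), compare an interpreted Python loop with the equivalent vectorized NumPy call:

```python
import numpy as np

def dot_pure_python(a, b):
    """Reference dot product in interpreted Python: one bytecode
    dispatch per element, so it runs orders of magnitude slower."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

n = 10_000
rng = np.random.default_rng(42)
a = rng.random(n)
b = rng.random(n)

# The vectorized call executes entirely in compiled native code
# (typically a BLAS routine), which is why NumPy-based HPC codes
# can approach C/Fortran performance.
fast = float(a @ b)
slow = dot_pure_python(a.tolist(), b.tolist())

assert abs(fast - slow) < 1e-6
```

The same pattern, a thin Python layer over native kernels, is what the exascale-capable libraries listed above expose through their Python bindings.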
Vendors Must Adapt
The HPC community is now committed to a future based on a hierarchical, heterogeneous, distributed computing architecture. Of course, this is a journey and not a destination, as machine architectures are rapidly evolving. Python programmers, as well as native C++ and Fortran programmers, must be supported on this journey.
This rapid change in system architectures, including new capabilities in device instruction set architectures (ISAs), combined with the complexity of heterogeneous, distributed computing environments, is forcing vendors to adapt because people cannot keep up. Vendors must provide access to the latest generation of hardware capabilities while preserving support for previous-generation systems. Innovative approaches that map each software component to the most performant hardware device in a heterogeneous environment, be it a CPU core or a GPU, are becoming mandatory. These software capabilities are a requirement for the industry, not an option. Heterogeneous distributed computing is here to stay, so hardware vendors must adapt. Intel's recent launch of the 4th Gen Intel® Xeon® Scalable processor (CPU), the Intel® Xeon® CPU Max Series, and the Intel® Data Center GPU Max Series demonstrates how rapidly hardware technology is progressing, which further emphasizes the importance of software.
Programmers Want to Use Python for the Entire Workflow
This is why Intel is working on libraries of distributed data structures. We want to embed all the hierarchical, distributed, and heterogeneous complexity inside one or more libraries of distributed data structures. The beauty of this approach is that the library abstraction supports both oneAPI programmers and Python programmers as they tackle the challenges of heterogeneous distributed computing. The astute reader will note that a library-based approach fits perfectly with the Python ecosystem.
Collaborations and new thinking go hand in hand to make such libraries possible. We are working within the GraphBLAS Forum to define application programming interfaces (APIs) for graph algorithms expressed in terms of linear algebra over sparse matrices.[3] These approaches are used, for example, by the ExaBiome team at UC Berkeley to speed their many-against-many protein similarity searches to Gordon Bell levels of performance.[4] The approach is general: a graph operation such as Single Source Shortest Path (SSSP) can be expressed as a matrix multiply in which the multiplication step adds edge weights and the results are merged with a MIN() operation. This trick uses algebraic semirings, which give us a way to swap the operators. You can literally watch the wave of activity propagate through the matrix!
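To make the semiring idea concrete, here is a minimal pure-Python sketch (not GraphBLAS itself, and using a toy graph of my own choosing) of SSSP as repeated matrix-vector products over the (min, +) semiring, where multiply becomes + and add becomes min:

```python
import math

def sssp_minplus(adj, src):
    """Single Source Shortest Path via repeated (min, +) semiring
    matrix-vector products (a Bellman-Ford relaxation in disguise).

    adj[i][j] is the weight of edge i -> j, or math.inf if absent.
    """
    n = len(adj)
    dist = [math.inf] * n
    dist[src] = 0.0
    for _ in range(n - 1):                 # at most n-1 relaxation rounds
        # One semiring "matvec": multiply is replaced by +, add by min.
        new = [min(dist[j], *(dist[i] + adj[i][j] for i in range(n)))
               for j in range(n)]
        if new == dist:                    # the wavefront stopped propagating
            break
        dist = new
    return dist

INF = math.inf
graph = [
    [INF, 4.0, 1.0, INF],
    [INF, INF, INF, 1.0],
    [INF, 2.0, INF, 5.0],
    [INF, INF, INF, INF],
]
print(sssp_minplus(graph, 0))  # -> [0.0, 3.0, 1.0, 4.0]
```

Each iteration extends the frontier of reachable vertices by one hop, which is exactly the propagating "wave of activity" described above; a real GraphBLAS implementation performs the same operation on sparse matrices with tuned native kernels.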
Meanwhile, we want to increase the parallelism of Python applications throughout the entire workflow. We believe incorporating OpenMP into the Python programming model will be a real benefit. Our research project differs from the current approach, which embeds OpenMP pragmas in the source code of each and every library. Instead, we want to incorporate OpenMP into Python itself so programmers can naturally express the parallelism of their problem. Think in terms of a high-level Python program that uses Numba (a Python just-in-time compiler) to map Python code onto LLVM, from which we can use the LLVM OpenMP runtime to exploit multithreaded parallelism, all with minimal programmer effort. This is an exciting project with the potential to realize extremely high performance. My personal target is 80% of native, compiled-code performance.
What is Next?
If you look at the future of programming, it is important to recognize that we are clearly ensconced in, and moving deeper into, the age of hierarchical, heterogeneous, distributed computing. People cannot keep up with the rate of hardware innovation and the ensuing complexity. This places the burden on software.
Performance results demonstrate that software innovations, particularly in middleware numerical libraries, are bridging the gap so programmers can realize extreme HPC performance. These libraries are callable from Python. Performance-portable approaches such as oneAPI, together with the work of the software teams behind most major scientific libraries, are connecting Python and native-code programmers with new accelerators, extended ISAs, and other innovations.
Intel is excited to be part of this transition into the age of hierarchical, heterogeneous, distributed computing. Even better, we see that our current approaches will support the next generation of scientific programmers, who will use Python as their primary programming language.
But there is more.
Thinking far outside the box, we believe we have to move beyond the LLVM toolchain, compilers, and libraries. For this reason, we at Intel are looking into machine programming (where the machine writes the program) and its exciting potential to address the complexities of hierarchical, heterogeneous, distributed computing. Machine programming seems like a fantastic way to confront many performance issues associated with distributed data structures, including NUMA effects, MPI communication, and the other work needed to connect data with the best computing device. To better understand what this means, please see our position paper, The Three Pillars of Machine Programming.
Learn more about Intel’s contribution to open standards and oneAPI here.
[1] http://sc16.supercomputing.org/presentation/index-id=gb101&sess=sess174.html
[2] https://www.computer.org/csdl/magazine/cs/2021/04/09500091/1vBD3mNgmU8
[3] https://people.eecs.berkeley.edu/~aydin/HipMCL_PreExascale-IPDPS20.pdf
[4] https://www.exascaleproject.org/highlight/exabiome-gordon-bell-finalist-research-infers-the-functions-of-related-proteins-with-exascale-capable-homology-search/
Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.