Extreme fine-grained parallelism on modern many-core architectures using oneAPI

Anshul_Gupta · 07-05-2023

Background

I am Poornima Nookala, a Software Research Engineer/Scientist at Intel. I joined Intel in January 2023 after completing my PhD in Computer Science at the Illinois Institute of Technology. Prior to my PhD, I worked in industry for 8 years as a software engineer. My research interests include parallel programming models and runtime systems for extreme-scale supercomputing systems, computer architecture, cloud computing systems, and big-data computing. I am particularly interested in bridging the gap between the software and hardware layers to enable both functionality and performance, and in questioning the assumptions made by the software stacks we use today in a rapidly evolving hardware landscape.

What is it like being an Intel Student Ambassador for oneAPI?

It was my privilege to be one of the first oneAPI Student Ambassadors. I was referred to the program by my PhD advisor, who encouraged me to apply. As a oneAPI Student Ambassador, I received excellent technical support through Intel's training materials during my one-year term. I was able to interact regularly with fellow Student Ambassadors from all over the world and learn about the exciting work they are doing with oneAPI.

My Project

Processors with hundreds of threads of execution and GPUs with thousands of cores are among the state of the art in high-end computing systems. This transition to many-core computing has required the community to develop novel algorithms that overcome significant latency bottlenecks through massive concurrency. However, implementing efficient parallel runtimes that can scale up to hundreds of threads with extremely fine-grained tasks remains a challenge. My PhD dissertation, titled “Extreme fine-grained parallelism on modern many-core architectures”, explored task-based parallel programming in shared-memory and distributed-memory environments. The figure below shows the main components of a task-parallel runtime system: the scheduler and the task queues. I presented approaches to achieve lightweight tasking by reducing synchronization overheads in today’s shared-memory parallel runtime systems, and designed X-OpenMP as a prototype solution. This work showed that it is possible to implement a parallel execution model using lock-less techniques that enables applications to strongly scale on many-core architectures. I also contributed to the development of Template Task Graph (TTG), a new flow graph programming model for high-performance algorithms that run on distributed heterogeneous computer platforms.

How did Intel oneAPI Tools help?

I have been using the oneAPI tools since the beginning of my PhD project. A large part of my work during my PhD was understanding the bottlenecks in existing parallel runtimes that arise from the overheads of task management and synchronization. The Intel® VTune™ Profiler has been instrumental in identifying performance issues and areas for improvement in existing implementations of parallel runtime libraries such as LLVM OpenMP and GNU OpenMP. Intel VTune can run various types of analysis to identify bottlenecks in threading, memory access, and synchronization, and it also reports low-level hardware metrics. The VTune GUI is rich and intuitive, and it helped me quantify the overheads caused by load imbalance between threads, synchronization, and cache memory accesses. The figure below is a timeline plot (threads on the y-axis) obtained from Intel VTune; it shows load imbalance between threads, where some threads are busy doing work (green) and others are idle (black). Such insights from Intel VTune helped me develop the core idea behind my PhD thesis. We call it XQueue, a lockless queueing mechanism that can scale up to hundreds of threads of execution.
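To give a flavor of what "lockless queueing" means in practice, here is a minimal sketch of a bounded single-producer/single-consumer ring buffer in C++17. It is not the XQueue implementation, whose design is not described in this post. The class name `SpscQueue` and its methods are hypothetical and chosen for illustration only. The sketch assumes exactly one thread calls `try_push` and exactly one thread calls `try_pop`; under that assumption the two sides coordinate through two atomic indices and never take a lock.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical illustration only: a bounded single-producer/single-consumer
// (SPSC) ring buffer. This is NOT the XQueue implementation.
template <typename T>
class SpscQueue {
public:
    // One extra slot is kept empty so that "full" and "empty" are distinguishable.
    explicit SpscQueue(std::size_t capacity)
        : slots_(capacity + 1), head_(0), tail_(0) {}

    // Must be called by the single producer thread only.
    // Returns false (without blocking) if the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % slots_.size();
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        slots_[tail] = value;
        // Release: the slot write above becomes visible before the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Must be called by the single consumer thread only.
    // Returns std::nullopt (without blocking) if the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return std::nullopt;  // empty
        }
        T value = slots_[head];
        // Release: the slot read above completes before the slot is handed back.
        head_.store((head + 1) % slots_.size(), std::memory_order_release);
        return value;
    }

private:
    std::vector<T> slots_;
    std::atomic<std::size_t> head_;  // next slot to read; written by the consumer
    std::atomic<std::size_t> tail_;  // next slot to write; written by the producer
};
```

A task runtime could, for example, give each pair of communicating worker threads its own queue of this kind, so that no queue ever has more than one writer or one reader. Whether that matches the structure used in XQueue is an assumption on my part for this sketch.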

Intel Advisor is another useful tool in the oneAPI toolkit that helps identify bottlenecks at the function level. It was very helpful in pinpointing which lines of code were causing performance bottlenecks, and it also suggested alternative approaches to improve performance. I also had the opportunity to attend several webinars on SYCL and DPC++ to learn about GPU programming, which gave me the knowledge required for the portion of my PhD work involving GPUs.

Overall, Intel oneAPI tools have been extremely useful during my PhD. I really appreciate the training and resources I received from Intel, which helped me understand these tools much better and apply them in my research.

Conclusion

When I quit my six-figure job to pursue a PhD, my goal was to become a Research Scientist and make significant contributions to the field of Computer Science. While I was considering whether to work in academia, industry, or national labs, I met Tim Mattson from Intel. His life journey inspired me to pursue a career at a company like Intel. I was also fortunate to receive funding from Intel for a portion of my PhD study. It was through this PhD work and my exposure to Intel oneAPI as a Student Ambassador that I learned about Intel's incredible work. In the last semester of my PhD, I applied for a position at Intel and received an offer. Currently, I work as a Software Research Engineer/Scientist in LTD's Design & Patterning AI (DPAI) group. I am looking forward to a bright future at Intel and to making a difference in the world of Computer Science.