
Boost the Performance of AI/ML Applications using Intel® VTune™ Profiler


Authors:

Nikita Shiledarbaxi, Software Technical Marketing Engineer, Intel

Rob Mueller-Albrecht, Software Tools Marketing Manager, Intel

Learn how the oneAPI-powered tool helps profile Data Parallel Python* and OpenVINO™ workloads 

 

The applications of AI and Machine Learning (ML) continue to expand across healthcare and life sciences, marketing and finance, manufacturing, robotics, autonomous vehicles, smart cities, and several other industries. Deep learning frameworks such as PyTorch*, TensorFlow*, Keras*, and others form the backbone of the ML workloads employed in these real-world domains.

Other developer resources like the OpenVINO™ Toolkit also help accelerate AI development on the latest hardware architectures in areas such as Computer Vision and Generative AI (GenAI) through a ‘write once, deploy anywhere’ strategy. Since its inception in 2018, the open source OpenVINO Toolkit has focused on accelerating AI inference with lower latency and higher throughput while maintaining accuracy, reducing model footprint, and optimizing hardware use.

The complex structure of deep learning models, with their non-linear functions and multiple layers, makes it difficult to identify and analyze performance bottlenecks in the underlying source code. ML frameworks such as PyTorch and TensorFlow provide profiling APIs and native tools for measuring and analyzing performance metrics during different stages of model development. However, the scope of these methods is limited to software functionality. The oneAPI-powered Intel® VTune™ Profiler addresses this challenge by providing deep insights into computational and memory bottlenecks at the hardware level. These insights help you fix performance issues and then optimize and scale AI applications across hardware platforms with varying computational envelopes.

In this blog, you will learn how Intel VTune Profiler can help profile data parallel Python and OpenVINO applications, increasing the scope of optimization for AI/ML workloads.

 

Enhance the Performance of Python* Applications with Intel® VTune™ Profiler

 

A newly released recipe in the Intel VTune Profiler Cookbook illustrates how VTune Profiler can help profile a Python application. Let us dive deeper into the example of pairwise distance calculation using the NumPy* library discussed in the cookbook.

The basic software requirements of the recipe include:

Note: For detailed hardware and software requirements, please consult the recipe configuration details.

The NumPy implementation discussed in the recipe uses the Intel® oneAPI Math Kernel Library (oneMKL) routines for distance computations and the Intel® Instrumentation and Tracing Technology (ITT) APIs for dividing the computations into logical tasks. VTune Profiler can then analyze the execution time and memory consumption of each logical task, helping you decide which parts of the code to modify for better performance.
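To make the idea of logical tasks concrete, here is a minimal sketch of a pairwise distance calculation split into named ITT tasks. It is not the cookbook's code: the task names and the function layout are illustrative, and it assumes the `ittapi` Python package exposes `ittapi.task(name)` as a context manager. If `ittapi` is not installed, the sketch falls back to a no-op so it still runs.

```python
import contextlib

import numpy as np

try:
    import ittapi  # assumed: Python bindings for the ITT APIs
    task = ittapi.task
except ImportError:
    # Fallback so the sketch still runs without ittapi installed.
    def task(name):
        return contextlib.nullcontext()


def pairwise_distance(data):
    """Euclidean distance between every pair of rows in `data`."""
    with task("pairwise_sum"):
        sq_sum = np.sum(np.square(data), axis=1)
    with task("pairwise_dot"):
        # Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
        dist = -2.0 * np.dot(data, data.T)
    with task("pairwise_add"):
        dist += sq_sum[:, None] + sq_sum[None, :]
    with task("pairwise_sqrt"):
        # Clamp tiny negatives caused by floating-point rounding.
        np.maximum(dist, 0.0, out=dist)
        np.sqrt(dist, out=dist)
    return dist


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.random((1000, 3))
    print(pairwise_distance(points).shape)
```

Each `with task(...)` block would show up as a separate task in the profiler's results, which is what lets you attribute time to one stage of the computation rather than to the function as a whole.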

When you run the Hotspots analysis on the NumPy implementation, the analysis report details the code sections that consume the most CPU time. It also suggests other performance analysis capabilities of the profiler to explore, such as Threading analysis for increased parallelism and Microarchitecture Exploration analysis for efficient use of the underlying hardware.
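If you prefer the command line, the three analyses mentioned above can be launched with the `vtune` CLI. The helper below is a small sketch that only builds and prints the commands; the script name `pairwise_distance.py` and the result directory names are hypothetical, and the recipe may use additional options that are not shown here.

```python
import shlex


def vtune_collect_cmd(analysis, script, result_dir):
    """Build a `vtune` command line that profiles a Python script."""
    return [
        "vtune",
        "-collect", analysis,
        "-result-dir", result_dir,
        "--", "python", script,
    ]


# Hypothetical script name; one result directory per analysis type.
for analysis in ("hotspots", "threading", "uarch-exploration"):
    cmd = vtune_collect_cmd(analysis, "pairwise_distance.py", f"r_{analysis}")
    print(shlex.join(cmd))
```

Keeping a separate result directory per analysis makes it easier to compare the reports side by side.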

 

Fix Performance Bottlenecks with Data Parallel Extension for NumPy* and Numba*

 

In the basic NumPy implementation of the pairwise distance calculation example, NumPy operations and underlying oneMKL functions consume a large proportion of the total execution time, as per the Hotspots analysis report. These bottlenecks can be resolved by replacing NumPy with the Data Parallel Extension for NumPy through minor code changes. Run the Hotspots analysis again to see the performance improvements over the basic NumPy code and identify any room for further optimization.
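The "minor code changes" largely come down to swapping the array module. The sketch below is an illustration rather than the recipe's code: it assumes the `dpnp` package provides NumPy-compatible `sum`, `square`, `dot`, `reshape`, `maximum`, and `sqrt` functions, and it falls back to NumPy when `dpnp` is not installed so that it runs anywhere.

```python
try:
    import dpnp as xp  # Data Parallel Extension for NumPy, if installed
except ImportError:
    import numpy as xp  # fallback so the sketch runs without dpnp


def pairwise_distance(data):
    """Euclidean distance between every pair of rows in `data`."""
    sq_sum = xp.sum(xp.square(data), axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
    dist = (
        xp.reshape(sq_sum, (-1, 1))
        + xp.reshape(sq_sum, (1, -1))
        - 2.0 * xp.dot(data, data.T)
    )
    # Clamp tiny negatives from rounding before taking the square root.
    return xp.sqrt(xp.maximum(dist, 0.0))


points = xp.asarray([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(pairwise_distance(points))
```

Because the function body only refers to `xp`, the same code can be profiled once with NumPy and once with the Data Parallel Extension for NumPy, which keeps the before-and-after Hotspots comparison fair.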

VTune Profiler also suggests adding offload accelerator parallelism to the application, for example by using the Data Parallel Extension for Numba* with your platform’s GPU. The Data Parallel Extension for Numba is an open-source extension of the Numba JIT compiler for NumPy operations. It provides SYCL*-like APIs for kernel programming in Python. You can then analyze the GPU execution of the Numba implementation with the GPU Compute/Media Hotspots analysis preview feature of VTune Profiler.

 

For more details, check out the 'Profiling Data Parallel Python Applications' recipe in the VTune Profiler cookbook.

Note: The sample code discussed in the cookbook is for illustration purposes. You can similarly leverage the VTune Profiler tool to analyze and optimize the performance of any Python application of your choice.

 

Analyze the Performance of OpenVINO™ Applications with Intel® VTune™ Profiler

 

Another new recipe in the VTune Profiler cookbook covers profiling OpenVINO-based AI applications. It explains how to analyze CPU, GPU, and Neural Processing Unit (NPU) performance bottlenecks using the profiler.

Note: For hardware and device driver requirements, please consult the recipe configuration details.

The recipe provides step-by-step instructions to set up OpenVINO, build the OpenVINO source and configure OpenVINO with the ITT APIs for performance analysis. It uses a reference benchmark application for analyzing latency and throughput while profiling the AI application.

Based on the compute architecture used, you can utilize various performance analysis functionalities of the VTune Profiler to identify hotspots and examine the hardware usage by different code sections. For instance,

  • Use the Hotspots Analysis feature for analyzing CPU bottlenecks, i.e., the code parts that consume the highest amount of CPU execution time.

  • Profile GPU hotspots using the preview feature GPU Compute/Media Hotspots Analysis. Understand GPU utilization by exploring inefficient kernel algorithms, analyzing GPU instruction frequency for different types of instructions, and more.

  • The Neural Processing Units (NPUs) in AI PCs are specifically designed to achieve performance improvements in AI/ML applications. You can offload compute-intensive AI/ML workloads to Intel® NPUs using the Intel® Distribution of OpenVINO™ Toolkit. The NPU Exploration Analysis preview feature of the VTune Profiler can help you analyze the NPU performance based on various hardware metrics such as workload size, execution time, sampling interval, and more.
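The device-to-analysis mapping in the list above can be sketched as a small command builder. This is an illustration under stated assumptions, not the recipe's exact invocation: it assumes the `vtune` analysis type names `hotspots`, `gpu-hotspots`, and `npu`, uses OpenVINO's `benchmark_app` with its `-m` (model) and `-d` (device) options as the profiled application, and the model path and result directory names are hypothetical.

```python
# Assumed mapping from OpenVINO device name to VTune analysis type.
ANALYSIS_FOR_DEVICE = {
    "CPU": "hotspots",
    "GPU": "gpu-hotspots",
    "NPU": "npu",
}


def profile_benchmark_cmd(device, model_path):
    """Build a `vtune` command that profiles benchmark_app on one device."""
    device = device.upper()
    analysis = ANALYSIS_FOR_DEVICE[device]
    return [
        "vtune",
        "-collect", analysis,
        "-result-dir", f"r_{device.lower()}",
        "--", "benchmark_app",
        "-m", model_path,
        "-d", device,
    ]


# Hypothetical model path, for illustration only.
for dev in ANALYSIS_FOR_DEVICE:
    print(" ".join(profile_benchmark_cmd(dev, "model.xml")))
```

An unsupported device name raises a `KeyError`, which is a deliberate choice here: it is better to fail early than to collect data with an analysis type that does not match the hardware.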

 

Refer to the 'Profiling OpenVINO Applications' recipe in the VTune Profiler cookbook for detailed information on leveraging the oneAPI-powered tool for profiling AI/ML applications of your choice.

 

What's Next?

 

In addition to the performance bottleneck profiling and analysis features described in the VTune Profiler cookbook recipes discussed in this blog, the tool enables memory consumption and allocation analysis, I/O performance analysis, HPC performance characterization analysis, and much more. Get started with VTune Profiler today!

We encourage you to check out other AI, HPC, and rendering tools in our oneAPI-powered software portfolio.

 

Get The Software 

 

The Intel VTune Profiler is available as a part of the Intel® oneAPI Base Toolkit. You can also download a standalone version of the tool. 

 

 

 

About the Author
Technical Software Product Marketing Engineer, Intel