Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Can VTune 2019 profile Python code with Numba decorators?

guo__jian
Beginner

I am using VTune to profile some Python code in order to measure GFLOPS.

The baseline code is written with NumPy and the optimized code uses Numba (with the @njit and @vectorize decorators). The Numba code is about 8 times faster than the NumPy baseline; however, VTune reports the same GFLOPS for both the NumPy and the Numba versions.

I just want to confirm: can the latest VTune report GFLOPS correctly for Numba Python code?

Is there any benchmark or example code for profiling Python/Numba with VTune?
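
For reference, here is roughly the kind of code I am comparing (a simplified, hypothetical sketch rather than my actual kernels):

```python
# Hypothetical example of the kind of kernels being compared under VTune.
import numpy as np
from numba import njit, vectorize

# NumPy baseline: each operation produces an intermediate array.
def numpy_kernel(a, b):
    return np.sqrt(a * a + b * b) + a * b

# Numba @njit version: the whole expression is compiled into one loop.
@njit
def njit_kernel(a, b):
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = np.sqrt(a[i] * a[i] + b[i] * b[i]) + a[i] * b[i]
    return out

# Numba @vectorize version: the element-wise expression becomes a compiled ufunc.
@vectorize(["float64(float64, float64)"])
def vectorized_kernel(a, b):
    return np.sqrt(a * a + b * b) + a * b

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
numpy_kernel(a, b)
njit_kernel(a, b)        # the first call also includes JIT compilation time
vectorized_kernel(a, b)
```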

 

Thanks and regards

guo__jian
Beginner

Reply for testing

Anton_M_Intel
Employee

The coming Update 1 release of Parallel Studio will contain better support for Numba profiling, but that is mainly related to how Numba code is displayed and referred to in VTune. I'll leave it to others to comment on how the GFLOPS metric works in VTune, but I can explain the performance difference between Numba and NumPy. Numba fuses all of the vectorized operations into a single loop over the data, so it does not need to store intermediate results to memory and read them back for the next operation, which NumPy usually does. So both work at the same rate with memory, but Numba is much more efficient with respect to the number of memory operations it performs.
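
To make that concrete, here is a rough sketch (a simplified, hypothetical example, not from any particular benchmark) of the difference:

```python
import numpy as np
from numba import njit

# NumPy: every intermediate result below is a full temporary array, so the
# data makes several round trips to memory.
def numpy_version(a, b):
    t1 = a * b           # temporary array written to memory
    t2 = t1 + a          # read t1 back, write another temporary
    return np.sin(t2)    # read t2 back, write the result

# Numba: the same arithmetic is fused into one loop, so each element is
# loaded once, processed in registers, and stored once.
@njit
def numba_version(a, b):
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = np.sin(a[i] * b[i] + a[i])
    return out

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
assert np.allclose(numpy_version(a, b), numba_version(a, b))
```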

guo__jian
Beginner

Anton Malakhov (Intel) wrote:

The coming Update 1 release of Parallel Studio will contain better support for Numba profiling, but that is mainly related to how Numba code is displayed and referred to in VTune. I'll leave it to others to comment on how the GFLOPS metric works in VTune, but I can explain the performance difference between Numba and NumPy. Numba fuses all of the vectorized operations into a single loop over the data, so it does not need to store intermediate results to memory and read them back for the next operation, which NumPy usually does. So both work at the same rate with memory, but Numba is much more efficient with respect to the number of memory operations it performs.

Thanks very much for your reply.

I have another question about profiling Python (NumPy-based) code with VTune. When VTune calculates and reports GFLOPS, does it count both the kernel computing time and the API call time, or just the kernel computing time? Also, can VTune report the total number of floating-point operations in a code? (I think the FLOP count should be available, but I have no idea how to check it.)

Thanks again.

 
