I am using VTune to profile some Python code and measure its GFLOPS.
The baseline code is written with NumPy and the optimized code uses Numba (with the @njit and @vectorize decorators). The Numba code is about 8 times faster than the NumPy baseline; however, VTune shows that NumPy and Numba achieve the same GFLOPS.
I just want to confirm: can the latest VTune report GFLOPS correctly for Numba-compiled Python code?
Is there a benchmark or example code for profiling Numba code with VTune?
Thanks and regards
Reply for testing
The upcoming Update 1 release of Parallel Studio will contain better support for Numba profiling, but that mostly concerns how Numba code is displayed and referred to in VTune. I'll leave it to others to comment on how the GFLOPS metric works in VTune, but I can explain the difference in performance between Numba and NumPy. Numba fuses all the vectorized operations into a single loop over the data, so it does not need to store intermediate results to memory and read them back for the next operation, which NumPy usually does. So both work with memory at the same rate, but Numba is much more efficient in terms of the number of memory operations it performs.
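To illustrate the fusion point, here is a minimal sketch, not taken from the original poster's code. The expression `a * b + c` and the function names `numpy_version` and `fused_version` are made up for the example, and the `try/except` fallback is only there so the snippet still runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for illustration: fall back to plain Python if Numba is missing.
    def njit(func):
        return func


def numpy_version(a, b, c):
    # Each operator is a separate pass over memory with its own temporary array.
    tmp = a * b        # pass 1: read a and b, write tmp
    return tmp + c     # pass 2: read tmp and c, write the result


@njit
def fused_version(a, b, c):
    # One pass: every element is read once and the result is written once,
    # with no intermediate array stored to memory.
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = a[i] * b[i] + c[i]
    return out


n = 10_000
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

# Both versions do the same arithmetic; only the memory traffic differs.
same = np.allclose(numpy_version(a, b, c), fused_version(a, b, c))
print("results match:", same)
```

Both functions perform the same two floating-point operations per element; the fused loop simply avoids writing and re-reading the temporary.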
Anton Malakhov (Intel) wrote: The upcoming Update 1 release of Parallel Studio will contain better support for Numba profiling, but that mostly concerns how Numba code is displayed and referred to in VTune. I'll leave it to others to comment on how the GFLOPS metric works in VTune, but I can explain the difference in performance between Numba and NumPy. Numba fuses all the vectorized operations into a single loop over the data, so it does not need to store intermediate results to memory and read them back for the next operation, which NumPy usually does. So both work with memory at the same rate, but Numba is much more efficient in terms of the number of memory operations it performs.
Thanks very much for your reply.
I have another question about profiling Python (NumPy-based) code with VTune. When VTune calculates and reports GFLOPS, does it count both kernel compute time and API call time, or just kernel compute time? Can VTune also report the total number of floating-point operations in a program? (I think that number should be available, but I don't know how to find it.)
Thanks again.
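As a cross-check that does not depend on the profiler, the floating-point operation count of a simple expression can be worked out by hand and divided by wall-clock time. The sketch below is only that: the expression `a * b + c`, the helper `estimate_gflops`, and the array size are assumptions for illustration, and this is not a description of how VTune derives its own metric.

```python
import time

import numpy as np


def estimate_gflops(flop_count, seconds):
    # GFLOPS = floating-point operations / elapsed seconds / 1e9
    return flop_count / seconds / 1e9


n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

start = time.perf_counter()
result = a * b + c
elapsed = time.perf_counter() - start

# One multiply and one add per element, counted analytically.
flops = 2 * n

print(f"elapsed: {elapsed:.6f} s")
print(f"estimated: {estimate_gflops(flops, elapsed):.3f} GFLOPS")
```

Note that the elapsed time here includes Python and NumPy call overhead, not just the arithmetic, so the estimate depends on which time interval is measured.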