Hi,
I profiled my program running on Xeon Phi in native mode with VTune and found that a lot of time goes to __kmp_hierarchical_barrier_release. What does this normally imply? I know it must be some OpenMP issue, but I have no idea how to solve it.
BTW, when the same piece of code runs on Xeon, VTune reports that a significant portion of time (much less than __kmp_hierarchical_barrier_release takes on the Phi, though) goes to __kmp_launch_threads.
Thanks in advance!
Hello,
The cause might be load imbalance at an OpenMP barrier: threads that reach the barrier early spin in a busy wait inside the OpenMP runtime, burning CPU time, and those runtime functions then show up among the VTune hotspots.
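To illustrate the kind of imbalance meant here, below is a minimal sketch in C. It is not taken from the original poster's program: `work`, `sum_static`, `sum_dynamic`, and the chunk size of 16 are all hypothetical, chosen only so that late iterations cost more than early ones. With `schedule(static)` each thread gets one contiguous block of iterations, so the thread holding the last block finishes long after the others, which meanwhile wait at the implicit barrier at the end of the loop. Handing out small chunks with `schedule(dynamic)` is one common way to even this out, though whether it helps in a given program depends on the workload.

```c
#include <assert.h>

/* Hypothetical workload: the cost of iteration i grows linearly with i,
 * so iterations near the end of the range are much more expensive. */
static double work(int i) {
    double acc = 0.0;
    for (int k = 0; k < i; ++k) {
        acc += (double)(k % 7);
    }
    return acc;
}

/* Static schedule: each thread receives one contiguous block of iterations.
 * The thread that owns the last block does the most work, and the others
 * wait for it at the implicit barrier that ends the parallel loop. */
double sum_static(int n) {
    assert(n >= 0);
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += work(i);
    }
    return total;
}

/* Dynamic schedule: threads fetch chunks of 16 iterations as they become
 * free, so the expensive iterations are spread across threads.
 * The chunk size is an assumption and usually needs tuning. */
double sum_dynamic(int n) {
    assert(n >= 0);
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += work(i);
    }
    return total;
}
```

Both functions compute the same sum; only the distribution of iterations over threads differs. Building with OpenMP enabled and profiling each variant separately should show whether the time attributed to the barrier shrinks with the dynamic schedule.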
I recommend using the /OpenMP Region/.. groupings to see the CPU time in each region broken down by classification, which should help you understand what the CPU time spent in OpenMP means (it is better to use VTune Amplifier XE 2015 Update 2 for this).
You can also look at the help article at https://software.intel.com/en-us/node/529832 for the methodology on OpenMP tuning that we offer in VTune.
Thanks & Regards, Dmitry