I profiled my C++ application, which is parallelized with OpenMP.
As I wrote in the title, VTune shows a very high number of instructions retired in '__kmp_wait_template<kmp_flag_64>'.
(It is the top-ranked function.)
So I think CPU resources are being wasted in my code.
What exactly does the function '__kmp_wait_template<kmp_flag_64>' do?
Does it mean that there is a large workload skew between the threads?
You may have noticed that a bug resembling this was present in libiomp5 prior to the 16.0.1 compiler release, so it would help if you stated which version you are using or supplied more information.
This is time spent in the OpenMP runtime, either spinning on a barrier (load imbalance) or, for worker threads, waiting for parallel work.
You can find this classification in the column on the right (pink cell).
I would recommend either enabling the "Analyze OpenMP Regions" knob for your analysis type or using HPC Performance Characterization, where it is enabled by default, to learn how efficiently the application uses OpenMP in each lexical OpenMP region.
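To illustrate the barrier-imbalance case, here is a minimal sketch, not taken from your application. The function names (skewed_work, run_static, run_dynamic), the loop bounds, and the chunk size of 16 are all hypothetical. The cost of each iteration grows with its index, so with schedule(static) the threads that receive the low indices are likely to finish early and then wait at the implicit barrier at the end of the loop, which is the kind of time the runtime wait function accounts for. With schedule(dynamic) the iterations are handed out in small chunks, which usually evens out the load. It assumes the code is built with OpenMP enabled (for example -fopenmp with GCC/Clang or -qopenmp with the Intel compiler); without that flag the pragmas are ignored and the loops run serially.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical workload: iteration i costs roughly 100 * i inner steps,
// so high-index iterations are far more expensive than low-index ones.
static std::uint64_t skewed_work(int i) {
    assert(i >= 0);
    std::uint64_t acc = 0;
    const long long steps = 100LL * i;
    for (long long k = 0; k < steps; ++k) {
        acc += static_cast<std::uint64_t>(k % 7);
    }
    return acc;
}

// Static schedule: each thread gets one contiguous block of iterations.
// The thread holding the last block does most of the work; the others
// reach the implicit barrier early and wait there.
std::uint64_t run_static(int n) {
    std::uint64_t total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += skewed_work(i);
    }
    return total;
}

// Dynamic schedule: threads fetch chunks of 16 iterations on demand,
// so a thread that finishes cheap iterations picks up more work
// instead of waiting at the barrier.
std::uint64_t run_dynamic(int n) {
    std::uint64_t total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += skewed_work(i);
    }
    return total;
}
```

Both functions return the same sum; only the distribution of work across threads differs. Profiling the two variants separately with the OpenMP region analysis mentioned above should show whether the wait time in your case comes from imbalance of this kind.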
Thanks & Regards, Dmitry