What is _kmp_fork_barrier, and how can I tell whether there is load imbalance?
I'm using Intel VTune Amplifier to see how my parallel application scales.
It scales pretty well on my 4-core laptop (considering that some portions of the algorithm can't be parallelized):
However, when I test it on the Knights Landing (KNL), it scales horribly:
Note that I'm deliberately using only 64 cores.
Why is there so much idle time? And what is _kmp_fork_barrier? From reading about "Imbalance or Serial Spinning (OpenMP)", it seems that this is about load imbalance, but I'm already using schedule(dynamic,1) in all OpenMP regions.
How can I see if this is actually load imbalance? Otherwise, what could be a possible cause?
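For reference, one way to check for imbalance by hand, independently of VTune, might be to time how long each thread stays busy inside a work-sharing loop and compare the slowest thread against the average. The following is only a minimal sketch under assumptions, not my actual application code: `work`, `imbalance_ratio` and `measure_imbalance` are hypothetical names, and `work` is an artificial loop body whose cost grows with the iteration index.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds; falls back to CPU time when built without OpenMP. */
static double now(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical stand-in for one loop iteration; cost grows with i on purpose. */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < (i + 1) * 100; ++k) s += k * 1e-9;
    return s;
}

/* Slowest thread's busy time divided by the mean busy time.
   1.0 means perfectly balanced; larger values mean more imbalance. */
static double imbalance_ratio(const double *busy, int n) {
    assert(n > 0);
    double max = 0.0, sum = 0.0;
    for (int t = 0; t < n; ++t) {
        if (busy[t] > max) max = busy[t];
        sum += busy[t];
    }
    return sum > 0.0 ? max * n / sum : 1.0;
}

/* Runs one schedule(dynamic,1) loop and returns its imbalance ratio
   (or -1.0 if the timing buffer cannot be allocated). */
static double measure_imbalance(int n_iters) {
    int max_threads = 1, team = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    double *busy = calloc((size_t)max_threads, sizeof *busy);
    if (!busy) return -1.0;

    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        /* Record the real team size; it may be smaller than max_threads. */
        #pragma omp single
        team = omp_get_num_threads();
#endif
        double t0 = now();
        /* nowait: stop the clock when this thread runs out of work,
           not after it has waited for the others at the barrier. */
        #pragma omp for schedule(dynamic, 1) nowait
        for (int i = 0; i < n_iters; ++i) total += work(i);
        busy[tid] = now() - t0;
    }
    volatile double sink = total; /* keep the computation from being optimized away */
    (void)sink;

    double ratio = imbalance_ratio(busy, team);
    free(busy);
    return ratio;
}
```

Calling `measure_imbalance(2000)` from `main` and printing the result (compiled with the compiler's OpenMP flag, e.g. `-qopenmp` or `-fopenmp`) would give a number close to 1.0 when the threads finish at about the same time. A value well above 1.0 would point to imbalance inside the region, whereas a value near 1.0 combined with a lot of idle time would suggest the threads are waiting somewhere outside the loop itself.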