I'm using Intel VTune to analyze my parallel application.
As you can see, there is an huge Spin Time at the beginning of the application (represented as the orange section on the left side):
It's more than 28% of the application durations (which is roughly 0.14 seconds)!
As you can see, these functions are _clone
, start_thread
, _kmp_launch_thread
and _kmp_fork_barrier
and they look like OpenMP internals or system calls, but it's not specified where these fucntion are called from.
In addition, if we zoom at the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, to my knowledge I never call initInterTab2d
and I have no idea if it's called by some of the labraries that I'm using (especially OpenCV).
Digging deeply and running an Advanced Hotspot analysis I found a little bit more about the firsts unkown functions:
And exaplanding tthe Function/Call Stack tab:
But again, I can't really understand why these functions are called, why they take so long and why only the master thread works during them, while the others are in a "barrier"/waiting state.
I attached part of the code, if it can be useful.
Notice that I have only one #pragma omp parallel
region, which is the selected section of this image (on the right side):
The code structure is the following:
- Compute some serial, non parallelizable stuff. In particular, compute a chain of blurs, which is represented by
gaussianBlur
(included at the end of the code). cv::GaussianBlur
is an OpenCV function which exploits IPP.
- Start the parallel region, where 3
parallel for
are used
- The first one calls
hessianResponse
- A single thread add the results to a shared vector.
- The second parallel region
localfindAffineShapeArgs
generates the data used by the next parallel region. The two regions can't be merged because of load imbalance.
- The third region generates the final result in a balanced way.
- Note: according to the lock analysis of VTune, the
critical
and barrier
sections are not the reason of spinning.