I'm using Intel VTune to analyze my parallel application.
As you can see, there is a huge Spin Time at the beginning of the application (represented as the orange section on the left side):

It's more than 28% of the application's duration (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier. They look like OpenMP internals or system calls, but it's not specified where these functions are called from.
In addition, if we zoom in on the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, to my knowledge I never call initInterTab2d, and I have no idea whether it's called by one of the libraries that I'm using (especially OpenCV).
Digging deeper and running an Advanced Hotspots analysis, I found out a little more about the first unknown functions:

And expanding the Function/Call Stack tab:
But again, I can't really understand why these functions are called, why they take so long and why only the master thread works during them, while the others are in a "barrier"/waiting state.
I attached part of the code, in case it is useful.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):

The code structure is the following:

- Compute some serial, non-parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
- Start the parallel region, where 3 parallel for loops are used:
  - The first one calls hessianResponse.
  - A single thread adds the results to a shared vector.
  - The second parallel loop, localfindAffineShapeArgs, generates the data used by the next parallel loop. The two loops can't be merged because of load imbalance.
  - The third loop generates the final result in a balanced way.
- Note: according to the lock analysis of VTune, the critical and barrier sections are not the reason for the spinning.
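
To make the structure above easier to follow, here is a rough, simplified sketch of it. This is not the real code: runPipeline, blurStep, hessianResponseStub, affineArgsStub and finalStub are hypothetical placeholders that only stand in for the blur chain, hessianResponse, localfindAffineShapeArgs and the final stage, and the data is reduced to a plain vector of floats. The only thing it is meant to show is the layout: a serial part, followed by a single parallel region containing three work-shared loops and one single block.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real work (assumptions, not the actual functions).
static float blurStep(float v) { return 0.5f * v; }           // one step of the blur chain
static float hessianResponseStub(float v) { return v * v; }   // stands in for hessianResponse
static float affineArgsStub(float v) { return v + 1.0f; }     // stands in for localfindAffineShapeArgs
static float finalStub(float v) { return 2.0f * v; }          // stands in for the final stage

std::vector<float> runPipeline(std::vector<float> data, int blurCount) {
    assert(blurCount >= 0);

    // Serial part: each blur depends on the previous one, so the chain is not parallelized.
    for (int b = 0; b < blurCount; ++b)
        for (float &v : data) v = blurStep(v);

    const int n = static_cast<int>(data.size());
    std::vector<float> responses(n), found, args, result;  // shared by all threads

    // The only parallel region of the program.
    #pragma omp parallel
    {
        // First loop: compute the responses (implicit barrier at the end).
        #pragma omp for
        for (int i = 0; i < n; ++i) responses[i] = hessianResponseStub(data[i]);

        // One thread collects the results into a shared vector
        // (implicit barrier at the end of the single block).
        #pragma omp single
        {
            found.assign(responses.begin(), responses.end());
            args.resize(found.size());
            result.resize(found.size());
        }

        // Second loop: generate the data used by the third loop.
        #pragma omp for
        for (int i = 0; i < n; ++i) args[i] = affineArgsStub(found[i]);

        // Third loop: generate the final result.
        #pragma omp for
        for (int i = 0; i < n; ++i) result[i] = finalStub(args[i]);
    }
    return result;
}
```

Calling runPipeline({4.0f, 8.0f}, 2) runs the two serial blur steps first and only then enters the parallel region. Compiled without OpenMP support, the pragmas are ignored and the same code runs serially.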