I am using ifort 11.1.046 on Linux Intel64 and am just starting to add OpenMP to parallelize simulation code I have. The behavior that confuses me is this: if I parallelize only a minor component of the code (<5% of total runtime) with a fixed number of, say, four threads, I would expect the load for the entire program to be around 115% (0.95*1 + 0.05*4). Instead, it is consistently at 400%. Load reporting via "top" and the automatic fan adjustment agree with this assessment. Timing results (using 'time' or the built-in subroutine CPU_TIME) all report total user times exactly four times the real execution time. However, the real-time speed-up compared to serial execution is, of course, negligible. What am I missing?
Might your serial intervals be so short that the default settings of KMP_BLOCKTIME keep the threads active?
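One quick way to test this hypothesis is to force the threads to sleep immediately after each parallel region and watch whether the load drops. A minimal sketch from a bash-like shell (`./a.out` is a stand-in for the actual simulation binary):

```shell
# KMP_BLOCKTIME is the time, in milliseconds, that a worker thread
# spin-waits after a parallel region before going to sleep.
# Setting it to 0 makes threads sleep immediately.
export OMP_NUM_THREADS=4
export KMP_BLOCKTIME=0   # default is 200 (ms); "infinite" disables sleeping
./a.out                  # hypothetical simulation binary; observe load in "top"
```

If the reported load falls back toward the expected value with KMP_BLOCKTIME=0, the spin-waiting threads were the cause.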
Thanks, that's extremely helpful!
I played around with KMP_BLOCKTIME (after at least reading a bit of the documentation), and it does make a difference. The execution duration of the PARALLEL segments in this example is in fact very short, <1 ms, and the serial interval is ~4 ms. KMP_BLOCKTIME defaults to 200 ms, which would explain the behavior exactly. When I set KMP_BLOCKTIME to 1 ms, the load is still too high, but significantly less than the maximum (and it fluctuates on the timescale it is being recorded on). In the case where a thread doesn't go to sleep, though: what is it doing, and why does it show up as load? And what is the point of this dead time? I assume it is meant to prevent overhead, but in my example performance is unaffected by KMP_BLOCKTIME settings ranging from 1 ms to 2 s.
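The sweep described above can be scripted so the user-vs-real time discrepancy is visible for each setting in one run. A sketch (again with `./a.out` as a hypothetical binary name):

```shell
# Compare reported user time vs. real (wall-clock) time for a range of
# blocktime values; with spin-waiting, user time inflates toward
# OMP_NUM_THREADS * real time even though real time barely changes.
for bt in 0 1 200 2000; do
  echo "--- KMP_BLOCKTIME=${bt} ms ---"
  KMP_BLOCKTIME=$bt OMP_NUM_THREADS=4 /usr/bin/time ./a.out
done
```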
I suppose the threads sit in a spin-wait loop until they time out, with the objective of accelerating the resumption of work at the next parallel region and of helping the threads keep their core affinity. Because a spinning thread is executing instructions rather than sleeping, the kernel accounts it as fully busy, which is why it shows up as load. I don't know whether the KMP_BLOCKTIME waits are accounted separately if you run with the OpenMP profiling library (-openmp-profile). You can use it either by re-linking, or, for a default Linux dynamically linked OpenMP program, by setting the profiling shared object in LD_PRELOAD. It collects statistics on parallel regions and writes them to the file guide.gvs.
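A sketch of the LD_PRELOAD variant (the library name libiompprof5.so and the install path are assumptions based on a typical ifort 11.1 layout; adjust both to your installation):

```shell
# Substitute the profiling OpenMP runtime for the regular one at load
# time, without re-linking the application. Path and library name are
# assumptions for an ifort 11.1 Intel64 install; check your own tree.
export LD_PRELOAD=/opt/intel/Compiler/11.1/046/lib/intel64/libiompprof5.so
./a.out   # hypothetical binary; on exit, per-region stats land in guide.gvs
```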