On a single-core system with HyperThreading, a thread waiting within an unexpired KMP_BLOCKTIME would be doing something like a SpinLock, which could functionally be called a "SpinGo". To accomplish this, the stalled thread must be looking at a shared memory variable, and in order to force cache coherency with the other processors (the other HT thread in this case) an instruction is issued (LOCK?) that forces all processors to invalidate their cache so that all can see the potential change in the value of the variable used for the "SpinGo".
There are two opposing forces in effect. The waiting thread wants to get going as soon as possible, so it keeps asking "Are we there yet, are we there yet, ..." in order to get the answer as soon as possible. The opposing force is the other thread, which is trying to execute to the synchronization point and flag "we are there now". But in the process of getting "there", its cache keeps getting flushed due to the activities of the impatient thread.
One way to fix this is to reduce the frequency of polling the flag.
The question I have is: which way is implemented?
The reason I ask is that OpenMP on a single-core HT processor runs significantly slower than a single thread. From the literature it would seem that some improvement would be expected (10%-20% depending on the application).
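To make the "poll less often" idea concrete, here is a minimal sketch in C11. It is not the Intel OpenMP runtime's code, which this thread has not seen; the names spin_wait_backoff, cpu_relax, max_gap and max_polls are hypothetical. The waiter doubles the gap between polls of the flag up to a ceiling, and gives up after a fixed number of polls, roughly the way a runtime might stop spinning once its block time expires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: one "relax" step between polls.
   On x86 this is the PAUSE instruction; elsewhere it is a no-op. */
#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)
#endif

/* Poll *flag until it becomes nonzero, doubling the gap between polls
   (capped at max_gap relax steps) so the working thread is disturbed
   less and less often.
   Returns the number of polls it took, or 0 if the flag was still
   clear after max_polls polls (the caller would then sleep or yield). */
static unsigned long spin_wait_backoff(atomic_int *flag,
                                       unsigned max_gap,
                                       unsigned long max_polls)
{
    unsigned gap = 1;
    assert(flag != NULL);
    for (unsigned long polls = 1; polls <= max_polls; ++polls) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return polls;            /* "we are there now" */
        for (unsigned i = 0; i < gap; ++i)
            cpu_relax();             /* stay quiet between polls */
        if (gap < max_gap)
            gap *= 2;                /* poll less and less often */
    }
    return 0;                        /* gave up spinning */
}
```

The trade-off is the one described above: a larger max_gap leaves the working thread alone for longer, at the cost of the waiter noticing the flag a little later.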
In the IA-32 Intel Architecture Optimization Reference Manual, 7-1 advises inserting PAUSE in spin-wait loops. With IVF and OpenMP, is PAUSE inserted? The reason I ask is that on a P4 530 with HT there is a significant hit (25% to 50%) when running multiple threads with unbalanced workloads, e.g. one thread processing an array with 1000 elements while the other thread processes a different array with 500 elements.
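A small reproducer for the unbalanced case might look like the sketch below. It is written in C rather than Fortran purely for illustration, and the function names and the per-element work are hypothetical stand-ins for the real application. Two OpenMP sections process arrays of different lengths, so the thread that draws the short array finishes early and waits at the implicit barrier at the end of the sections construct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-array work: sum of squares. */
static double process(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * a[i];
    return sum;
}

/* Run two jobs of unequal size in parallel sections. The thread that
   gets the shorter array waits at the implicit barrier that ends the
   sections construct while the other thread is still working. */
static double run_unbalanced(const double *long_arr, size_t n_long,
                             const double *short_arr, size_t n_short)
{
    double s_long = 0.0, s_short = 0.0;
    assert(long_arr != NULL && short_arr != NULL);
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        s_long = process(long_arr, n_long);
        #pragma omp section
        s_short = process(short_arr, n_short);
    }
    return s_long + s_short;
}
```

Calling run_unbalanced many times inside a timed loop with 1000 and 500 elements, once built with OpenMP enabled and once without (the pragmas are then ignored and the code runs serially), should show whether the waiting thread is slowing the working one. Varying the KMP_BLOCKTIME environment variable between runs would help separate the cost of the spin wait from other effects.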
There may be more than one effect here. I'm not an expert on the spin waits, while some who read this should be. I believe the Intel OpenMP library loops a little while before issuing a PAUSE. If you can submit an example showing a need for improvement, that might be useful. I ran into a problem where I showed the Intel 9.0 vectorization was not always good in parallel regions, and we expect this to be fixed in the 9.1 compilers. Currently, this issue can easily be severe enough that a build with OpenMP off will out-perform an OpenMP build running with HT.
I will have to step into the asm code of the loop using the debugger. I thought someone here would know the answer. It would be interesting if the Intel Optimization Reference Manual strongly recommends using PAUSE while their development team ignores the recommendation. This would be a shame because the use of PAUSE can be free
---- to ----
Then there is no penalty for having a PAUSE in the event that (sync_var == constant_value).
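One way such a loop might be arranged is sketched here in C11. This is an assumption about the shape of the loop, not the code from the post and not what IVF emits; wait_for_value, cpu_relax and max_pauses are hypothetical names, and max_pauses exists only to keep the example bounded. The variable is tested before any PAUSE is issued, so a thread that arrives when the value already matches executes no PAUSE at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: one "relax" step between polls.
   On x86 this is the PAUSE instruction; elsewhere it is a no-op. */
#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)
#endif

/* Wait until *sync_var == constant_value, testing BEFORE pausing.
   Returns the number of PAUSE steps executed: 0 when the value already
   matched on entry, max_pauses if the wait was abandoned. */
static unsigned long wait_for_value(atomic_int *sync_var,
                                    int constant_value,
                                    unsigned long max_pauses)
{
    unsigned long pauses = 0;
    assert(sync_var != NULL);
    while (atomic_load_explicit(sync_var, memory_order_acquire)
           != constant_value) {
        if (pauses == max_pauses)
            break;                   /* bounded for the example only */
        cpu_relax();                 /* PAUSE only after a failed test */
        ++pauses;
    }
    return pauses;
}
```

The return value makes the point measurable: it is zero whenever the synchronization variable already holds the expected value on entry.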
Sorry for the insertion problem in the prior message post. The Reply to Message form on this forum has too short a timeout (you cannot get distracted by a phone call); if it times out, you lose your message. So at times I type in haste.