When running a KNL OpenMP job under VTune locksandwaits (sampling interval set to 2), the threads are shown starting at 3 millisecond intervals. With KMP_HW_SUBSET=64c,1t, threads 1 through 62 start up over a period of about 200 milliseconds and then wait (after a tiny execution interval) until all 62 are available. Even after the threads leave the initial wait state, there is roughly a 90 millisecond spread from first to last worker thread until they are forced to synchronize at a barrier, so timing across that barrier may show an extra delay of up to 90 milliseconds.
While this is a much smaller overhead than was observed on KNC (presumably largely because KNC needed more threads), it appears to mean that full performance can't be reached by a job that doesn't run for several seconds. I suppose this is to be expected, but some customers are disappointed. Some may even like the resulting appearance of super-linear scaling as job size increases, since the delay occurs only once at the beginning.
VTune locksandwaits also takes on the order of 2 seconds before worker-thread creation begins, which leaves some doubt in the beholder's mind about the extent to which VTune itself affects performance. The [advanced-]hotspots analyses seem to have a more reasonable overhead.
Evidently, for some purposes, the tactic of inserting a preliminary thread-pool warm-up and excluding it from performance reporting (or counting only job repetitions after the first) may be valid.
Yes, obviously we'd all like there to be no overheads for anything. (I have a very nice fluffy pink unicorn for sale if you are interested.)
However, overheads exist and can't be completely eliminated. If you are really concerned about the performance of codes that run for 1s, I have to ask you a few questions:
- How long does the shell (or Python) script that sets up for the code take to run?
- If you're not using a script, how long does it take to type the command to run the code?
Both of those times are also serial overhead which affects the overall time to solution.
Unfortunately there's not much that we can do in the OpenMP runtime to reduce thread-creation cost, since there is significant serialization inside the kernel.
James C. (Intel) wrote:
> Yes, obviously we'd all like there to be no overheads for anything. (I have a very nice fluffy pink unicorn for sale if you are interested.)
> However, overheads exist and can't be completely eliminated. If you are really concerned about the performance of codes that run for 1s, I have to ask you a few questions:
> - How long does the shell (or Python) script that sets up for the code take to run?
> - If you're not using a script, how long does it take to type the command to run the code?
>
> Both of those times are also serial overhead which affects the overall time to solution.
> Unfortunately there's not much that we can do in the OpenMP runtime to reduce thread-creation cost, since there is significant serialization inside the kernel.
Thanks, I think you're confirming that this OpenMP startup behavior is to be expected, and that VTune is correct in reporting thread creation as serial time.
>>Unfortunately there's not much that we can do in the OpenMP runtime to reduce thread-creation cost
Possibly you can. If the current implementation has the main thread creating the entire thread pool, you could change this to a binary-tree style of thread-pool creation. Any decent OS should permit a high degree of concurrency within the kernel: for example, while heap allocation may be serialized, memory wipe (if performed) and/or portions of VM mapping can proceed concurrently. BTW, I think TBB creates its thread pool in this manner.
Jim Dempsey
>>you could change this to a binary tree type of thread pool creation.
Been there, done that (and with other branching ratios). It didn't give a useful improvement.
