Intel® Fortran Compiler

Using omp_set_dynamic(.true.)

Andrew_Smith
Valued Contributor I


My Fortran program has one parallel region, in which it spends most of its time, and it uses OpenMP TASK extensively.

Since my program can take several minutes to run, I would like to yield some processing time so the user can get on with other tasks on his workstation while he waits for my program.

So I call omp_set_dynamic(.true.) and mkl_set_dynamic(1) before I enter my parallel region. MKL is called only by my master thread, and no other tasks are running at that point.
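
For reference, a minimal sketch of this setup (the task loop body is just a stand-in for the real work, and mkl.fi is assumed to be on the include path for the mkl_set_dynamic interface):

    program dyn_demo
       use omp_lib
       implicit none
       include 'mkl.fi'        ! MKL service-routine interfaces (mkl_set_dynamic, ...)
       integer :: i, j
       real    :: total(100)

       ! Ask the OpenMP and MKL runtimes to adjust their own thread counts.
       call omp_set_dynamic(.true.)
       call mkl_set_dynamic(1)

       !$omp parallel shared(total)
       !$omp single
       do i = 1, 100
          !$omp task firstprivate(i) private(j)
          total(i) = 0.0
          do j = 1, 100000
             total(i) = total(i) + sin(real(j))   ! stand-in for the real work
          end do
          !$omp end task
       end do
       !$omp end single
       !$omp end parallel

       print *, sum(total)
    end program dyn_demo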

How often do OpenMP and MKL adjust the number of threads they use?

jimdempseyatthecove
Honored Contributor III

The dynamic settings (omp and mkl) do not relate to interactions with other processes (programs) running on your system; rather, they relate to other parallel regions within your program that may be running concurrently. The usual procedure is to reduce your thread pool by one or a few threads. A different route is to use the KMP_... settings to select one thread per core (on a host with HT), leaving the other threads in each core available to other programs. You will have to experiment to find what works best. On a larger SMP system it is not unusual for an application to restrict itself to fewer than all the available processors.
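
A minimal sketch of the first approach, assuming the standard OpenMP routines; the size of the margin (here one logical processor) is the part you would experiment with:

    subroutine limit_threads()
       use omp_lib
       implicit none
       integer :: nthreads

       ! Leave one logical processor free for the rest of the system.
       nthreads = max(1, omp_get_num_procs() - 1)
       call omp_set_num_threads(nthreads)
    end subroutine limit_threads

The one-thread-per-core alternative is set from outside the program, typically via the KMP_AFFINITY and OMP_NUM_THREADS environment variables rather than in code.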

Jim Dempsey

Andrew_Smith
Valued Contributor I

Yes, I agree that reducing the number of threads used by your number-crunching application by one or two helps you do other things on your computer. But the user is then required to put some thought into running the application each time he uses it, or he may lose some performance. This is too much to ask of the average computer user.

The dynamic setting should be able to reduce the threads automatically when necessary. It should detect when the computer is busy and use fewer threads, then increase to full thread use when all cores are available. My question was about this feature, which you seem to be unaware of.

jimdempseyatthecove
Honored Contributor III

There is no automatic load-balancing feature. You will have to instrument your code to see if/when a load imbalance occurs (as a result of preemption by other programs). When it is observed, reduce the thread count on your next parallel region. One way of doing this is to create an array of __int64 variables that receive the last __rdtsc() value of each thread, indexed by OpenMP team member number (misnamed as thread number). After the parallel region, the instantiating thread (normally the main thread, except when nesting is used and a thread other than team member 0 instantiates the nested region) examines the termination times. A substantial skew in the termination times of one or more threads can be indicative of other activity on the system (it can also occur when the load itself is imbalanced). At that point, you can elect to reduce the thread count for the next time you enter a parallel region.
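
A minimal Fortran sketch of this scheme, using omp_get_wtime() in place of __rdtsc() (which has no standard Fortran interface); the 25% skew threshold and the back-off-by-one policy are assumptions to tune:

    subroutine timed_region(nthreads_next)
       use omp_lib
       implicit none
       integer, intent(inout) :: nthreads_next
       double precision, allocatable :: finish(:)
       double precision :: t0, elapsed, skew
       integer :: nteam

       allocate(finish(0:nthreads_next-1))
       finish = 0.d0
       t0 = omp_get_wtime()

       !$omp parallel num_threads(nthreads_next) shared(finish, nteam)
       !$omp single
       nteam = omp_get_num_threads()   ! actual team size (dynamic may shrink it)
       !$omp end single
       ! ... the real work (tasks, MKL calls, ...) goes here ...
       finish(omp_get_thread_num()) = omp_get_wtime()   ! last act of each member
       !$omp end parallel

       elapsed = maxval(finish(0:nteam-1)) - t0
       skew    = maxval(finish(0:nteam-1)) - minval(finish(0:nteam-1))

       ! A large spread of finish times relative to the region's duration suggests
       ! some threads were preempted (or the work was imbalanced); back off by one.
       if (skew > 0.25d0 * elapsed) then
          nthreads_next = max(1, nteam - 1)
       end if
    end subroutine timed_region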

Jim Dempsey

TimP
Honored Contributor III

The run-time library for Cilk(tm) Plus and TBB is intended to handle varying numbers of available thread contexts better than OpenMP does. This comes at a significant performance penalty for many of the cases where OpenMP excels, which are typically those where the OpenMP job is given exclusive use of a specified group of cores.

I don't find much documentation on what omp_set_dynamic is expected to do in practice. The analogous mkl_dynamic is better documented, in that it enables (by default) automatically choosing an appropriate number of threads within MKL, not exceeding the number of cores (on a Xeon host). The number would be expected to be chosen according to problem-size parameters, not according to the level of competing work load.

If you must run competing jobs with OpenMP without reserving specific cores for each job, leaving affinity unset, reducing the number of threads, and letting the OS scheduler do the work is better than potentially forcing threads to run on the same core. As Jim hinted, threads which happen to be forced to share resources will take significantly longer than the luckier ones.

The tradition is slightly different for MPI. Several major MPI implementations have adopted the tactic of Intel MPI in setting affinity with ranks and threads appropriately distributed by default (including use of OpenMP affinity). This default has to be overridden when running competing tasks. MKL documentation states that MKL chooses single-threaded mode when running with an MPI it doesn't recognize.
