We have a numerical program that makes extensive use of OpenMP; there are many long-running, parallelized loops in the code. When this program runs on Windows 7, it usually consumes all of the cores in the box for the duration of the calculation. However, if you run it over and over, roughly one run in six (it varies), one particular parallel loop of the code will not use all the cores; it appears to be using only one thread. We are trying to figure out what is causing this behavior. Do you have any suggestions as to what we can do to diagnose it? We're using ifort 11.0.067. Also, OpenMP dynamic threads are enabled with omp_set_dynamic(.true.).
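For reference, here is a minimal sketch of the kind of setup described (the program and loop body are illustrative only, not our actual code):

```fortran
program dynamic_demo
  use omp_lib
  implicit none
  integer :: i
  real :: total
  total = 0.0

  ! Allow the runtime to adjust the number of threads per parallel
  ! region, as our application does.
  call omp_set_dynamic(.true.)

!$omp parallel do reduction(+:total)
  do i = 1, 1000000
     total = total + sin(real(i))
  end do
!$omp end parallel do

  ! Report the upper bound on threads the runtime may use.
  write(*,*) 'max threads available: ', omp_get_max_threads()
end program dynamic_demo
```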
Thanks,
Allen
11 Replies
Are your parallel regions nested and/or using nowait?
Jim Dempsey
It is definitely not a nested parallel region. I don't know about "nowait"; is that something that must be specified in the OMP directives?
Thanks,
Allen
Allen,
>>It is definitely not a nested parallel region
Are you sure?
Add some code to assert the assumptions.
In the subroutine that contains the parallel do, insert the following before the parallel do (the OMP_* runtime routines require USE OMP_LIB):
IF(OMP_IN_PARALLEL()) THEN
WRITE(*,*) "Break Here" ! put break here
ENDIF
!$OMP PARALLEL DO
...
The reason you might get in there is your application may recursively call the subroutine containing the parallel region. The test code will confirm/disclaim this.
The other thing to do is insert inside the parallel region
IF(.NOT. OMP_IN_PARALLEL()) THEN
WRITE(*,*) "Break Here" ! put break here
ENDIF
IF(OMP_GET_NUM_THREADS() .LE. 1) THEN
WRITE(*,*) "Break Here" ! put break here
ENDIF
These are sanity checks (asserts). Remember to remove them later.
Jim Dempsey
Nowait is a clause that can be added to an OpenMP "DO" directive (it is not the default). It removes the implied barrier at the end of the loop, so subsequent code, or a subsequent loop, can start executing before the first one has completed.
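For illustration (this is a made-up sketch, not code from the original post), NOWAIT on a worksharing DO inside an explicit parallel region looks like this:

```fortran
program nowait_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 100
  integer :: i
  real :: a(n), b(n)

!$omp parallel
!$omp do
  do i = 1, n
     a(i) = real(i)
  end do
!$omp end do nowait   ! no implied barrier: threads proceed immediately

!$omp do
  do i = 1, n
     b(i) = 2.0 * real(i)   ! safe only because this loop never reads a(:)
  end do
!$omp end do
!$omp end parallel

  write(*,*) a(n), b(n)
end program nowait_demo
```

If the second loop depended on results of the first, removing the barrier this way would be a race.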
Does your system have hyperthreading? You might try setting KMP_AFFINITY, e.g. to physical or scatter, so that you don't schedule more than one thread per physical core until necessary.
Most likely the behavior you observe is triggered by the dynamic scheduling. Does it go away when you disable dynamic scheduling? Is there significant other activity on your system that might trigger a reduction in the number of threads?
Intel has a tool, Intel VTune Amplifier XE, which can be used to monitor activity system-wide, and to investigate thread behavior, barriers, load balance, etc. within your application. (The threading part of this used to be in a separate, standalone tool called Thread Profiler.)
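To illustrate the KMP_AFFINITY suggestion, the environment could be set like this before launching the program (Windows cmd syntax; the executable name is a placeholder):

```shell
:: Pin one OpenMP thread per physical core (useful with Hyper-Threading)
:: and have the runtime report the placement at startup.
set KMP_AFFINITY=verbose,scatter

:: Disable dynamic adjustment of the thread count, for comparison runs.
set OMP_DYNAMIC=FALSE

myprogram.exe
```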
Hi Jim: Thanks for the tips. We will investigate your suggestions.
What is the effect of nested parallel regions?
Hi Martyn: Thank you, too, for the tips.
Indeed, it does appear that if we turn off dynamic scheduling, the problem goes away. Not only that, but the whole code runs faster with dynamic scheduling turned off than it does with dynamic scheduling turned on.
I'm relying on a report from a user; he says that nothing else obvious is running. Is there something on Windows 7 that wakes up periodically and could interfere with the dynamic-scheduling determination of the number of threads or cores available? Would this show up under VTune Amplifier?
I believe that hyperthreading is turned off, but we can try KMP_AFFINITY, too.
Typically you use nested parallel regions when the outer nest level(s) is NOT using the full complement of threads.
If the outer level is using all the threads and you enter a nested level, all threads other than the current thread will already be busy; depending on the scheduling, you may end up with the iteration space divided among just one thread (as opposed to n threads).
Nesting is an important feature, but you need to know how to best use it.
Example: assume a very large matrix
! nested parallelism must be enabled first, e.g. call omp_set_nested(.true.)
! note: omp_get_max_threads() is used here because omp_get_num_threads()
! returns 1 when called outside a parallel region
halfThreads = omp_get_max_threads() / 2
if (halfThreads .eq. 0) halfThreads = 1
!$omp parallel do num_threads(halfThreads)
do iRow = 1, nRows
  !$omp parallel do
  do iCol = 1, nCols
    call FOO(iRow, iCol)
  end do
  !$omp end parallel do
end do
!$omp end parallel do
The above assumes static scheduling.
With 4 threads, you get 4 quadrants of the matrix.
Jim Dempsey
Well, all sorts of security stuff keeps waking up on my Windows 7 laptop, but I wouldn't expect that to be a problem in a lab or production system. Yes, VTune Amplifier would show whether other processes were taking a lot of resources. However, this is just the sort of circumstance that dynamic scheduling is supposed to help with.
Dynamic scheduling is part of the OpenMP standard, so the API has been implemented. But I suspect it's not widely used and hasn't been tuned much. That your app runs faster without it is already an indication that it's not needed. There are environment variables you could try tuning (KMP_DYNAMIC_MODE, KMP_BLOCKTIME and KMP_LIBRARY, described in the user and reference guide). But my advice is to keep it simple and not use dynamic scheduling, since you don't have a clear need for it. You could find yourself investing time and effort without getting anything useful out of it.
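If you do experiment, those variables would be set before launching the program, e.g. (Windows cmd syntax; the values shown are illustrative, not recommendations):

```shell
:: KMP_LIBRARY selects the runtime mode: serial, turnaround, or throughput.
set KMP_LIBRARY=throughput
:: KMP_BLOCKTIME is the time in ms a thread spin-waits before sleeping.
set KMP_BLOCKTIME=200
:: KMP_DYNAMIC_MODE selects how the thread count is chosen when
:: dynamic threads are enabled.
set KMP_DYNAMIC_MODE=load_balance
```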
Later: From the documentation for KMP_DYNAMIC_MODE, it appears to me that true dynamic scheduling with load balancing may only be implemented on Linux. I don't think there's any reason for you to use it on Windows, especially as you say your app fully occupies 6 threads most of the time.
I'm told that dynamic scheduling with load balancing is implemented for Windows in the latest Intel compilers, (for example, in version 12.0, contained in the Intel Visual Fortran Composer XE product), but not in the version 11.0 compiler that you are using. The developers tell me that the overhead of measuring the load shouldn't be very high, but that there may be quite a delay (seconds) between a change in the load and the OpenMP library detecting and reacting to it.
So for your present compiler, the advice is still not to use dynamic scheduling. But if you feel you have a real need for this, you might try updating to a newer compiler.
Hi Martyn: Thank you so much for the feedback. I think we'll stick with 11.1 and dynamic threads off for now; that yields the best performance.
Thanks,
Allen
Thanks,
Allen
Hi Jim: Thank you so much for your help. I appreciate your explanations.
I think the upshot of our problem is that our parallel regions are so long running, on the order of dozens of seconds, that any perturbation of the dynamic thread selection at the start of the region has a large impact on the overall parallel efficiency of the program. As I mentioned to Martyn, setting omp_dynamic to false appears to result in the best performance.
Thanks,
Allen