Software Tuning, Performance Optimization & Platform Monitoring

OpenMP difference: 32-bit vs 64-bit

Windyman
Beginner
1,459 Views

Dear all,

Not sure whether this is the right sub-forum but here is my problem:

I am building a program with the Intel Fortran compiler (w_comp_lib_2016.1.146) on my Windows 7 machine with Visual Studio 2013, in both a 32-bit and a 64-bit version. The program uses OpenMP directives to parallelize the most intensive do loop:
        !$OMP PARALLEL private(..)
        !$OMP DO
        do ....
          do ..
          ...
          end do
        end do
        !$OMP END DO
        !$OMP END PARALLEL

I am running the program on a high-performance computer (Windows Server 2012 R2) with 2 nodes of 20 CPUs each (Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz). Whereas the 64-bit program uses all available CPUs on both nodes (as observed in Resource Monitor), the 32-bit application uses only one node and is therefore significantly slower. The RAM usage is limited to approximately 50Mb.

Any clue on what is causing the 32-bit application to use only 1 node? I could not find an answer after searching the forum.

Best regards

7 Replies
TimP
Honored Contributor III

As you are using ifort, the Windows Fortran forum may be more useful. 

It appears you have missed a few points in your forum searches.

1) "all CPUs" isn't a very suitable term if you mean all logical CPUs.  Most Fortran applications perform better with OMP_NUM_THREADS set to the number of cores, and 'SET OMP_PLACES=cores'.  You seem to imply you haven't tried these settings.  If the settings aren't acceptable, you should try disabling HyperThreading.

2) The default value of OMP_STACKSIZE is 2MB in 32-bit mode and 4MB in 64-bit mode.  I don't know of any way in which the default OMP_NUM_THREADS would be adjusted accordingly.  It's easy to run out of stack space when you run so many threads.  If you increase OMP_STACKSIZE, you may also need to increase the link stack setting.

3) (not directly related to your question) on account of the limited address space in 32-bit mode, ifort skips some of the optimizations which it performs at the same settings for 64-bit mode.  Some of these differences may show up when you set /Qopt-report:4 which you should have read about in your forum search.

4) It seems unlikely that anyone would want to run in 32-bit mode on such a platform.  If you wish to pursue this question, you may need to explain your reasons.

5) Your statement about 50Mb seems ambiguous.  If you do mean bits, that seems rather small.  How do you get this figure?
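
To put point 2 in rough numbers, here is a minimal Python sketch of the address space that worker-thread stacks would reserve. It assumes the default 2 GB of user address space for a 32-bit Windows process, and that OMP_STACKSIZE applies only to the threads the OpenMP runtime creates (the initial thread's stack comes from the link stack setting). The function name and the constant are made up for illustration; this is arithmetic, not a measurement.

```python
# Rough address-space arithmetic for OpenMP worker stacks (a sketch, not a measurement).

# Assumption: a 32-bit Windows process has 2 GB of user address space by default.
USER_ADDRESS_SPACE_MB = 2048


def worker_stack_total_mb(num_threads, stacksize_mb):
    """Address space reserved for worker-thread stacks.

    The initial thread is not counted: its stack size comes from the
    link-time stack setting, not from OMP_STACKSIZE.
    """
    return (num_threads - 1) * stacksize_mb


for num_threads in (20, 40):
    for stacksize_mb in (2, 4, 16):
        total = worker_stack_total_mb(num_threads, stacksize_mb)
        share = 100.0 * total / USER_ADDRESS_SPACE_MB
        print(f"{num_threads} threads x {stacksize_mb} MB -> "
              f"{total} MB ({share:.1f}% of 2 GB)")
```

With the 2MB default the reservation is small even for 40 threads; it only becomes a noticeable share of a 32-bit address space once OMP_STACKSIZE is raised substantially.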

Windyman
Beginner

Dear Tim,
Thanks a lot for your comments! I have tried your suggestions:

1) I did not set OMP_NUM_THREADS because I assumed it would default to the available cores. This works for the 64-bit build but not for the 32-bit one. Setting it to 40 (which is what the 64-bit build uses) does force 40 threads in the 32-bit build as well, but the CPU usage is still very low on 1 of the 2 nodes (hence no performance improvement). Setting OMP_PLACES to "cores" did not seem to change much.
2) Setting OMP_STACKSIZE to 16M does not change anything unfortunately (I did verify the value of the environment variable using an echo command).
3) Activating the /Qopt-report:4 option during linking did not reveal any differences between 32 and 64 bit for the OpenMP part. For both compilations "DEFINED REGION WAS PARALLELIZED" is reported.
4) The reason for wanting to run 32-bit is that in the future I will have to compile the program as a library (DLL), because it has to be coupled to an external program which is unfortunately only available in 32-bit.
5) Sorry, I meant MB. So if I look in the task manager, the memory usage is approx. 50MB.

So I am still puzzled. Should I disable Hyper-Threading? I just noticed that the environment variables also include NUMBER_OF_PROCESSORS, which is set to only 20 (I tried changing it to 40, but the change did not take effect, as verified with an echo command).

Best regards

PS I could not find a way to move my post to the Fortran forum.

McCalpinJohn
Honored Contributor III

The "OMP_PLACES" environment variable is a relatively recent addition.  You should be able to control thread placement using the legacy "KMP_AFFINITY" environment variable.   There are several different ways to use KMP_AFFINITY to get the threads distributed across the two sockets, but I recommend a simple start:

KMP_AFFINITY=verbose,scatter

With the "verbose" option, when you run the job and it reaches its first parallel region, the OpenMP runtime will print out a full listing of where all of the logical processors are located (i.e., the socket, core, and thread context), followed by a full listing of the OpenMP threads and which logical processor(s) they are bound to.

With the "scatter" option, the threads will be spread as far apart as possible -- alternating between sockets, then interleaving across cores, then repeating the pattern but using the second thread context on each core.  E.g., for OMP_NUM_THREADS=20, you should get one thread on each core in each socket. 

The primary alternative to "scatter" is "compact", which reverses all three levels of the interleaving -- first alternate thread contexts, then cores, and finally sockets.  In this case for OMP_NUM_THREADS=20, you should get one thread on each logical processor in the first socket, and nothing in the second socket.

Once you are satisfied with the behavior you can remove the "verbose" clause.

There are lots of other approaches, but the "verbose" option to KMP_AFFINITY is the best way to be sure what the runtime library is doing.
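
As an illustration only, the two orderings described above can be modelled in a few lines of Python for a machine with 2 sockets x 10 cores x 2 thread contexts. This is a toy model of the description in this post, not the algorithm the OpenMP runtime actually uses, and the function names are invented for the example.

```python
# Toy model of "scatter" vs "compact" placement as described in this post.
# Each logical processor is a tuple (socket, core, thread_context).
SOCKETS, CORES, THREADS = 2, 10, 2


def scatter_order():
    # Socket changes fastest, then core, then thread context.
    return [(s, c, t)
            for t in range(THREADS)
            for c in range(CORES)
            for s in range(SOCKETS)]


def compact_order():
    # Thread context changes fastest, then core, then socket.
    return [(s, c, t)
            for s in range(SOCKETS)
            for c in range(CORES)
            for t in range(THREADS)]


def place(order, num_threads):
    """Logical processors assigned to OpenMP threads 0..num_threads-1."""
    return order[:num_threads]


for name, order in (("scatter", scatter_order()), ("compact", compact_order())):
    placed = place(order, 20)
    sockets_used = sorted({s for s, _, _ in placed})
    cores_used = len({(s, c) for s, c, _ in placed})
    print(f"{name}: sockets used {sockets_used}, distinct cores {cores_used}")
```

For 20 threads the model gives one thread per core across both sockets under "scatter", and both thread contexts of every core in the first socket under "compact", matching the description above.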

 

Windyman
Beginner

Dear John,
Thanks very much for your suggestion. I have entered the KMP_AFFINITY variable as suggested. Unfortunately the scatter option did not give the desired result. However I was able to diagnose using the verbose clause:

For win32:
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
OMP: Info #156: KMP_AFFINITY: 20 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 10 cores/pkg x 2 threads/core (10 tot

For x64:
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
OMP: Info #156: KMP_AFFINITY: 40 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 2 threads/core (20 tot

I suspect that for win32 the 2nd package is simply not available. So my hypothesis is that, whatever I do, my program simply cannot access it in 32-bit. But possibly I can change a software (Windows?) setting elsewhere to get what I want (respecting the initial OS proc set, yes or no?). Or is that a different forum again? Suggestions welcome from you black-belters out there.

McCalpinJohn
Honored Contributor III

Interesting results....

Notice that in the win32 case the output of KMP_AFFINITY includes "Initial OS proc set respected", while for the 64-bit version the KMP_AFFINITY message is "Initial OS proc set not respected".    I don't know why there should be a difference, but you can override the behavior in the win32 case by adding the "norespect" option to KMP_AFFINITY:

KMP_AFFINITY=verbose,scatter,norespect

This may not help (there may be some other reason that win32 only sees one socket), but it is a quick and easy test....

Windyman
Beginner

This is the result for setting norespect:

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
OMP: Info #156: KMP_AFFINITY: 20 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 10 cores/pkg x 2 threads/core (10 total cores)

Indeed, the proc set is now not respected, but it does not have the desired effect.
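
For comparing runs like these side by side, a small Python sketch along the following lines could pull the key numbers out of the verbose output. The sample strings are copied from the logs in this thread; the function name `summarize` and the dictionary keys are hypothetical, and the regular expressions assume the message wording shown here, which may differ in other runtime versions.

```python
import re

# Sample lines copied from the verbose output posted in this thread.
WIN32_LOG = """\
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,
OMP: Info #156: KMP_AFFINITY: 20 available OS procs
OMP: Info #179: KMP_AFFINITY: 1 packages x 10 cores/pkg x 2 threads/core (10 tot
"""

X64_LOG = """\
OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: {0,1,2,3,4,5,6,
OMP: Info #156: KMP_AFFINITY: 40 available OS procs
OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 2 threads/core (20 tot
"""


def summarize(verbose_log):
    """Extract processor counts and topology from KMP_AFFINITY verbose text."""
    procs = re.search(r"(\d+) available OS procs", verbose_log)
    topo = re.search(r"(\d+) packages x (\d+) cores/pkg x (\d+) threads/core",
                     verbose_log)
    respect = re.search(r"Initial OS proc set (not )?respected", verbose_log)
    packages, cores, threads = (int(g) for g in topo.groups())
    return {
        "available_procs": int(procs.group(1)),
        "packages": packages,
        "logical_from_topology": packages * cores * threads,
        "respected": respect.group(1) is None,
    }


for label, log in (("win32", WIN32_LOG), ("x64", X64_LOG)):
    print(label, summarize(log))
```

In both builds the number of available OS procs equals packages x cores x threads, so the 32-bit runtime is reporting a consistent single-package view rather than miscounting within a socket.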

McCalpinJohn
Honored Contributor III

Definitely looks like a Windows problem.  Hard to tell if it is a bug or a feature, but one would hope that an OS as recent as Windows Server 2012 would know how to handle multi-socket systems.... 
