Hi to everyone,
I am working on a Monte Carlo code written in Fortran 77, with the goal of parallelizing it using OpenMP. I am now in the testing phase of the development, but I am facing problems with the overhead cost of the code. For example, when I analyze it using VTune Amplifier XE I obtain the following summary:
Elapsed Time: 43.352s
Total Thread Count: 5
Overhead Time: 16.560s
Spin Time: 0.847s
CPU Time: 157.369s
Paused Time: 0s
Well, the tool complains that the overhead time is too high. What is worse, I have tested this code against gfortran and these effects are less pronounced with the latter. This is unfortunate, because without parallelization the code compiled with ifort is much faster than with gfortran, but as I increase the number of OMP threads (keeping the load per thread constant) the overhead costs make the ifort version slower than the gfortran one.
What I have found is that the threads get "stalled" in a very disorderly fashion; you can see this in the image below.
The code has several subroutines that control the whole Monte Carlo simulation process (for example, random number generation, electron and photon transport, geometry description, etc.). These subroutines communicate with each other through COMMON blocks, so I have had to mark some of them as private using the THREADPRIVATE directive where needed. The idea is to keep the original structure of the code as much as possible, considering that this is a widely used code, and to offer an easy transition to parallelization with OpenMP without changing the core of the program.
I have created a small code that runs only the random number generator and uses it to estimate the value of pi. This code shows the same problem as the original one. In it I found that the function __kmp_get_global_thread_id_reg accounts for a large part of the overhead time.
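For context, that pi test was structured roughly like the minimal sketch below. This is not the actual code: it uses the RANDOM_NUMBER intrinsic instead of the RANMAR generator, and all names are illustrative.

      program pitest
      implicit none
      integer i, n, hits
      real*8 x, y, piest
      n = 100000000
      hits = 0
C     each thread accumulates its own partial count via the reduction
C$OMP PARALLEL DO PRIVATE(x,y) REDUCTION(+:hits)
      do i = 1, n
         call random_number(x)
         call random_number(y)
         if (x*x + y*y .le. 1.0d0) hits = hits + 1
      end do
C$OMP END PARALLEL DO
      piest = 4.0d0*dble(hits)/dble(n)
      write(*,*) 'pi estimate = ', piest
      end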
Well, I would really appreciate it if someone has a tip for dealing with this problem. I have tried to find information about it without success. Thanks for your help!!
I made a mistake attaching the images; here are the correct ones
and the second
A few things:
1) This may be immaterial: why do you have "0" inserted in your C$OMP0... directives?
2) C$OMP0DO SCHEDULE(dynamic,25000000)
DYNAMIC
Can be used to get a set of iterations dynamically. It defaults to 1 unless chunk is specified.
If chunk is specified, the iterations are broken into pieces of a size specified by chunk. As each thread finishes a piece of the iteration space, it dynamically gets the next set of iterations.
So unless your ncase approaches 25000000 * number of threads or more, work will be imbalanced.
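To illustrate with a generic, self-contained sketch (not the poster's code; the loop body is just a stand-in for one MC history), a chunk around n/(nthreads*10) or smaller leaves enough pieces for fast threads to pick up extra work:

      program dynsched
      implicit none
      integer i, n, chunk, nthreads
      real*8 s
      integer omp_get_max_threads
      n = 1000000
      nthreads = omp_get_max_threads()
C     chunk well below n/nthreads lets fast threads grab extra pieces
      chunk = max(1, n / (nthreads * 10))
      s = 0.0d0
C$OMP PARALLEL DO SCHEDULE(dynamic, chunk) REDUCTION(+:s)
      do i = 1, n
         s = s + 1.0d0/dble(i)
      end do
C$OMP END PARALLEL DO
      write(*,*) 'sum = ', s
      end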
Jim Dempsey
Thanks for your answer, Jim. Now the answers:
1) The reason is a little obscure. The original MC code is actually written in MORTRAN and then pre-compiled to Fortran. I managed to create different MACROs to handle the OpenMP directives, but for some reason my version of MORTRAN cannot accept an empty space as an argument in a MACRO call (the idea was to be able to add a * in case I need to break a line containing an OpenMP directive). Therefore, the "best" solution at the time was to use a "0" instead of an empty space.
2) The chunk size is calculated as NCASE / (# OMP THREADS * 10). The original program uses that expression to divide the work among threads. Maybe I could choose a smaller chunk, but from the timeline it seems the main source of overhead/spin time is this strange peak pattern...
Looking at your VTune charts, it appears that only 2 threads out of 4 are doing any work, and that their work is not occurring at the same time (most of the time). This indicates that the amount of work per icase varies drastically, causing some threads to wait a long time at the END DOs. It looks like you are using your own random number generator, so I do not think you have any critical sections, except possibly the reduction, which should not (with 4 threads) exhibit that much overhead.
BTW, why not simply use C$OMP0CRITICAL or C$OMP0ATOMIC to perform hits = hits + count?
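In the pi test that would look roughly like this (a fragment, assuming 'count' is the thread's private tally and 'hits' the shared total):

C     merge this thread's tally into the shared counter without a lock
C$OMP ATOMIC
      hits = hits + count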
Try removing the SCHEDULE as a test to see what happens. IOW see if you can get a base configuration that works, then optimize (as opposed to starting with something that apparently is not working and then backing into something that works).
Jim Dempsey
Another thing that may shed light: just inside C$OMP0PARALLEL insert
      t0 = omp_get_wtime() ! t0 and t1 inside thread private common /score/
and just prior to the exit of C$OMP0PARALLEL
      t1 = omp_get_wtime()
C$OMP0CRITICAL
      print *,"runtime ",omp_get_thread_num(), t1-t0
C$OMP0END CRITICAL
This may show you a discrepancy of work between threads. Which if there is, you can then find out why.
Jim Dempsey
Hi Jim,
I measured the work time of each thread as you suggested, but I found only a minimal difference between them. I also followed your advice and used an atomic operation to record the total number of hits, thanks for the tip!
Experimenting with a different RNG (Ziggurat - http://people.sc.fsu.edu/~jburkardt/f77_src/ziggurat_openmp/ziggurat_openmp.html), I plugged it into the example code that I gave above and it did not show this problem, so at that moment it seemed that the problem was the RANMAR RNG.
However, by mistake I ran an Amplifier concurrency test without the recommended compilation flags stated in the user manual (just Release mode) and the problem disappeared with the original generator. It is weird: at first only the RANMAR RNG suffered from this problem and the Ziggurat OMP version did not, but running the test with Release compilation options removed the problem... strange, isn't it?
Anyway, at least I found another RNG to play with... I attach the code in case you want to test it... thanks for your help!
In taking a quick look at your attached .f file it appears that all threads will be using the same sequence of random numbers.
jsr = 123456789
...
seed = shr3( jsr )
Consider using
seed = shr3( ixor(jsr, (omp_iam +1) * 101010101))
Or something like that.
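A sketch of that idea using the standard IEOR intrinsic (omp_iam is assumed to hold the value of omp_get_thread_num(); the multiplier is arbitrary, and shr3 is the generator function from the attached file):

      integer jsr, seed, omp_iam, shr3
      integer omp_get_thread_num
      omp_iam = omp_get_thread_num()
      jsr = 123456789
C     decorrelate the streams by giving each thread a distinct state
      seed = shr3( ieor(jsr, (omp_iam + 1) * 101010101) )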
Jim Dempsey
Oops, I did not realize that, thanks for finding that error ;)
Well, I still have overhead problems with the Monte Carlo code. Looking at the Amplifier concurrency test results, I found that most of the overhead occurs when some functions of the MC code call this function: __kmp_get_global_thread_id_reg
Inside these functions the only OMP statements are of the THREADPRIVATE type, for example for AUSGAB:
      subroutine ausgab(iarg)
      implicit none
      integer*4 iarg,irl,jp
      real*8 aux
      common/score/ sc_array(102),sc_array2(102),sc_tmp(102),
     * sc_last(102),sc_pulse_height(200,102),icase,ipulse,de_pulse
      real*8 sc_array,sc_array2,sc_pulse_height,sc_tmp,de_pulse
      integer*4 sc_last,icase,ipulse
C$OMP0THREADPRIVATE(/score/)
      COMMON/EPCONT/EDEP,TSTEP,TUSTEP,USTEP,TVSTEP,VSTEP,RHOF,EOLD,
     * ENEW,EKE,ELKE,GLE,E_RANGE,x_final,y_final,z_final,
     * u_final,v_final,w_final,IDISC,IROLD,IRNEW,IAUSFL(31)
C$OMP0THREADPRIVATE(/EPCONT/)
      DOUBLE PRECISION EDEP
      real*8 TSTEP,TUSTEP,USTEP,VSTEP,TVSTEP,RHOF,EOLD,ENEW,EKE,ELKE,
     * GLE,E_RANGE,x_final,y_final,z_final,u_final,v_final,w_final
      integer*4 IDISC,IROLD,IRNEW,IAUSFL
      COMMON/STACK/ E(50),X(50),Y(50),Z(50),U(50),V(50),W(50),
     * DNEAR(50),WT(50),IQ(50),IR(50),LATCH(50),LATCHI,NP,NPold
C$OMP0THREADPRIVATE(/STACK/)
      DOUBLE PRECISION E
      real*8 X,Y,Z,U,V,W,DNEAR,WT
      integer*4 IQ,IR,LATCH,LATCHI,NP,NPold
Can the THREADPRIVATE clause really be so costly for the program? The strange thing is that now I do not see this behavior in the example codes that I gave above, which also have THREADPRIVATE clauses. Well, I suppose that porting legacy codes is not so simple... Finally, where could I find information about __kmp_get_global_thread_id_reg (or other OMP library functions)? Thanks for your help!
The good news about your chart is that the work is now evenly distributed across time.
RE: __kmp_get_global_thread_id_reg
The __kmp_get_global_thread_id_reg function may be used to dereference thread private data. The Thread Local Storage should be using the FS or GS selector as a context reference rather than a function call. But things may have changed. What does the disassembly show for accessing a TLS variable?
Well, now I am a little bit lost (I am far from being an expert programmer... I should have taken that "advanced programming" course when I was an undergraduate student!).
Using the disassembly window of VS2013 (http://msdn.microsoft.com/en-us/library/a3cwf295.aspx) for the following piece of code I obtain the output below (the COMMON variables labeled with THREADPRIVATE are sc_tmp, irl and edep).
Fortran code:
      IF (( iarg .LT. 5 )) THEN
        irl = ir(np)
        IF (( icase .EQ. sc_last(irl) )) THEN
          sc_tmp(irl) = sc_tmp(irl) + edep   ! -> I stopped here
        ELSE
          aux = sc_tmp(irl)
          sc_array(irl) = sc_array(irl) + aux
          sc_array2(irl) = sc_array2(irl) + aux*aux
          sc_tmp(irl) = edep
          sc_last(irl) = icase
Disassembly Window:
IF (( iarg .LT. 5 )) THEN 00131D2F mov eax,dword ptr [IARG] 00131D32 mov eax,dword ptr [eax] 00131D34 cmp eax,5 00131D37 jge AUSGAB+0D18h (01328A4h) irl = ir(np) 00131D3D mov eax,106Ch 00131D42 add eax,dword ptr [NPOLD] 00131D45 mov eax,dword ptr [eax] 00131D47 test eax,eax 00131D49 jg AUSGAB+20Fh (0131D9Bh) 00131D4B add esp,0FFFFFFE0h 00131D4E mov dword ptr [esp],10100003h 00131D55 mov dword ptr [esp+4],2D75A0h irl = ir(np) 00131D5D mov dword ptr [esp+8],5 00131D65 mov dword ptr [esp+0Ch],3 00131D6D mov dword ptr [esp+10h],1 00131D75 mov dword ptr [esp+14h],2D7688h 00131D7D mov eax,106Ch 00131D82 add eax,dword ptr [NPOLD] 00131D85 mov eax,dword ptr [eax] 00131D87 mov dword ptr [esp+18h],eax 00131D8B mov dword ptr [esp+1Ch],1 00131D93 call _for_emit_diagnostic (02D3680h) 00131D98 add esp,20h 00131D9B mov eax,106Ch 00131DA0 add eax,dword ptr [NPOLD] 00131DA3 mov eax,dword ptr [eax] 00131DA5 cmp eax,32h 00131DA8 jle AUSGAB+26Eh (0131DFAh) 00131DAA add esp,0FFFFFFE0h 00131DAD mov dword ptr [esp],10100002h 00131DB4 mov dword ptr [esp+4],2D7620h 00131DBC mov dword ptr [esp+8],5 00131DC4 mov dword ptr [esp+0Ch],2 00131DCC mov dword ptr [esp+10h],1 00131DD4 mov dword ptr [esp+14h],2D768Ch 00131DDC mov eax,106Ch 00131DE1 add eax,dword ptr [NPOLD] 00131DE4 mov eax,dword ptr [eax] 00131DE6 mov dword ptr [esp+18h],eax 00131DEA mov dword ptr [esp+1Ch],32h 00131DF2 call _for_emit_diagnostic (02D3680h) 00131DF7 add esp,20h 00131DFA mov eax,0ED8h 00131DFF add eax,dword ptr [NPOLD] 00131E02 mov edx,106Ch 00131E07 add edx,dword ptr [NPOLD] 00131E0A mov edx,dword ptr [edx] 00131E0C imul edx,edx,4 00131E0F add eax,edx 00131E11 add eax,0FFFFFFFCh 00131E14 mov eax,dword ptr [eax] 00131E16 mov dword ptr [IRL],eax IF (( icase .EQ. sc_last(irl) )) THEN 00131E19 mov eax,dword ptr [IRL] 00131E1C test eax,eax 00131E1E jg AUSGAB+2DDh (0131E69h) 00131E20 add esp,0FFFFFFE0h 00131E23 mov dword ptr [esp],10100003h 00131E2A mov dword ptr [esp+4],2D75A0h 00131E32 mov dword ptr [esp+8],5 00131E3A mov dword ptr [esp+0Ch],3 00131E42 mov dword ptr [esp+10h],1 00131E4A mov dword ptr [esp+14h],2D7690h 00131E52 mov eax,dword ptr [IRL] 00131E55 mov dword ptr [esp+18h],eax 00131E59 mov dword ptr [esp+1Ch],1 00131E61 call _for_emit_diagnostic (02D3680h) 00131E66 add esp,20h 00131E69 mov eax,dword ptr [IRL] 00131E6C cmp eax,66h 00131E6F jle AUSGAB+32Eh (0131EBAh) 00131E71 add esp,0FFFFFFE0h 00131E74 mov dword ptr [esp],10100002h 00131E7B mov dword ptr [esp+4],2D7620h 00131E83 mov dword ptr [esp+8],5 00131E8B mov dword ptr [esp+0Ch],2 00131E93 mov dword ptr [esp+10h],1 00131E9B mov dword ptr [esp+14h],2D769Ch 00131EA3 mov eax,dword ptr [IRL] 00131EA6 mov dword ptr [esp+18h],eax 00131EAA mov dword ptr [esp+1Ch],66h 00131EB2 call _for_emit_diagnostic (02D3680h) 00131EB7 add esp,20h 00131EBA mov eax,288A8h 00131EBF add eax,dword ptr [DE_PULSE] 00131EC2 mov eax,dword ptr [eax] 00131EC4 mov edx,990h 00131EC9 add edx,dword ptr [DE_PULSE] 00131ECC mov ecx,dword ptr [IRL] 00131ECF imul ecx,ecx,4 00131ED2 add edx,ecx 00131ED4 add edx,0FFFFFFFCh 00131ED7 mov edx,dword ptr [edx] 00131ED9 cmp eax,edx 00131EDB jne AUSGAB+4D5h (0132061h) sc_tmp(irl) = sc_tmp(irl) + edep 00131EE1 mov eax,dword ptr [IRL] 00131EE4 test eax,eax 00131EE6 jg AUSGAB+3A5h (0131F31h) 00131EE8 add esp,0FFFFFFE0h 00131EEB mov dword ptr [esp],10100003h 00131EF2 mov dword ptr [esp+4],2D75A0h 00131EFA mov dword ptr [esp+8],5 00131F02 mov dword ptr [esp+0Ch],3 00131F0A mov dword ptr [esp+10h],1 00131F12 mov dword ptr [esp+14h],2D76A8h 00131F1A mov eax,dword ptr [IRL] 00131F1D mov 
dword ptr [esp+18h],eax 00131F21 mov dword ptr [esp+1Ch],1 00131F29 call _for_emit_diagnostic (02D3680h) 00131F2E add esp,20h 00131F31 mov eax,dword ptr [IRL] 00131F34 cmp eax,66h 00131F37 jle AUSGAB+3F6h (0131F82h) 00131F39 add esp,0FFFFFFE0h 00131F3C mov dword ptr [esp],10100002h 00131F43 mov dword ptr [esp+4],2D7620h 00131F4B mov dword ptr [esp+8],5 00131F53 mov dword ptr [esp+0Ch],2 00131F5B mov dword ptr [esp+10h],1 00131F63 mov dword ptr [esp+14h],2D76B0h 00131F6B mov eax,dword ptr [IRL] 00131F6E mov dword ptr [esp+18h],eax 00131F72 mov dword ptr [esp+1Ch],66h 00131F7A call _for_emit_diagnostic (02D3680h) 00131F7F add esp,20h 00131F82 mov eax,dword ptr [IRL] 00131F85 test eax,eax 00131F87 jg AUSGAB+446h (0131FD2h) 00131F89 add esp,0FFFFFFE0h 00131F8C mov dword ptr [esp],10100003h 00131F93 mov dword ptr [esp+4],2D75A0h 00131F9B mov dword ptr [esp+8],5 00131FA3 mov dword ptr [esp+0Ch],3 00131FAB mov dword ptr [esp+10h],1 00131FB3 mov dword ptr [esp+14h],2D76B8h 00131FBB mov eax,dword ptr [IRL] 00131FBE mov dword ptr [esp+18h],eax 00131FC2 mov dword ptr [esp+1Ch],1 00131FCA call _for_emit_diagnostic (02D3680h) 00131FCF add esp,20h 00131FD2 mov eax,dword ptr [IRL] 00131FD5 cmp eax,66h 00131FD8 jle AUSGAB+497h (0132023h) 00131FDA add esp,0FFFFFFE0h 00131FDD mov dword ptr [esp],10100002h 00131FE4 mov dword ptr [esp+4],2D7620h 00131FEC mov dword ptr [esp+8],5 00131FF4 mov dword ptr [esp+0Ch],2 00131FFC mov dword ptr [esp+10h],1 00132004 mov dword ptr [esp+14h],2D76C0h 0013200C mov eax,dword ptr [IRL] 0013200F mov dword ptr [esp+18h],eax 00132013 mov dword ptr [esp+1Ch],66h 0013201B call _for_emit_diagnostic (02D3680h) 00132020 add esp,20h 00132023 mov eax,660h 00132028 add eax,dword ptr [DE_PULSE] 0013202B mov edx,dword ptr [IRL] 0013202E imul edx,edx,8 00132031 add eax,edx 00132033 add eax,0FFFFFFF8h end 00132036 mov edx,dword ptr [IAUSFL] sc_tmp(irl) = sc_tmp(irl) + edep -> here the debugger stopped 00132039 movsd xmm0,mmword ptr [eax] 0013203D movsd xmm1,mmword ptr [edx] 00132041 addsd xmm0,xmm1 00132045 mov eax,660h 0013204A add eax,dword ptr [DE_PULSE] 0013204D mov edx,dword ptr [IRL] 00132050 imul edx,edx,8 00132053 add eax,edx 00132055 add eax,0FFFFFFF8h 00132058 movsd mmword ptr [eax],xmm0 0013205C jmp AUSGAB+0D18h (01328A4h)
[...]
Thanks for your help!
Are you profiling a build with runtime debugging checks enabled?
Notice the calls to _for_emit_diagnostic. These are generated at the request of a compiler option (per IanH's suggestion about runtime checks).
When tuning a Release build you should remove the profiling and any other runtime checks (e.g. index out of range, uninitialized variable, etc.). The emit-diagnostic call sounds as if the diagnostic data is buffered in a thread-safe manner, IOW using a critical section, and then eventually written, also in a critical section. Therefore this has not only the overhead of whatever the diagnostic is doing, but also a serializing effect.
VTune may be adding options that cause the _for_emit_diagnostic calls to be inserted.
Jim Dempsey
Thanks to both of you for your comments. Well, to obtain the assembly code I had to use Debug mode (as suggested by the link).
Using just Release mode I obtain a similar pattern to the pictures shown above. Looking at the compilation options of the Release mode I did not find any runtime checks or diagnostics enabled; in the Command Line section the following appears:
/nologo /O2 /fpp /Qopenmp /module:"Release\\" /object:"Release\\" /Fd"Release\vc120.pdb" /libs:dll /threads /c
Well, just to be sure I decided to compile the program outside VS2013 (i.e. using the command prompt) with the following flags:
/fpp /Qopenmp /O3 /fast
and then I ran a concurrency analysis from the command prompt. I obtained the following results:
so now it seems that the overhead has decreased a lot compared with the previous results. There is still some overhead, but now I feel more satisfied with the results. Anyway, this program also has many small subroutines where I need to use THREADPRIVATE clauses (for example, subroutine SHOWER calls PHOTON, which calls BREMS, which calls UPHI and then AUSGAB, etc., each of them having their own THREADPRIVATE declarations), so I think that this structure is also a major contributor to the total overhead cost.
Thanks for your help!
For profiling you want debug information (so the profiler knows how each executed machine instruction corresponds to the source code); the compiler option you want is /debug:all, and if you are compiling and linking separately you also need to tell the linker to emit debug information with the linker /debug option. But you definitely do not want runtime debugging checks (the compiler option you want is /check:none), because, as Jim suggests, they introduce considerable overhead and perhaps otherwise unnecessary synchronization between threads.
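Combining that with the options used earlier in this thread, the compile line would look something like this (source file name hypothetical):

ifort /fpp /Qopenmp /O3 /debug:all /check:none mc_code.f /link /debug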
The threadprivate is not introducing appreciable overhead. In your disassembly code, most of the code was for bounds checking of arrays.
At line 173, as an example, you had
add eax,dword ptr [DE_PULSE]
What this really was is the +edep; however, the disassembly process in VTune did not take the segment override prefix into consideration. The offset of edep from the FS or GS selector, as the case may be, happens to be the same as the offset of DE_PULSE from DS/ES/CS.
If there is an option to show the instruction bytes, you would have seen the segment override prefix. The above add defaults to the DS selector, but the instruction sequence appears to contain a segment override prefix (as evidenced by the wrong symbolic address in the disassembly).
Look for factors other than threadprivate.
Jim Dempsey
Thanks for the comments.
I compiled with the /fpp /Qopenmp /O3 /fast /debug:all /check:none flags and obtained the following.
If I double-click on any subroutine like UPHI, MSCAT, etc., I jump to the declaration of that subroutine (subroutine UPHI ...). If I double-click on __kmp_get_global_thread_id_reg, it shows what seems to be assembly code of libiomp5md.dll.
Well, I will study the code more, but I have made great progress thanks to you! I really appreciate your help ;)
In looking at your first image, and without seeing your code, I will make the following assumption:
It looks as if you are using too fine a grain of parallelization, meaning parallel regions are entered very frequently, with each parallel region doing little work. If this is the case, see if you can reorganize your code so that you raise the nesting level at which you enter your parallel region.
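Schematically, with placeholder names (do_work, nbatch and nwork are illustrative, not the poster's routines):

C     fine-grained: a parallel region is created and torn down per batch
      do ibatch = 1, nbatch
C$OMP PARALLEL DO
         do i = 1, nwork
            call do_work(ibatch, i)
         end do
C$OMP END PARALLEL DO
      end do

C     coarser: one region for the whole run, fork/join overhead paid once
C$OMP PARALLEL DO
      do ibatch = 1, nbatch
         do i = 1, nwork
            call do_work(ibatch, i)
         end do
      end do
C$OMP END PARALLEL DO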
As a side thought, are you by chance using nested parallel regions? (possibly unintentionally)
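One quick way to check is to drop a test like the following inside the existing parallel region (a sketch; omp_get_level requires OpenMP 3.0 or later):

      integer omp_get_level
C     inside the existing C$OMP0PARALLEL region:
      if (omp_get_level() .gt. 1) then
         write(6,*) 'nested region! level = ', omp_get_level()
      end if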
Jim Dempsey
Well, Jim, as far as I know I do not have nested parallel regions. Is there a way to have (unintentionally) nested regions without explicitly using the PARALLEL directive? I only have one parallel region:
C$OMP0PARALLEL PRIVATE(OMP_IAM) COPYIN(/egs_vr/,/EPCONT/)
C$OMP*COPYIN(/ET_control/,/PHOTIN/,/randomm/,/UPHIOT/,/USEFUL/,/score/)
C$OMP0SINGLE
#ifdef _OPENMP
      OMP_TOT = OMP_GET_NUM_THREADS()
      WRITE(6,1060)OMP_TOT
 1060 FORMAT(//'Number of OpenMP threads = ',I2//)
C$OMP0END SINGLE NOWAIT
#endif
#ifdef _OPENMP
      OMP_IAM = OMP_GET_THREAD_NUM()
#endif
      ixx = 1802
      jxx = 9373 + OMP_IAM
      call init_ranmar
C$OMP0DO SCHEDULE(static)
      DO 1071 icase_aux=1,ncase
        icase = icase_aux
        call shower(iqin,ein,xin,yin,zin,uin,vin,win,irin,wtin)
 1071 CONTINUE
 1072 CONTINUE
C$OMP0CRITICAL
      DO 1081 I=1,102
        r_sc_array(I) = r_sc_array(I) + sc_array(I)
        r_sc_array2(I) = r_sc_array2(I) + sc_array2(I)
        r_sc_tmp(I) = r_sc_tmp(I) + sc_tmp(I)
        DO 1091 j=1,200
          r_sc_pulse_height(j,i) = r_sc_pulse_height(j,i) +
     *      sc_pulse_height(j,i)
 1091   CONTINUE
 1092   CONTINUE
 1081 CONTINUE
 1082 CONTINUE
C$OMP0END CRITICAL
C$OMP0END PARALLEL
The shower subroutine takes almost all of the runtime of the program. I thought about THREADPRIVATE (or other clauses) because this program is composed of many small subroutines, and for each loop iteration the program enters and exits these subroutines many, many times. The other strange thing is that the overhead is much lower with gfortran; for example, running this code on an AMD FX8350 (8 cores) CPU under Debian 7.5 I obtained the following graph when increasing the number of OMP threads (the number of particles is fixed):
(BQS stands for batch queuing system, which is the original parallelization method.) One can see that at the beginning the ifort performance is much better than gfortran, but as the number of OMP threads increases the runtimes start to converge... the compilation flags were:
gfortran : -cpp -fopenmp -O3 -ffast-math
ifort: -fpp -qopenmp -O3 -fast -check none (the last just in case)
At the beginning of this thread the problems were much worse, but I still have this ifort vs. gfortran performance issue... thanks for your help!
Scaling is getting nearly flat after 4 threads. You wouldn't by chance be calling functions with critical sections: a non-multi-threaded random number generator, ALLOCATE/DEALLOCATE, file I/O, etc.?
A second cause for flattening is poor cache localization.
A third cause is you reach a memory bandwidth limitation.
All of these causes can usually be corrected with better programming techniques.
Jim Dempsey
