Hi to everyone,
I am working on a Monte Carlo code written in Fortran 77, with the goal of parallelizing it using OpenMP. I am now in the testing phase of the development, but I am facing problems with the overhead cost of the code. For example, when I analyze it using VTune Amplifier XE I obtain the following summary:
Elapsed Time: 43.352s
Total Thread Count: 5
Overhead Time: 16.560s
Spin Time: 0.847s
CPU Time: 157.369s
Paused Time: 0s
Well, the tool complains that the overhead time is too high. What is worse, I have tested this code against gfortran and these effects are less pronounced with the latter. This is unfortunate, because without parallelization the code compiled with ifort is much faster than with gfortran, but as I increase the number of OMP threads (keeping the load per thread constant) the overhead costs make the ifort version slower than the gfortran one.
What I have found is that the threads get "stalled" in a very disorderly fashion; you can see this in the image below.
The code has several subroutines that control the whole Monte Carlo simulation process (for example, random number generation, electron and photon transport, geometry description, etc.). These subroutines communicate with each other through COMMON blocks, so I have had to mark some of them as private using the THREADPRIVATE directive where needed. The idea is to keep the original structure of the code as much as possible, considering that this is a widely used code, and to offer an easy transition to parallelization with OpenMP without changing the core of the program.
I have created a small code that runs only the random number generator and uses it to estimate the value of pi. This code shows the same problem as the original one. In it I found that the function __kmp_get_global_thread_id_reg accounts for a large part of the overhead time.
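For context, that pi test was structured roughly like the minimal sketch below. This is not the actual code: it uses the RANDOM_NUMBER intrinsic instead of the RANMAR generator, and all names are illustrative.

      program pitest
      implicit none
      integer i, n, hits
      real*8 x, y, piest
      n = 100000000
      hits = 0
C     each thread accumulates its own partial count via the reduction
C$OMP PARALLEL DO PRIVATE(x,y) REDUCTION(+:hits)
      do i = 1, n
         call random_number(x)
         call random_number(y)
         if (x*x + y*y .le. 1.0d0) hits = hits + 1
      end do
C$OMP END PARALLEL DO
      piest = 4.0d0*dble(hits)/dble(n)
      write(*,*) 'pi estimate = ', piest
      end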
Well, I would really appreciate it if someone has a tip for dealing with this problem. I have tried to find information about it without success. Thanks for your help!!
I made a mistake attaching the images; here are the correct ones
and the second
A few things:
1) This may be immaterial: why do you have "0" inserted in your C$OMP0... directives?
2) C$OMP0DO SCHEDULE(dynamic,25000000)
DYNAMIC
Can be used to get a set of iterations dynamically. It defaults to 1 unless chunk is specified.
If chunk is specified, the iterations are broken into pieces of a size specified by chunk. As each thread finishes a piece of the iteration space, it dynamically gets the next set of iterations.
So unless your ncase approaches 25000000 * number of threads or more, work will be imbalanced.
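To illustrate with a generic, self-contained sketch (not the poster's code; the loop body is just a stand-in for one MC history), a chunk around n/(nthreads*10) or smaller leaves enough pieces for fast threads to pick up extra work:

      program dynsched
      implicit none
      integer i, n, chunk, nthreads
      real*8 s
      integer omp_get_max_threads
      n = 1000000
      nthreads = omp_get_max_threads()
C     chunk well below n/nthreads lets fast threads grab extra pieces
      chunk = max(1, n / (nthreads * 10))
      s = 0.0d0
C$OMP PARALLEL DO SCHEDULE(dynamic, chunk) REDUCTION(+:s)
      do i = 1, n
         s = s + 1.0d0/dble(i)
      end do
C$OMP END PARALLEL DO
      write(*,*) 'sum = ', s
      end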
Jim Dempsey
Thanks for your answer, Jim. Now the answers:
1) The reason is a little obscure. The original MC code is actually written in MORTRAN and then pre-compiled to Fortran. I managed to create different MACROs to handle the OpenMP directives, but for some reason my version of MORTRAN cannot accept an empty space as an argument in a MACRO call (the idea was to be able to add a * in case I need to break a line containing an OpenMP directive). Therefore, the "best" solution at the time was to use a "0" instead of an empty space.
2) The chunk size is calculated as NCASE / (# OMP THREADS * 10). The original program uses that expression to divide the work among threads. Maybe I could choose a smaller chunk, but from the timeline it seems the main source of overhead/spin time is this strange peak pattern...
Looking at your VTune charts, it appears that only 2 threads out of 4 are doing any work, and that their work is not occurring at the same time (most of the time). This indicates that the amount of work per icase varies drastically, causing some threads to wait a long time at the END DOs. It looks like you are using your own random number generator, so I do not think you have any critical sections, except possibly the reduction, which should not (with 4 threads) exhibit that much overhead.
BTW, why not simply use C$OMP0CRITICAL or C$OMP0ATOMIC to perform hits = hits + count?
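In the pi test that would look roughly like this (a fragment, assuming 'count' is the thread's private tally and 'hits' the shared total):

C     merge this thread's tally into the shared counter without a lock
C$OMP ATOMIC
      hits = hits + count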
Try removing the SCHEDULE as a test to see what happens. IOW see if you can get a base configuration that works, then optimize (as opposed to starting with something that apparently is not working and then backing into something that works).
Jim Dempsey
Another thing that may shed light: just inside C$OMP0PARALLEL insert
      t0 = omp_get_wtime() ! t0 and t1 inside thread private common /score/
and just prior to the exit of C$OMP0PARALLEL
      t1 = omp_get_wtime()
C$OMP0CRITICAL
      print *,"runtime ",omp_get_thread_num(), t1-t0
C$OMP0END CRITICAL
This may show you a discrepancy of work between threads. Which if there is, you can then find out why.
Jim Dempsey
Hi Jim,
I measured the work time of each thread as you suggested, but I found only a minimal difference between them. I also followed your advice and used an atomic operation to record the total number of hits, thanks for the tip!
Experimenting with a different RNG (Ziggurat - http://people.sc.fsu.edu/~jburkardt/f77_src/ziggurat_openmp/ziggurat_openmp.html), I plugged it into the example code that I gave above and it did not show this problem, so at that moment it seemed that the problem was the RANMAR RNG.
However, by mistake I ran an Amplifier concurrency test without the recommended compilation flags stated in the user manual (just Release mode) and the problem disappeared with the original generator. It is weird: at first only the RANMAR RNG suffered from this problem and the Ziggurat OMP version did not, but running the test with Release compilation options removed the problem... strange, isn't it?
Anyway, at least I found another RNG to play with... I attach the code in case you want to test it... thanks for your help!
In taking a quick look at your attached .f file it appears that all threads will be using the same sequence of random numbers.
jsr = 123456789
...
seed = shr3( jsr )
Consider using
seed = shr3( ixor(jsr, (omp_iam +1) * 101010101))
Or something like that.
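A sketch of that idea using the standard IEOR intrinsic (omp_iam is assumed to hold the value of omp_get_thread_num(); the multiplier is arbitrary, and shr3 is the generator function from the attached file):

      integer jsr, seed, omp_iam, shr3
      integer omp_get_thread_num
      omp_iam = omp_get_thread_num()
      jsr = 123456789
C     decorrelate the streams by giving each thread a distinct state
      seed = shr3( ieor(jsr, (omp_iam + 1) * 101010101) )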
Jim Dempsey
Oops, I did not realize that, thanks for finding that error ;)
Well, I still have overhead problems with the Monte Carlo code. Looking at the Amplifier concurrency test results, I found that most of the overhead occurs when some functions of the MC code call this function: __kmp_get_global_thread_id_reg
Inside these functions the only OMP statements are of the THREADPRIVATE type, for example for AUSGAB:
      subroutine ausgab(iarg)
      implicit none
      integer*4 iarg,irl,jp
      real*8 aux
      common/score/ sc_array(102),sc_array2(102),sc_tmp(102),
     * sc_last(102),sc_pulse_height(200,102),icase,ipulse,de_pulse
      real*8 sc_array,sc_array2,sc_pulse_height,sc_tmp,de_pulse
      integer*4 sc_last,icase,ipulse
C$OMP0THREADPRIVATE(/score/)
      COMMON/EPCONT/EDEP,TSTEP,TUSTEP,USTEP,TVSTEP,VSTEP,RHOF,EOLD,
     * ENEW,EKE,ELKE,GLE,E_RANGE,x_final,y_final,z_final,
     * u_final,v_final,w_final,IDISC,IROLD,IRNEW,IAUSFL(31)
C$OMP0THREADPRIVATE(/EPCONT/)
      DOUBLE PRECISION EDEP
      real*8 TSTEP,TUSTEP,USTEP,VSTEP,TVSTEP,RHOF,EOLD,ENEW,EKE,ELKE,
     * GLE,E_RANGE,x_final,y_final,z_final,u_final,v_final,w_final
      integer*4 IDISC,IROLD,IRNEW,IAUSFL
      COMMON/STACK/ E(50),X(50),Y(50),Z(50),U(50),V(50),W(50),
     * DNEAR(50),WT(50),IQ(50),IR(50),LATCH(50),LATCHI,NP,NPold
C$OMP0THREADPRIVATE(/STACK/)
      DOUBLE PRECISION E
      real*8 X,Y,Z,U,V,W,DNEAR,WT
      integer*4 IQ,IR,LATCH,LATCHI,NP,NPold
Can the THREADPRIVATE clause really be so costly for the program? The strange thing is that now I do not see this behavior in the example codes that I gave above, which also have THREADPRIVATE clauses. Well, I suppose that porting legacy codes is not so simple... Finally, where could I find information about __kmp_get_global_thread_id_reg (or other OMP library functions)? Thanks for your help!
The good news about your chart is that the work is now evenly distributed across time.
RE: __kmp_get_global_thread_id_reg
The __kmp_get_global_thread_id_reg function may be used to dereference thread private data. The Thread Local Storage should be using the FS or GS selector as a context reference rather than a function call. But things may have changed. What does the disassembly show for accessing a TLS variable?
Well, now I am a little bit lost (I am far from being an expert programmer... I should have taken that "advanced programming" course when I was an undergraduate student!).
Using the disassembly window of VS2013 (http://msdn.microsoft.com/en-us/library/a3cwf295.aspx) for the following piece of code I obtain the output below (the COMMON variables labeled with THREADPRIVATE are sc_tmp, irl and edep).
Fortran code:
      IF (( iarg .LT. 5 )) THEN
        irl = ir(np)
        IF (( icase .EQ. sc_last(irl) )) THEN
          sc_tmp(irl) = sc_tmp(irl) + edep   ! -> I stopped here
        ELSE
          aux = sc_tmp(irl)
          sc_array(irl) = sc_array(irl) + aux
          sc_array2(irl) = sc_array2(irl) + aux*aux
          sc_tmp(irl) = edep
          sc_last(irl) = icase
Disassembly Window:
IF (( iarg .LT. 5 )) THEN 00131D2F mov eax,dword ptr [IARG] 00131D32 mov eax,dword ptr [eax] 00131D34 cmp eax,5 00131D37 jge AUSGAB+0D18h (01328A4h) irl = ir(np) 00131D3D mov eax,106Ch 00131D42 add eax,dword ptr [NPOLD] 00131D45 mov eax,dword ptr [eax] 00131D47 test eax,eax 00131D49 jg AUSGAB+20Fh (0131D9Bh) 00131D4B add esp,0FFFFFFE0h 00131D4E mov dword ptr [esp],10100003h 00131D55 mov dword ptr [esp+4],2D75A0h irl = ir(np) 00131D5D mov dword ptr [esp+8],5 00131D65 mov dword ptr [esp+0Ch],3 00131D6D mov dword ptr [esp+10h],1 00131D75 mov dword ptr [esp+14h],2D7688h 00131D7D mov eax,106Ch 00131D82 add eax,dword ptr [NPOLD] 00131D85 mov eax,dword ptr [eax] 00131D87 mov dword ptr [esp+18h],eax 00131D8B mov dword ptr [esp+1Ch],1 00131D93 call _for_emit_diagnostic (02D3680h) 00131D98 add esp,20h 00131D9B mov eax,106Ch 00131DA0 add eax,dword ptr [NPOLD] 00131DA3 mov eax,dword ptr [eax] 00131DA5 cmp eax,32h 00131DA8 jle AUSGAB+26Eh (0131DFAh) 00131DAA add esp,0FFFFFFE0h 00131DAD mov dword ptr [esp],10100002h 00131DB4 mov dword ptr [esp+4],2D7620h 00131DBC mov dword ptr [esp+8],5 00131DC4 mov dword ptr [esp+0Ch],2 00131DCC mov dword ptr [esp+10h],1 00131DD4 mov dword ptr [esp+14h],2D768Ch 00131DDC mov eax,106Ch 00131DE1 add eax,dword ptr [NPOLD] 00131DE4 mov eax,dword ptr [eax] 00131DE6 mov dword ptr [esp+18h],eax 00131DEA mov dword ptr [esp+1Ch],32h 00131DF2 call _for_emit_diagnostic (02D3680h) 00131DF7 add esp,20h 00131DFA mov eax,0ED8h 00131DFF add eax,dword ptr [NPOLD] 00131E02 mov edx,106Ch 00131E07 add edx,dword ptr [NPOLD] 00131E0A mov edx,dword ptr [edx] 00131E0C imul edx,edx,4 00131E0F add eax,edx 00131E11 add eax,0FFFFFFFCh 00131E14 mov eax,dword ptr [eax] 00131E16 mov dword ptr [IRL],eax IF (( icase .EQ. sc_last(irl) )) THEN 00131E19 mov eax,dword ptr [IRL] 00131E1C test eax,eax 00131E1E jg AUSGAB+2DDh (0131E69h) 00131E20 add esp,0FFFFFFE0h 00131E23 mov dword ptr [esp],10100003h 00131E2A mov dword ptr [esp+4],2D75A0h 00131E32 mov dword ptr [esp+8],5 00131E3A mov dword ptr [esp+0Ch],3 00131E42 mov dword ptr [esp+10h],1 00131E4A mov dword ptr [esp+14h],2D7690h 00131E52 mov eax,dword ptr [IRL] 00131E55 mov dword ptr [esp+18h],eax 00131E59 mov dword ptr [esp+1Ch],1 00131E61 call _for_emit_diagnostic (02D3680h) 00131E66 add esp,20h 00131E69 mov eax,dword ptr [IRL] 00131E6C cmp eax,66h 00131E6F jle AUSGAB+32Eh (0131EBAh) 00131E71 add esp,0FFFFFFE0h 00131E74 mov dword ptr [esp],10100002h 00131E7B mov dword ptr [esp+4],2D7620h 00131E83 mov dword ptr [esp+8],5 00131E8B mov dword ptr [esp+0Ch],2 00131E93 mov dword ptr [esp+10h],1 00131E9B mov dword ptr [esp+14h],2D769Ch 00131EA3 mov eax,dword ptr [IRL] 00131EA6 mov dword ptr [esp+18h],eax 00131EAA mov dword ptr [esp+1Ch],66h 00131EB2 call _for_emit_diagnostic (02D3680h) 00131EB7 add esp,20h 00131EBA mov eax,288A8h 00131EBF add eax,dword ptr [DE_PULSE] 00131EC2 mov eax,dword ptr [eax] 00131EC4 mov edx,990h 00131EC9 add edx,dword ptr [DE_PULSE] 00131ECC mov ecx,dword ptr [IRL] 00131ECF imul ecx,ecx,4 00131ED2 add edx,ecx 00131ED4 add edx,0FFFFFFFCh 00131ED7 mov edx,dword ptr [edx] 00131ED9 cmp eax,edx 00131EDB jne AUSGAB+4D5h (0132061h) sc_tmp(irl) = sc_tmp(irl) + edep 00131EE1 mov eax,dword ptr [IRL] 00131EE4 test eax,eax 00131EE6 jg AUSGAB+3A5h (0131F31h) 00131EE8 add esp,0FFFFFFE0h 00131EEB mov dword ptr [esp],10100003h 00131EF2 mov dword ptr [esp+4],2D75A0h 00131EFA mov dword ptr [esp+8],5 00131F02 mov dword ptr [esp+0Ch],3 00131F0A mov dword ptr [esp+10h],1 00131F12 mov dword ptr [esp+14h],2D76A8h 00131F1A mov eax,dword ptr [IRL] 00131F1D mov 
dword ptr [esp+18h],eax 00131F21 mov dword ptr [esp+1Ch],1 00131F29 call _for_emit_diagnostic (02D3680h) 00131F2E add esp,20h 00131F31 mov eax,dword ptr [IRL] 00131F34 cmp eax,66h 00131F37 jle AUSGAB+3F6h (0131F82h) 00131F39 add esp,0FFFFFFE0h 00131F3C mov dword ptr [esp],10100002h 00131F43 mov dword ptr [esp+4],2D7620h 00131F4B mov dword ptr [esp+8],5 00131F53 mov dword ptr [esp+0Ch],2 00131F5B mov dword ptr [esp+10h],1 00131F63 mov dword ptr [esp+14h],2D76B0h 00131F6B mov eax,dword ptr [IRL] 00131F6E mov dword ptr [esp+18h],eax 00131F72 mov dword ptr [esp+1Ch],66h 00131F7A call _for_emit_diagnostic (02D3680h) 00131F7F add esp,20h 00131F82 mov eax,dword ptr [IRL] 00131F85 test eax,eax 00131F87 jg AUSGAB+446h (0131FD2h) 00131F89 add esp,0FFFFFFE0h 00131F8C mov dword ptr [esp],10100003h 00131F93 mov dword ptr [esp+4],2D75A0h 00131F9B mov dword ptr [esp+8],5 00131FA3 mov dword ptr [esp+0Ch],3 00131FAB mov dword ptr [esp+10h],1 00131FB3 mov dword ptr [esp+14h],2D76B8h 00131FBB mov eax,dword ptr [IRL] 00131FBE mov dword ptr [esp+18h],eax 00131FC2 mov dword ptr [esp+1Ch],1 00131FCA call _for_emit_diagnostic (02D3680h) 00131FCF add esp,20h 00131FD2 mov eax,dword ptr [IRL] 00131FD5 cmp eax,66h 00131FD8 jle AUSGAB+497h (0132023h) 00131FDA add esp,0FFFFFFE0h 00131FDD mov dword ptr [esp],10100002h 00131FE4 mov dword ptr [esp+4],2D7620h 00131FEC mov dword ptr [esp+8],5 00131FF4 mov dword ptr [esp+0Ch],2 00131FFC mov dword ptr [esp+10h],1 00132004 mov dword ptr [esp+14h],2D76C0h 0013200C mov eax,dword ptr [IRL] 0013200F mov dword ptr [esp+18h],eax 00132013 mov dword ptr [esp+1Ch],66h 0013201B call _for_emit_diagnostic (02D3680h) 00132020 add esp,20h 00132023 mov eax,660h 00132028 add eax,dword ptr [DE_PULSE] 0013202B mov edx,dword ptr [IRL] 0013202E imul edx,edx,8 00132031 add eax,edx 00132033 add eax,0FFFFFFF8h end 00132036 mov edx,dword ptr [IAUSFL] sc_tmp(irl) = sc_tmp(irl) + edep -> here the debugger stopped 00132039 movsd xmm0,mmword ptr [eax] 0013203D movsd xmm1,mmword ptr [edx] 00132041 addsd xmm0,xmm1 00132045 mov eax,660h 0013204A add eax,dword ptr [DE_PULSE] 0013204D mov edx,dword ptr [IRL] 00132050 imul edx,edx,8 00132053 add eax,edx 00132055 add eax,0FFFFFFF8h 00132058 movsd mmword ptr [eax],xmm0 0013205C jmp AUSGAB+0D18h (01328A4h)
[...]
Thanks for your help!
Are you profiling a build with runtime debugging checks enabled?
Notice the calls to _for_emit_diagnostic. These are generated at the request of a compiler option (per IanH's suggestion about runtime checks).
When tuning a Release build you should remove the profiling and any other runtime checks (e.g. index out of range, uninitialized variable, etc.). The emit-diagnostic call sounds as if the diagnostic data is buffered in a thread-safe manner, IOW using a critical section, and then eventually written, also in a critical section. Therefore this has not only the overhead of whatever the diagnostic is doing, but also a serializing effect.
VTune may be adding options that cause the _for_emit_diagnostic calls to be inserted.
Jim Dempsey
Thanks to both of you for your comments. Well, to obtain the assembly code I had to use Debug mode (as suggested by the link).
Using just Release mode I obtain a similar pattern to the pictures shown above. Looking at the compilation options of the Release mode I did not find any runtime checks or diagnostics enabled; in the Command Line section the following appears:
/nologo /O2 /fpp /Qopenmp /module:"Release\\" /object:"Release\\" /Fd"Release\vc120.pdb" /libs:dll /threads /c
Well, just to be sure I decided to compile the program outside VS2013 (i.e. using the command prompt) with the following flags:
/fpp /Qopenmp /O3 /fast
and then I ran a concurrency analysis from the command prompt. I obtained the following results:
so now it seems that the overhead has decreased a lot compared with the previous results. There is still some overhead, but now I feel more satisfied with the results. Anyway, this program also has many small subroutines where I need to use THREADPRIVATE clauses (for example, subroutine SHOWER calls PHOTON, which calls BREMS, which calls UPHI and then AUSGAB, etc., each of them having their own THREADPRIVATE declarations), so I think that this structure is also a major contributor to the total overhead cost.
Thanks for your help!
For profiling you want debug information (so the profiler knows how each executed machine instruction corresponds to the source code); the compiler option you want is /debug:all, and if you are compiling and linking separately you also need to tell the linker to emit debug information with the linker /debug option. But you definitely do not want runtime debugging checks (the compiler option you want is /check:none), because, as Jim suggests, they introduce considerable overhead and perhaps otherwise unnecessary synchronization between threads.
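Combining that with the options used earlier in this thread, the compile line would look something like this (source file name hypothetical):

ifort /fpp /Qopenmp /O3 /debug:all /check:none mc_code.f /link /debug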
The threadprivate is not introducing appreciable overhead. In your disassembly code, most of the code was for bounds checking of arrays.
At line 173, as an example, you had
add eax,dword ptr [DE_PULSE]
What this really was is the +edep; however, the disassembly process in VTune did not take the segment override prefix into consideration. The offset of edep from the FS or GS selector, as the case may be, happens to be the same as the offset of DE_PULSE from DS/ES/CS.
If there is an option to show the instruction bytes, you would have seen the segment override prefix. The above add defaults to the DS selector, but the instruction sequence appears to contain a segment override prefix (as evidenced by the wrong symbolic address in the disassembly).
Look for factors other than threadprivate.
Jim Dempsey
Thanks for the comments.
I compiled with the /fpp /Qopenmp /O3 /fast /debug:all /check:none flags and obtained the following.
If I double-click on any subroutine like UPHI, MSCAT, etc., I jump to the declaration of that subroutine (subroutine UPHI ...). If I double-click on __kmp_get_global_thread_id_reg, it shows what seems to be assembly code of libiomp5md.dll.
Well, I will study the code more, but I have made great progress thanks to you! I really appreciate your help ;)
In looking at your first image, and without seeing your code, I will make the following assumption:
It looks as if you are using too fine a grain of parallelization, meaning parallel regions are entered very frequently, with each parallel region doing little work. If this is the case, see if you can reorganize your code so that you raise the nesting level at which you enter your parallel region.
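Schematically, with placeholder names (do_work, nbatch and nwork are illustrative, not the poster's routines):

C     fine-grained: a parallel region is created and torn down per batch
      do ibatch = 1, nbatch
C$OMP PARALLEL DO
         do i = 1, nwork
            call do_work(ibatch, i)
         end do
C$OMP END PARALLEL DO
      end do

C     coarser: one region for the whole run, fork/join overhead paid once
C$OMP PARALLEL DO
      do ibatch = 1, nbatch
         do i = 1, nwork
            call do_work(ibatch, i)
         end do
      end do
C$OMP END PARALLEL DO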
As a side thought, are you by chance using nested parallel regions? (possibly unintentionally)
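One quick way to check is to drop a test like the following inside the existing parallel region (a sketch; omp_get_level requires OpenMP 3.0 or later):

      integer omp_get_level
C     inside the existing C$OMP0PARALLEL region:
      if (omp_get_level() .gt. 1) then
         write(6,*) 'nested region! level = ', omp_get_level()
      end if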
Jim Dempsey
Well, Jim, as far as I know I do not have nested parallel regions. Is there a way to have (unintentionally) nested regions without explicitly using the PARALLEL directive? I only have one parallel region:
C$OMP0PARALLEL PRIVATE(OMP_IAM) COPYIN(/egs_vr/,/EPCONT/)
C$OMP*COPYIN(/ET_control/,/PHOTIN/,/randomm/,/UPHIOT/,/USEFUL/,/score/)
C$OMP0SINGLE
#ifdef _OPENMP
      OMP_TOT = OMP_GET_NUM_THREADS()
      WRITE(6,1060)OMP_TOT
 1060 FORMAT(//'Number of OpenMP threads = ',I2//)
C$OMP0END SINGLE NOWAIT
#endif
#ifdef _OPENMP
      OMP_IAM = OMP_GET_THREAD_NUM()
#endif
      ixx = 1802
      jxx = 9373 + OMP_IAM
      call init_ranmar
C$OMP0DO SCHEDULE(static)
      DO 1071 icase_aux=1,ncase
        icase = icase_aux
        call shower(iqin,ein,xin,yin,zin,uin,vin,win,irin,wtin)
 1071 CONTINUE
 1072 CONTINUE
C$OMP0CRITICAL
      DO 1081 I=1,102
        r_sc_array(I) = r_sc_array(I) + sc_array(I)
        r_sc_array2(I) = r_sc_array2(I) + sc_array2(I)
        r_sc_tmp(I) = r_sc_tmp(I) + sc_tmp(I)
        DO 1091 j=1,200
          r_sc_pulse_height(j,i) = r_sc_pulse_height(j,i) +
     *      sc_pulse_height(j,i)
 1091   CONTINUE
 1092   CONTINUE
 1081 CONTINUE
 1082 CONTINUE
C$OMP0END CRITICAL
C$OMP0END PARALLEL
The shower subroutine takes almost all of the runtime of the program. I thought about THREADPRIVATE (or other clauses) because this program is composed of many small subroutines, and for each loop iteration the program enters and exits these subroutines many, many times. The other strange thing is that the overhead is much lower with gfortran; for example, running this code on an AMD FX8350 (8 cores) CPU under Debian 7.5 I obtained the following graph when increasing the number of OMP threads (the number of particles is fixed):
(BQS stands for batch queuing system, which is the original parallelization method.) One can see that at the beginning the ifort performance is much better than gfortran, but as the number of OMP threads increases the runtimes start to converge... the compilation flags were:
gfortran : -cpp -fopenmp -O3 -ffast-math
ifort: -fpp -qopenmp -O3 -fast -check none (the last just in case)
At the beginning of this thread the problems were much worse, but I still have this ifort vs. gfortran performance issue... thanks for your help!
Scaling is getting nearly flat after 4 threads. You wouldn't by chance be calling functions with critical sections: a non-multi-threaded random number generator, ALLOCATE/DEALLOCATE, file I/O, etc.?
A second cause for flattening is poor cache localization.
A third cause is you reach a memory bandwidth limitation.
All of these causes can usually be corrected with better programming techniques.
Jim Dempsey
