Intel® Fortran Compiler

[OpenMP] Huge overhead cost with OpenMP

Edgardo_Doerner

Hi everyone,

I am working on a Monte Carlo code written in Fortran 77, parallelizing it with OpenMP. I am now in the testing phase of the development process, but I am facing problems with the overhead cost of the code. For example, when I analyze it using VTune Amplifier XE I obtain the following summary:

Elapsed Time:    43.352s
    Total Thread Count:    5
    Overhead Time:    16.560s
    Spin Time:    0.847s
    CPU Time:    157.369s
    Paused Time:    0s

Well, VTune complains that the overhead time is too high. What is worse, I have tested this code against gfortran and these effects are less pronounced with the latter. This is unfortunate, because without parallelization the code compiled with ifort is much faster than the gfortran build, but as I increase the number of OpenMP threads (keeping the load per thread constant) the overhead costs make the ifort version slower than the gfortran one.

What I have found is that the threads get "stalled" in a very disordered fashion, as can be seen in the VTune timeline screenshot below.

[VTune timeline screenshot]

The code has several subroutines that control the whole Monte Carlo simulation process (for example, random number generation, electron and photon transport, geometry description, etc.). These subroutines communicate with each other through COMMON blocks, so I have had to mark some of them as private using the THREADPRIVATE directive where needed. The idea is to keep the original structure of the code as much as possible, considering that this is a widely used code and the goal is to offer an easy transition to OpenMP parallelization without changing the core of the program.
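For reference, the THREADPRIVATE pattern looks roughly like this (a minimal sketch with a made-up /score/ block, not the real EGS common blocks):

      program tp_demo
      implicit none
      common/score/ edep_t, nstep_t
C$OMP THREADPRIVATE(/score/)
      real*8 edep_t
      integer*4 nstep_t
      integer*4 i
C$OMP PARALLEL
C     every thread zeroes its own copy of /score/ first
      edep_t = 0.0d0
      nstep_t = 0
C$OMP DO
      do i = 1, 1000
        call score_step
      end do
C$OMP END DO
C$OMP END PARALLEL
      end

      subroutine score_step
      implicit none
      common/score/ edep_t, nstep_t
C$OMP THREADPRIVATE(/score/)
      real*8 edep_t
      integer*4 nstep_t
C     this update touches only the calling thread's copy of /score/
      edep_t = edep_t + 1.0d0
      nstep_t = nstep_t + 1
      return
      end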

I have created a small test code that runs only the random number generator and uses it to estimate the value of PI. This test shows the same overhead problem as the original code. In it I found that the function _kmp_get_global_thread_id_reg accounts for a great part of the overhead time:

[VTune hotspot screenshot]
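For context, the structure of that PI test is roughly the following (a simplified sketch: it uses the portable RANDOM_NUMBER intrinsic instead of the EGS RANMAR routines, and the names are illustrative):

      program pi_test
      implicit none
      integer*4 i, n, hits
      real*8 x, y, pi_est
      n = 10000000
      hits = 0
C$OMP PARALLEL DO PRIVATE(x, y) REDUCTION(+:hits)
      do i = 1, n
        call random_number(x)
        call random_number(y)
C       count points falling inside the unit quarter circle
        if (x*x + y*y .le. 1.0d0) hits = hits + 1
      end do
C$OMP END PARALLEL DO
      pi_est = 4.0d0*dble(hits)/dble(n)
      write(*,*) 'estimated pi =', pi_est
      end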

Well, I would really appreciate it if someone has a tip for tackling this problem. I have tried to find information about it without success. Thanks for your help!!

 

 

Edgardo_Doerner

Well, thanks for the tips... I have been studying the code and I found the following situation:

Looking at the Amplifier analysis I found that most of the overhead cost comes from two subroutines, UPHI and MSCAT, with the random number generator subroutine RANMAR_GET a (very) distant third. The concurrency analysis is the following:

[VTune concurrency screenshot]

Inside UPHI there were some WRITE statements to a log file, so I removed them. After that I obtained the following:

[VTune screenshot after removing the WRITE statements]

So it seems that in the case of UPHI that was the problem; unfortunately, for MSCAT and RANMAR_GET I was not able to find a solution. However, using a macro one can disable the MSCAT subroutine (its body is basically replaced with a RETURN), so I did that and obtained the following:

[VTune screenshot with MSCAT disabled]

Well, I have not checked in detail the other subroutines, but all of them call RANMAR_GET. I really suspect that that subroutine is the source of the overhead of the program, but when I look at the code I am really not able to find an obvious reason for that. The RANMAR_GET code is:

      subroutine ranmar_get
      implicit none
      common/randomm/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/randomm/)
      integer*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      real*4 twom24
      integer*4 i,iopt
      DO 2591 i=1,128
        iopt = urndm(ixx) - urndm(jxx)
        IF((iopt .LT. 0))iopt = iopt + 16777216
        urndm(ixx) = iopt
        ixx = ixx - 1
        jxx = jxx - 1
        IF ((ixx .EQ. 0)) THEN
          ixx = 97
        ELSE IF(( jxx .EQ. 0 )) THEN
          jxx = 97
        END IF
        crndm = crndm - cdrndm
        IF((crndm .LT. 0))crndm = crndm + cmrndm
        iopt = iopt - crndm
        IF((iopt .LT. 0))iopt = iopt + 16777216
        rng_array(i) = iopt
2591  CONTINUE
2592  CONTINUE
      rng_seed = 1
      return
      end

What could cause the overhead? The IF statements? Or is it something related to poor cache locality? Is there a way to detect that using the Intel developer tools (like Amplifier)? Unfortunately Amplifier does not point to a specific piece of code, just the call to RANMAR_GET or the subroutine declaration... thanks for your help!

Steven_L_Intel1
Employee

Yes, Amplifier will show you the code - even down to the instruction. Double-click on the subroutine to drill down.

jimdempseyatthecove
Honored Contributor III

Will ixx ever equal jxx? (the code above requires the answer to be no)

Do ixx and jxx always have the same circular reference offset?

Consider replacing (in both places)

        iopt = urndm(ixx) - urndm(jxx)
        IF((iopt .LT. 0))iopt = iopt + 16777216
with
       iopt = IAND(urndm(ixx) - urndm(jxx), 16777215) ! note end digit is 5, making mask
or
       iopt = IAND(urndm(ixx) - urndm(jxx), 'FFFFFF'Z)

Also, if cmrndm is a power of 2, replace the IF test and the add with an IAND as above.

Depending on how smart the compiler optimization is, it may be more efficient to use

        ixx = ixx - 1
        jxx = jxx - 1
        IF(( ixx .EQ. 0 )) ixx = 97
        IF(( jxx .EQ. 0 )) jxx = 97

The reason is that recent instruction sets have conditional move instructions (thus eliminating branch instructions).

If all of the above hints apply, the DO 2591 loop will have no branches.

John_Campbell
New Contributor II

Could the performance problem be related to the use of C$OMP0THREADPRIVATE(/randomm/)?

You could try something like the code below and manage the private variables in a different way, say as arrays passed through the call and indexed by the thread number:

      subroutine test_ranmar

      integer*4, parameter :: nt = 7 ! max thread number
      integer*4 rng_array(128,0:nt), urndm(97,0:nt), crndm(0:nt),
     *          cdrndm, cmrndm,  
     *          ixx(0:nt), jxx(0:nt), rng_seed(0:nt), it
c
C$    integer*4 omp_get_thread_num  ! OpenMP runtime function
      it = 0
C$    it = omp_get_thread_num ()
      call ranmar_get (rng_array(:,it), urndm(:,it), crndm(it), 
     *                 cdrndm, cmrndm, 
     *                 ixx(it), jxx(it), rng_seed(it) )

      end

      subroutine ranmar_get (rng_array, urndm, crndm, cdrndm, cmrndm, 
     *                       ixx, jxx, rng_seed)
      implicit none
      integer*4 rng_array(128), urndm(97), crndm, cdrndm, cmrndm,  
     *          ixx, jxx, rng_seed
      integer*4 i,iopt
C
      DO 2591 i=1,128
        iopt  = urndm(ixx) - urndm(jxx) 
            IF (iopt  < 0) iopt  = iopt  + 16777216
        urndm(ixx) = iopt
        ixx   = ixx   - 1
            IF (ixx  == 0) ixx   = 97
        jxx   = jxx   - 1
            IF (jxx  == 0) jxx   = 97
        crndm = crndm - cdrndm 
            IF (crndm < 0) crndm = crndm + cmrndm
        iopt  = iopt  - crndm
            IF (iopt  < 0) iopt  = iopt  + 16777216
        rng_array(i) = iopt
2591  CONTINUE
      rng_seed = 1
      return
      end

 

Edgardo_Doerner

@Steve,

My problem is that when I double-click on the subroutine (e.g. RANMAR_GET), Amplifier opens a tab with the source code, but it only shows me the entry point of the subroutine (i.e. the SUBROUTINE RANMAR_GET statement). If I double-click that line the source code is opened in VS, so I have no clue which specific line (or lines) of code is causing the problem.

@Jim

I modified my code according to your suggestions and I obtained some improvement. The code now looks as follows:

      subroutine ranmar_get
      implicit none
      common/randomm/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/randomm/)
      integer*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      real*4 twom24
      integer*4 i,iopt
C ED: RANMAR_GET modification following Jim's suggestions
      IF((rng_seed .EQ. 999999))call init_ranmar
      DO 2591 i=1,128
C        iopt = urndm(ixx) - urndm(jxx)
C        IF((iopt .LT. 0))iopt = iopt + 16777216
        iopt = IAND(urndm(ixx) - urndm(jxx), 'FFFFFF'Z)
CCC
        urndm(ixx) = iopt
        ixx = ixx - 1
        jxx = jxx - 1
        IF ((ixx .EQ. 0)) ixx = 97
        IF(( jxx .EQ. 0 )) jxx = 97
CCC 
C        crndm = crndm - cdrndm
C        IF((crndm .LT. 0))crndm = crndm + cmrndm
        crndm = IAND(crndm - cdrndm, cmrndm-1)
CCC
C        iopt = iopt - crndm
C        IF((iopt .LT. 0))iopt = iopt + 16777216
         iopt = IAND(iopt - crndm, 'FFFFFF'Z)
CCC
        rng_array(i) = iopt
2591  CONTINUE
2592  CONTINUE
      rng_seed = 1
      return
      end

Looking at the code I realized that the first IF statement is never true (rng_seed .EQ. 999999), so I decided to remove it. Unfortunately the result is disastrous:

[VTune screenshot after removing the IF]

So, what happened??? Now UPHI again gives problems and SSCAT (a scattering-related subroutine) appeared... but at least RANMAR_GET disappeared xD...

Well, I have an idea: how do GO TO statements affect OpenMP? For example, the UPHI subroutine is full of them; here is the code:

      SUBROUTINE UPHI(IENTRY,LVL)
      ! Copyright National Research Council of Canada, 2000.
      ! All rights reserved.
      implicit none
      COMMON/EPCONT/EDEP,TSTEP,TUSTEP,USTEP,TVSTEP,VSTEP, RHOF,EOLD,ENEW
     *,EKE,ELKE,GLE,E_RANGE, x_final,y_final,z_final, u_final,v_final,w_
     *final, IDISC,IROLD,IRNEW,IAUSFL(31)
C$OMP0THREADPRIVATE(/EPCONT/)
      DOUBLE PRECISION EDEP
      real*8 TSTEP,  TUSTEP,  USTEP,  VSTEP,  TVSTEP,  RHOF,  EOLD,  ENE
     *W,  EKE,  ELKE,  GLE,  E_RANGE, x_final,y_final,z_final,  u_final,
     *v_final,w_final
      integer*4 IDISC,  IROLD,  IRNEW,  IAUSFL
      COMMON/STACK/ E(50),X(50),Y(50),Z(50),U(50),V(50),W(50),DNEAR(50),
     *WT(50),IQ(50),IR(50),LATCH(50), LATCHI,NP,NPold
C$OMP0THREADPRIVATE(/STACK/)
      DOUBLE PRECISION E
      real*8 X,Y,Z,  U,V,W,  DNEAR,  WT
      integer*4 IQ,  IR,  LATCH,  LATCHI, NP,  NPold
      COMMON/UPHIIN/SINC0,SINC1,SIN0(1002),SIN1(1002)
      real*8 SINC0,SINC1,SIN0,SIN1
      COMMON/UPHIOT/THETA,SINTHE,COSTHE,SINPHI, COSPHI,PI,TWOPI,PI5D2
C$OMP0THREADPRIVATE(/UPHIOT/)
      real*8 THETA,  SINTHE,  COSTHE,  SINPHI,  COSPHI,  PI,TWOPI,PI5D2
      common/randomm/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/randomm/)
      integer*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      real*4 twom24
      common /egs_io/ file_extensions(20), file_units(20), user_code,  i
     *nput_file,  output_file, pegs_file,  hen_house,  egs_home,  work_d
     *ir,  host_name,  n_parallel,  i_parallel,  first_parallel, n_max_p
     *arallel, n_chunk,  n_files, i_input,  i_log,  i_incoh,  i_nist_dat
     *a,  i_mscat,  i_photo_cs,  i_photo_relax,  xsec_out,  is_batch
      character input_file*256, output_file*256, pegs_file*256, file_ext
     *ensions*10, hen_house*128, egs_home*128, work_dir*128, user_code*6
     *4, host_name*64
      integer*4 n_parallel, i_parallel, first_parallel,n_max_parallel, n
     *_chunk, file_units, n_files,i_input,i_log,i_incoh, i_nist_data,i_m
     *scat,i_photo_cs,i_photo_relax, xsec_out
      logical is_batch
      integer IENTRY,LVL
      real*8 CTHET,  RNNO38,  PHI,  CPHI,  A,B,C,  SINPS2,  SINPSI,  US,
     *VS,  SINDEL,COSDEL
      integer*4 IARG,  LPHI,LTHETA,LCTHET,LCPHI
      real*8 xphi,xphi2,yphi,yphi2,rhophi2
      save CTHET,PHI,CPHI,A,B,C,SINPS2,SINPSI,US,VS,SINDEL,COSDEL
C$OMP0THREADPRIVATE(CTHET,PHI,CPHI,A,B,C,SINPS2)
C$OMP0THREADPRIVATE(SINPSI,US,VS,SINDEL,COSDEL)
      IARG=21
      IF ((IAUSFL(IARG+1).NE.0)) THEN
        CALL AUSGAB(IARG)
      END IF
      GO TO (6740,6750,6760),IENTRY
      GO TO 6770
6740  CONTINUE
      SINTHE=sin(THETA)
      CTHET=PI5D2-THETA
      COSTHE=sin(CTHET)
6750  CONTINUE
6781  CONTINUE
        IF((rng_seed .GT. 128))call ranmar_get
        xphi = rng_array(rng_seed)*twom24
        rng_seed = rng_seed + 1
        xphi = 2*xphi - 1
        xphi2 = xphi*xphi
        IF((rng_seed .GT. 128))call ranmar_get
        yphi = rng_array(rng_seed)*twom24
        rng_seed = rng_seed + 1
        yphi2 = yphi*yphi
        rhophi2 = xphi2 + yphi2
        IF(rhophi2.LE.1)GO TO6782
      GO TO 6781
6782  CONTINUE
      rhophi2 = 1/rhophi2
      cosphi = (xphi2 - yphi2)*rhophi2
      sinphi = 2*xphi*yphi*rhophi2
6760  GO TO (6790,6800,6810),LVL
      GO TO 6770
6790  A=U(NP)
      B=V(NP)
      C=W(NP)
      GO TO 6820
6810  A=U(NP-1)
      B=V(NP-1)
      C=W(NP-1)
6800  X(NP)=X(NP-1)
      Y(NP)=Y(NP-1)
      Z(NP)=Z(NP-1)
      IR(NP)=IR(NP-1)
      WT(NP)=WT(NP-1)
      DNEAR(NP)=DNEAR(NP-1)
      LATCH(NP)=LATCH(NP-1)
6820  SINPS2=A*A+B*B
      IF ((SINPS2.LT.1.0E-20)) THEN
        U(NP)=SINTHE*COSPHI
        V(NP)=SINTHE*SINPHI
        W(NP)=C*COSTHE
      ELSE
        SINPSI=SQRT(SINPS2)
        US=SINTHE*COSPHI
        VS=SINTHE*SINPHI
        SINDEL=B/SINPSI
        COSDEL=A/SINPSI
        U(NP)=C*COSDEL*US-SINDEL*VS+A*COSTHE
        V(NP)=C*SINDEL*US+COSDEL*VS+B*COSTHE
        W(NP)=-SINPSI*US+C*COSTHE
      END IF
      IARG=22
      IF ((IAUSFL(IARG+1).NE.0)) THEN
        CALL AUSGAB(IARG)
      END IF
      RETURN
6770  END

MSCAT and SSCAT also have some GO TO statements. Maybe this is the source of the problem? As I stated before, Amplifier does not show any specific line as the source of the overhead; it always shows the subroutine declaration, nothing more.

@John

Jim pointed out above that THREADPRIVATE should have a low overhead cost, and doing some testing I found the same, so I do not think that is the problem.
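For what it is worth, a micro-test of that kind can be as simple as the following sketch (timing a loop that repeatedly updates a THREADPRIVATE common; the /tpblk/ block and the counter are made up for illustration):

      program tp_timing
      implicit none
      real*8 omp_get_wtime
      external omp_get_wtime
      common/tpblk/ counter
C$OMP THREADPRIVATE(/tpblk/)
      integer*4 counter
      integer*4 i
      real*8 t0, t1
      t0 = omp_get_wtime()
C$OMP PARALLEL PRIVATE(i)
      counter = 0
C     each thread hammers its own private copy of the common block
C     (an optimizing compiler may simplify this loop; build with low
C     optimization if the goal is just to time the access)
      do i = 1, 100000000
        counter = counter + 1
      end do
C$OMP END PARALLEL
      t1 = omp_get_wtime()
C     the master thread's copy of counter survives the parallel region
      write(*,*) 'time (s):', t1 - t0, ' counter:', counter
      end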

Thanks all for your help!!

TimP
Honored Contributor III

GOTO is no problem for OpenMP as long as it doesn't jump into or out of OpenMP constructs. Saved variables and the lack of a RECURSIVE procedure declaration are big red flags.

Egregious spaghetti code will at least require you to use the assembly view.

jimdempseyatthecove
Honored Contributor III

>>Looking at the code I realized that the first IF statement is never true (rng_seed .EQ. 999999), so I decided to remove it. Unfortunately the result is disastrous

rng_seed is thread private... and should have been initialized to 999999 at program start.

Either that or call init_ranmar

PROGRAM foo
...
! first parallel region
!$OMP PARALLEL
call init_ranmar() ! or rng_seed = 999999
!$OMP END PARALLEL

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Please use:
 

module mod_ThreadPrivate
      COMMON/EPCONT/EDEP,TSTEP,TUSTEP,USTEP,TVSTEP,VSTEP, RHOF,EOLD,ENEW &
     &,EKE,ELKE,GLE,E_RANGE, x_final,y_final,z_final, u_final,v_final, &
     &w_final, IDISC,IROLD,IRNEW,IAUSFL(31) 
!$OMP THREADPRIVATE(/EPCONT/) 
end module mod_ThreadPrivate
...
SUBROUTINE UPHI(IENTRY,LVL)
! Copyright..
! All ...
  use mod_ThreadPrivate
  implicit none

This avoids the issue of keeping all declarations of COMMON/EPCONT/ the same

Jim Dempsey

Edgardo_Doerner

@Jim,

The initialization of the RANMAR RNG is done outside ranmar_get, at the beginning of the parallel region (the piece of code is in post #20 above); for that reason the IF statement should never be entered.

I did not get the module idea. Do I put the COMMON declaration inside the module and then declare just the needed variables outside?

@Tim

All the problematic subroutines have SAVE variables... but I suppose I would need a deeper study of the code to be able to modify them... well, that is the result of more than 30 years of history of this MC code... hehehe. By the way, what do you mean by "lack of a RECURSIVE procedure declaration"?

Thanks for your help!

 

jimdempseyatthecove
Honored Contributor III

In the sample above, you would remove the COMMON/EPCONT/... from your subroutine UPHI and every other place that has COMMON/EPCONT/...

Instead you would have USE mod_ThreadPrivate (or whatever you want to name it).

By default, all the variables contained in the use'd modules are available to the program unit with the USE statement.

As you have now, one subroutine could have:

COMMON/EPCONT/EDEP,TSTEP,TUSTEP,USTEP,TVSTEP,VSTEP, RHOF,EOLD,ENEW ...

While a different one could have:

COMMON/EPCONT/EDEP,TSTEP,TUSTEP,USTEP,TVSTEP,VSTEP, NewSTEP, RHOF,EOLD,ENEW ...

Making code maintenance an issue

The module construct eliminates the maintenance issue.

*** You would have an issue doing this if different routines correctly used different variable names and/or types in the same named common.

Often in old programs you may see

COMMON/TEMPORARIES/ X,Y,Z

in one place and

COMMON/TEMPORARIES/ I,R,K

in a different place, all working well even different variable types in the same storage.

Jim Dempsey

TimP
Honored Contributor III

A RECURSIVE declaration avoids depending on compile options to prevent extra default-SAVE variables, which do not act as private.

Edgardo_Doerner

Tim Prince wrote:

A RECURSIVE declaration avoids depending on compile options to prevent extra default-SAVE variables, which do not act as private.

Hi Tim, would it be possible for you to show an example of that? I really do not get your idea and I have not been able to find additional info, only about the RECURSIVE attribute on subroutine or function declarations, nothing about variables... Thanks for your help!

jimdempseyatthecove
Honored Contributor III

EnDoemer.

subroutine foo
real :: position(3) ! local scoped array containing the position X,Y,Z
real :: temp ! local scoped scalar

In the above subroutine, without the RECURSIVE attribute and without compiler options that equivalently provide a recursive-like attribute, the array position is implicitly SAVE, whereas the scalar temp is on the stack. That means there is one copy of position shared by all threads calling the subroutine. If this subroutine were called concurrently by multiple threads without one of the attributes or options that require it to be on the stack, there would be an error in the program.

The cure is one of:

recursive subroutine foo

real, automatic :: position(3)   ! Intel-specific AUTOMATIC attribute

ifort /recursive /c foo.f90
ifort /auto  /c foo.f90
ifort /Qopenmp  /c foo.f90

Note: if you use or build a library, either static or dynamic, and it was not built with /auto, /Qopenmp, or /recursive, or with RECURSIVE on the subroutine and function statements, then you must not link this library into a multi-threaded program, as the code is not thread safe. This is true even if there are no multi-threading statements within the library.

Jim Dempsey

 

TimP
Honored Contributor III

I also agree with the concerns Jim expressed about COMMON. I'm having difficulty commenting on a tablet browser.

jimdempseyatthecove
Honored Contributor III

Steve Lionel mentioned on a different thread that in the next Fortran standard (2015?) all functions and subroutines are implicitly recursive. This means locally scoped arrays default to automatic, and thus you may need to explicitly use SAVE when you require the SAVE attribute (as opposed to currently not knowing whether the arrays are SAVE or AUTOMATIC).
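To make the distinction concrete, a minimal sketch (routine and variable names made up for illustration):

      subroutine accumulate(x)
      implicit none
      real*8 x
C     explicit SAVE + DATA: one static copy, shared by every thread
      real*8 total(3)
      save total
      data total/3*0.0d0/
C     no SAVE: automatic when compiled with /Qopenmp, /auto or
C     /recursive (or inside a RECURSIVE procedure), so each call
C     and each thread gets its own copy on the stack
      real*8 scratch(3)
      scratch(1) = x
      scratch(2) = 2.0d0*x
      scratch(3) = 3.0d0*x
      total(1) = total(1) + scratch(1)
      total(2) = total(2) + scratch(2)
      total(3) = total(3) + scratch(3)
      return
      end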

Jim Dempsey

John_Campbell
New Contributor II

Tim,

>>   Saved variables and the lack of a RECURSIVE procedure declaration are big red flags.

They are not big red flags, they simply do not work.

I don't understand why EnDoemer is persisting with common variables inside the !$OMP region if he wants any of them to be private.  Either they should be shared or their final interaction should be defined. I would expect they would need some form of !$OMP REDUCTION(operation : variable) for this to work effectively.

I would recommend reviewing the code structure and removing the PRIVATE use of COMMON variables from within the !$OMP region.

Jim, I thought that for many versions the Fortran standard has implied that local variables are not static, but automatic or dynamic. Again, if a private variable is given the SAVE attribute, its interaction and update between threads should be explicitly defined. Their value on entry, and what value is adopted on exit from the parallel region, should all be defined. After all, why are they private unless they take different values between threads inside the parallel region? This interaction should not be implicitly managed.

John

Edgardo_Doerner

@John

It is not that I "insist" on using COMMON variables; the original code was structured that way and my idea, as a first step, is to keep the code close to the original. This platform (Electron Gamma Shower - EGS) has been used for several decades, and the idea is not to make a radical change to it, at least as a first approach. It is clear to me that sooner or later I will have to make major modifications to the code to improve its performance further.

The main problem with eliminating the COMMON blocks that are private to each thread is that the structure of this platform uses them to communicate data between the different subroutines (i.e. there are almost no subroutines with arguments). So it will take a lot of modifications to start handling the private variables as arguments to the different subroutines. Eventually I would like to do that (for example, to be able to use reduction clauses for the scoring variables; at the moment I have to use a CRITICAL section to add up the results of each thread), but it will take some time.
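For illustration, the difference between the two approaches looks roughly like this (a simplified sketch with a made-up scoring variable, not the actual EGS scoring code):

      program scoring_demo
      implicit none
      integer*4 i
      real*8 edep_local, edep_total
C
C     current approach: per-thread partial sums added in a CRITICAL
      edep_total = 0.0d0
C$OMP PARALLEL PRIVATE(i, edep_local)
      edep_local = 0.0d0
C$OMP DO
      do i = 1, 1000000
        edep_local = edep_local + 1.0d-3
      end do
C$OMP END DO
C$OMP CRITICAL
      edep_total = edep_total + edep_local
C$OMP END CRITICAL
C$OMP END PARALLEL
      write(*,*) 'critical  :', edep_total
C
C     alternative: let OpenMP combine the partial sums via REDUCTION
      edep_total = 0.0d0
C$OMP PARALLEL DO REDUCTION(+:edep_total)
      do i = 1, 1000000
        edep_total = edep_total + 1.0d-3
      end do
C$OMP END PARALLEL DO
      write(*,*) 'reduction :', edep_total
      end

With REDUCTION the per-thread copies and the final combination are handled by the runtime, so no explicit CRITICAL section is needed.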

Thanks for your comments.

jimdempseyatthecove
Honored Contributor III

John,

Up until the next Fortran spec, local scalars default to the stack; other local variables default to SAVE/static.

From IVF:

By default, the compiler allocates local scalar variables on the stack. Other non-allocatable variables of non-recursive subprograms are allocated in static storage by default. This default can be changed through compiler options. Appropriate use of the SAVE attribute may be required if your program assumes that local variables retain their definition across subprogram calls.

To quote Dirty Harry: Do you feel lucky, ...

Jim

TimP
Honored Contributor III

OpenMP has support for threadprivate common blocks. It's not trivial to figure out, and certainly beyond the advice possible with the limited view given here.

 

 

Edgardo_Doerner

Tim Prince wrote:

OpenMP has support for threadprivate common blocks. It's not trivial to figure out, and certainly beyond the advice possible with the limited view given here.

Well, I use the C$OMP0THREADPRIVATE directive for the COMMON blocks that are private to each thread; is that what you mean?

John_Campbell
New Contributor II

C$OMP0THREADPRIVATE looks to me like a way of ignoring important code design for OpenMP. In complex code it is not easy to reproduce the validation process that went into the original single-threaded code. Importantly, which thread will define the exit values of the common variables?

There are many areas where OpenMP should be used carefully. I have yet to understand how to apply OpenMP to Monte Carlo simulation approaches: although there may be thread-safe random number generators, will the different threads behave independently, or will there be some correlation between threads using an in-sequence pseudo-random number generator?

These areas affect the usefulness of results obtained with OpenMP, especially when applied to complex calculations that cannot be easily tested. Having now understood how to apply OpenMP to a skyline equation solver, my next challenge will be to see how to apply the process to an iterative eigensolver, where these issues may need to be better understood.
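Regarding the correlation concern, one common mitigation is to give each thread's private generator state a distinct seed when the parallel region starts. A minimal sketch (the /myrng/ block and the seed-offset formula are made up for illustration; a production code would use properly spaced or verified-independent streams):

      program seed_demo
      implicit none
      integer*4 omp_get_thread_num
      external omp_get_thread_num
      common/myrng/ rng_seed
C$OMP THREADPRIVATE(/myrng/)
      integer*4 rng_seed
      integer*4 base_seed
      base_seed = 123456789
C$OMP PARALLEL
C     each thread stores a distinct seed in its own copy of /myrng/
      rng_seed = base_seed + 1000003*omp_get_thread_num()
      write(*,*) 'thread', omp_get_thread_num(), ' seed =', rng_seed
C$OMP END PARALLEL
      end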

John
