Solved: Huge overhead cost with IFORT OpenMP - Part 2

Edgardo_Doerner · ‎12-03-2015

Dear all,

maybe some of you remember my first topic with this issue https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/532179. I am working with a code written in Fortran 77 and I was able to introduce parallelism using OpenMP. However, the
overhead effects are very high, especially when the Intel Fortran Compiler is used. For example, for a Intel Core i7 with 8 Threads I have the following execution times (in seconds):

Serial code: 811.3
ifort OpenMP, 1 thread: 864.4
ifort OpenMP, 8 threads: 169.1

So, you can see that the overhead introduced to the program when OpenMP is enabled is around 6.5% of the serial time, that seems excessive.
After some time and following the advices given in the first post I decided to give a second try to this issue. I profiled my program using Amplifier XE and the report shows a couple of subroutines that share most of the overhead time. For example, consider the following subroutine:

      REAL*8 FUNCTION spin_rejection(qel,medium,elke,beta2,q1,cost, spin
     *_index,is_single)
      IMPLICIT NONE
      REAL*8 elke,beta2,q1,cost
      INTEGER*4 qel,medium
      LOGICAL spin_index,is_single
      COMMON/SPIN_DATA/ spin_rej(1,0:1,0: 31,0:15,0:31), espin_min,espin
     *_max,espml,b2spin_min,b2spin_max, dbeta2,dbeta2i,dlener,dleneri,dq
     *q1,dqq1i, fool_intel_optimizer
      REAL*4 spin_rej,espin_min,espin_max,espml,b2spin_min,b2spin_max, d
     *beta2,dbeta2i,dlener,dleneri,dqq1,dqq1i
      LOGICAL fool_intel_optimizer
      COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/RANDOMM/)
      INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      REAL*4 twom24
      REAL*8 rnno,ai,qq1,aj,xi,ak
      INTEGER*4 i,j,k
      SAVE i,j
C$OMP0THREADPRIVATE(i,j)
      IF (( spin_index )) THEN
      spin_index = .false.
      IF (( beta2 .GE. b2spin_min )) THEN
      ai = (beta2 - b2spin_min)*dbeta2i
      i = ai
      ai = ai - i
      i = i + 15 + 1
      ELSE IF(( elke .GT. espml )) THEN
      ai = (elke - espml)*dleneri
      i = ai
      ai = ai - i
      ELSE
      i = 0
      ai = -1
      END IF
      IF((rng_seed .GT. 128))CALL ranmar_get
      rnno = rng_array(rng_seed)*twom24
      rng_seed = rng_seed + 1
      IF((rnno .LT. ai))i = i + 1
      IF (( is_single )) THEN
      j = 0
      ELSE
      qq1 = 2*q1
      qq1 = qq1/(1 + qq1)
      aj = qq1*dqq1i
      j = aj
      IF (( j .GE. 15 )) THEN
      j = 15
      ELSE
      aj = aj - j
      IF((rng_seed .GT. 128))CALL ranmar_get
      rnno = rng_array(rng_seed)*twom24
      rng_seed = rng_seed + 1
      IF((rnno .LT. aj))j = j + 1
      END IF
      END IF
      END IF
      xi = SQRT(0.5*(1-cost))
      ak = xi*31
      k = ak
      ak = ak - k
      spin_rejection = (1-ak)*spin_rej(medium,qel,i,j,k) + ak*spin_rej(m
     *edium,qel,i,j,k+1)
      RETURN
      END

The analysis shows that a lot of overhead occurs when the ranmar_get subroutine is called. It corresponds to:

      SUBROUTINE ranmar_get
      IMPLICIT NONE
      COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/RANDOMM/)
      INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      REAL*4 twom24
      INTEGER*4 i,iopt
      IF((rng_seed .EQ. 999999))CALL init_ranmar
      DO 2511 i=1,128
      iopt = urndm(ixx) - urndm(jxx)
      IF((iopt .LT. 0))iopt = iopt + 16777216
      urndm(ixx) = iopt
      ixx = ixx - 1
      jxx = jxx - 1
      IF ((ixx .EQ. 0)) THEN
      ixx = 97
      ELSE IF(( jxx .EQ. 0 )) THEN
      jxx = 97
      END IF
      crndm = crndm - cdrndm
      IF((crndm .LT. 0))crndm = crndm + cmrndm
      iopt = iopt - crndm
      IF((iopt .LT. 0))iopt = iopt + 16777216
      rng_array(i) = iopt
2511  CONTINUE
2512  CONTINUE
      rng_seed = 1
      RETURN
      END

Well, I decided to give a look into the assembly of the spin_rejection function and, for example, at the line when ranmar_get is first called I obtained the following:

0x4202fd Block 196:
0x4202f8 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202f0 9,019 lea 0x340(%rsp), %rdx
0x4202e9 9,019 movl %r13d, 0x2d0(%rdi)
0x4202e4 9,019 lea 0x70(%rsp), %rdi
0x4202df 9,019 mov $0x480b14, %esi
0x4202df Block 195:
0x4202da 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202d2 9,019 lea 0x320(%rsp), %rdx
0x4202cb 9,019 movl %r15d, 0x2b0(%rdi)
0x4202c6 9,019 lea 0x70(%rsp), %rdi
0x4202c1 9,019 mov $0x480b0c, %esi

[...]

that pattern repeats through 500 lines of the assembly code... At the exit of the spin_rejection function I have several calls to the functions __kmpc_threadprivate_cached and __kmpc_global_thread_num, with a huge overhead time for the latter.

For some reason and remembering that someone points out that maybe is something related to bounds checking of arrays so I decided to reformat the RNG to avoid any use of arrays. Essentially now the generator is called each time a random number is needed and not each a certain number of expended random numbers. So now I have the following version of the RNG:

      SUBROUTINE ranmar_get
      IMPLICIT NONE
      COMMON/RANDOMM/ rng_array, urndm(97), crndm, cdrndm, cmrndm, i4opt
     *, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/RANDOMM/)
      INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      REAL*4 twom24
      INTEGER*4 iopt
      iopt = urndm(ixx) - urndm(jxx)
      IF((iopt .LT. 0))iopt = iopt + 16777216
      urndm(ixx) = iopt
      ixx = ixx - 1
      jxx = jxx - 1
      IF ((ixx .EQ. 0)) THEN
      ixx = 97
      ELSE IF(( jxx .EQ. 0 )) THEN
      jxx = 97
      END IF
      crndm = crndm - cdrndm
      IF((crndm .LT. 0))crndm = crndm + cmrndm
      iopt = iopt - crndm
      IF((iopt .LT. 0))iopt = iopt + 16777216
      rng_array = iopt
      RETURN
      END

the main change is that now rng_array is a single integer variable and not an array of a certain number of elements. So now the calling sequence in spin_rejection is simple CALL ranmar_get and I obtain the following assembly code for that section:

0x417ceb	8,959	sub %eax, %r8d
0x417ceb		Block 25:
0x417ce4	8,959	addl  0x190(%r15), %eax
0x417ce4		Block 24:
0x417ce2	8,959	jns 0x417ceb <Block 25>
0x417ce0	8,959	sub %ebp, %eax
0x417cd9	8,959	movl  0x188(%r15), %eax
0x417cd2	8,959	movl  0x18c(%r15), %ebp
0x417cd2		Block 23:
0x417ccf	8,959	cmovz %eax, %ecx
0x417ccd	8,959	test %ecx, %ecx
0x417cc8	8,959	mov $0x61, %eax
0x417cc8		Block 22:
0x417cc6	8,959	jmp 0x417cd2 <Block 23>
0x417cc1	8,959	mov $0x61, %edx
0x417cc1		Block 21:
0x417cbf	8,959	jnz 0x417cc8 <Block 22>
0x417cbd	8,959	dec %edx
0x417cb9	8,959	movl  %r8d, (%r15,%rdx,4)
0x417cb5	8,959	cmovs %eax, %r8d
0x417cae	8,959	lea 0x1000000(%r8), %eax
0x417cab	8,959	test %r8d, %r8d
0x417ca9	8,959	dec %ecx
0x417ca5	8,959	subl  (%r15,%rcx,4), %r8d
0x417ca1	8,959	movl  (%r15,%rdx,4), %r8d
0x417c9e	8,959	movsxd %ecx, %rcx
0x417c97	8,959	movl  0x19c(%r15), %ecx
0x417c94	8,959	movsxd %edx, %rdx
0x417c8d	8,959	movl  0x198(%r15), %edx

that is the entire assembly code for the first call of ranmar_get. Well, it is much simpler that the original version. The overhead cost of the call to ranmar_get inside spin_rejection also decreases heavily. __kmpc_threadprivate_cached and __kmpc_global_thread_num are also called at the end of the spin_rejection subroutine, but the overhead cost is much lower. I have starting believing that the use of arrays in the RNG is the source, in some way, of the overhead when OpenMP is enabled within the ifort compiler. Well, I have even tested other RNG with much simpler structures and I do not have any overhead cost pointed out by Amplifier XE. Even __kmpc_threadprivate_cached and __kmpc_global_thread_num are not called at all at the end of the spin_rejection subroutine.

I would like to know if something related with arrays could be the source of the overhead costs in my program, and if there is a way to alleviate this problem. I am not a programming expert so it is quite difficult to diagnose this problem and it limits the scalability of my program.

Thanks for your help!

jimdempseyatthecove · ‎12-28-2015

In spin_ranmar_rejection_test, remove i,j from being declared as threadprivate and save, and place into the tread private RNG_P, and initialize at initialization of other data members. The Dummy argument RNG will then be used to reference its enclosed i an j. The de-reference of the Dummy RNG should be faster than the locating of the tls data then dereferencing that. To avoid confusion, I suggest you make the thread private COMMON name (holding RNG_P) as tlsRNG.

      MODULE RANMAR_MOD
            
        TYPE RANMAR_T
          integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx,  
     *fool_optimizer,rng_seed,rng_array(128)
          integer*4 :: i,j
          real*4 twom24
        END TYPE RANMAR_T

        TYPE(RANMAR_T), POINTER :: RNG_P 
          COMMON/tlsRNG/ RNG_P 
C$OMP0THREADPRIVATE(/tlsRNG/)

      END MODULE RANMAR_MOD

and

      REAL*8 FUNCTION spin_rejection_test(qel,medium,elke,beta2,q1,cost,
     *spin_index,is_single,RNG)
      USE RANMAR_MOD
      implicit none
      TYPE(RANMAR_T) :: RNG
      ...
C      integer*4 i,j,k
C      save i,j
C C$OMP0THREADPRIVATE(i,j)
        integer*4 k
      IF (( spin_index )) THEN
        spin_index = .false.
        IF (( beta2 .GE. b2spin_min )) THEN
          ai = (beta2 - b2spin_min)*dbeta2i
          RNG%i = ai
          ai = ai - RNG%i
          RNG%i =  RNG%i + 15 + 1
        ELSE IF(( elke .GT. espml )) THEN
          ai = (elke - espml)*dleneri
          RNG%i = ai
          ai = ai - RNG%i
        ELSE
          RNG%i = 0
          ai = -1
        END IF
... continue changing references to i and j

Note, you can use ASSOCIATE i => RNG%I And same with j (remember to END ASSOCIATE).

Jim Dempsey

View solution in original post

jimdempseyatthecove · ‎12-03-2015

I do not see anything in your provided code that is calling for_write_seq_lis_xmit.

It would be helpful if you can show your timed section (you can excise code unrelated to OpenMP region entry/exit/control)

If you have issues with overhead of functions __kmpc_threadprivate_cached and __kmpc_global_thread_num then I suggest you change your strategy from using Thread Local Storage for your random number generator to using an array of user defined types (containing what was in your COMMON TLS block), and then presumably you have a loop in your parallel region, separate the start of the parallel region from the parallel loop and obtain the OpenMP global thread number once outside the parallel loop. This will reduce the overhead to once per loop (per thread) from once per iteration of loop.

Jim Dempsey

Edgardo_Doerner · ‎12-04-2015

Hi Jim,

This is the (only) parallel region of my code:

C$OMP0PARALLEL PRIVATE(omp_iam) COPYIN(/EGS_VR/,/EPCONT/)
C$OMP*COPYIN(/ET_CONTROL/,/PHOTIN/,/RANDOMM/,/UPHIOT/,/USEFUL/,/SCORE/)
C$OMP*NUM_THREADS(omp_tot)
#ifdef _OPENMP
      omp_iam = OMP_GET_THREAD_NUM()
#endif
      ixx = 1802 + omp_iam
      jxx = 9373
      CALL init_ranmar
C$OMP0DO SCHEDULE(static)
      DO 1051 omp_icase=1,ncase
      icase = omp_icase
      CALL shower(iqin,ein,xin,yin,zin,uin,vin,win,irin,wtin)
C$OMP0CRITICAL (PROGRESS)
      ibatch = ibatch + 1
      IF (( MOD(ibatch,nperbatch) .EQ. 0)) THEN
      WRITE(6,'(a,i2,a,i2)') '+ Finished batch ',ibatch/nperbatch, ' out
     * of ',nbatch
      END IF
C$OMP0END CRITICAL (PROGRESS)
1051  CONTINUE
1052  CONTINUE
C$OMP0END PARALLEL

So I have already separated the start of the parallel region from the parallel loop. The shower function then calls all the needed functions and subroutines, among them spin_rejection.

About the use of the TLS in the RNG, do you mean to create a structure like the following:

      TYPE RANDOMM
      INTEGER urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, 
      fool_optimizer, rng_seed, rng_array(128)
      REAL twom24
      END TYPE RANDOMM

and then declare an array, where each element is used by one of the executing threads?.

About the call to for_write_seq_lis_xmit, maybe I can put the code of the subroutine from where spin_rejection is called:

      SUBROUTINE sscat(chia2,elke,beta2,qel,medium,spin_effects,cost,sin
     *t)
      IMPLICIT NONE
      REAL*8 chia2,elke,beta2,cost,sint
      INTEGER*4 qel,medium
      LOGICAL spin_effects
      COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm,
     *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed
C$OMP0THREADPRIVATE(/RANDOMM/)
      INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti
     *mizer,rng_seed,rng_array
      REAL*4 twom24
      REAL*8 xi,rnno,rejf,spin_rejection,qzero
      LOGICAL spin_index
      spin_index = .true.
5080  CONTINUE
      IF((rng_seed .GT. 128))CALL ranmar_get
      xi = rng_array(rng_seed)*twom24
      rng_seed = rng_seed + 1
      xi = 2*chia2*xi/(1 - xi + chia2)
      cost = 1 - xi
      IF (( spin_effects )) THEN
      qzero=0
      rejf = spin_rejection(qel,medium,elke,beta2,qzero,cost,spin_index,
     *.true.)
      IF((rng_seed .GT. 128))CALL ranmar_get
      rnno = rng_array(rng_seed)*twom24
      rng_seed = rng_seed + 1
      IF((rnno .GT. rejf))GOTO 5080
      END IF
      sint = SQRT(xi*(2 - xi))
      RETURN
      END

I have also attached the entire code, if that could be of help. Thanks for your help!

jimdempseyatthecove · ‎12-04-2015

What does Amplifier(VTune) now show as the leading cause for overhead?

If the OpenMP related overhead is insignificant then there is nothing more to do.

If significant overhead continues to be functions __kmpc_threadprivate_cached and __kmpc_global_thread_num then I suggest you change your strategy from using Thread Local Storage to using a user defined types (containing what was in your COMMON TLS blocks).

I suggest adding a module that defines a user defined type containing what was in your commons that were thread private. This will require making them not COMMON. You might also want to place into the module the shared variables (not as COMMON). As an example you could call the user defined type as Context_t Then adding to your main program a USE of that module and pointer (or allocatable) of that type.

type(Context_t), pointer :: Context => NULL() ! try pointer first, then switch over to ALLOCATABLE later

and, add that to the PRIVATE clause and COPYIN of the (only) parallel region.
and, inside the parallel region but before the parallel DO, add an ALLOCATE(Context) so each thread has its own Context.
Then add Context as the first argument to the subroutine call lists, and all routines beneath those that require the context.

You will have to modify the called routines that require the context to have the Context listed, and a USE for that context.

The member variables will now be accessible via the Context%YourMemberVariableHere.

Note, there is a fair amount of work to do this, so weigh the work against the potential performance gain.

For some operating systems on IA32 and Intel64, they tie the Thread Local Storage to the descriptor of the FS or GS Selector. On these systems accessing TLS has almost no overhead (requires a segment override prefix to the instruction op code referencing the memory in TLS). For other systems, they require a round about way of locating the TLS, and this is performed by a function call (possibly __kmpc_threadprivate_cached and __kmpc_global_thread_num ) .

Jim Dempsey

jimdempseyatthecove · ‎12-04-2015

If/when you have multiple parallel regions, and you wish to pass the context from region to region you can place the pointer in TLS

*** but then use a local copy of the pointer in the parallel region ***

Jim Dempsey

Edgardo_Doerner · ‎12-14-2015

Hi Jim,

first sorry for the delay in the answer, but the last week was a "hell of a week" at my university (a.k.a. end of semester)

Well, first the original version of the code has the following Concurrency Analysis results in Amplifier XE:

So in that image it is clear that a huge portion of the overhead time is due to spin_rejection and ausgab subroutines. When I modified the RNG I obtained the following analysis at Amplifier XE:

I will try your approach of avoiding the TLS. By the way, this overhead does not happen with gfortran... for example, the overhead of adding OpenMP support for 1 thread is just about 1.2 %, that I suppose is quite acceptable. Gfortran handles TLS in a different way than IFORT?. Maybe I could just switch to gfortran, but I also want to use this code with Intel Xeon Phi accelerators, so I would like to be able to solve this issue... thanks for your help!.

jimdempseyatthecove · ‎12-14-2015

>>Gfortran handles TLS in a different way than IFORT?.

I suspect so.

In both screenshots, the call to __kmp_get_global_thread_id_reg is the predominant overhead.

Place what is currently in your thread local storage into a user defined type. Then create a thread local instance of this type.

type    TypeThreadContext
... ! your declarations here
end type TypeThreadContext

type(TypeThreadContext) :: ThreadContext
COMMON /tlsCONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/tlsCONTEXT/)

Then in the routines that use the context multiple times, declare a pointer to the thread context

type(TypeThreadContext), pointer :: Context
...
Context => ThreadContext

Then use Context to reference the member variables.

If you still have significant number of calls to __kmp_get_global_thread_id_reg, then consider obtaining the pointer to the thread context higher up in the call level and pass the pointer (reference) to the context object as part of the argument list on the calls.

Jim Dempsey

Edgardo_Doerner · ‎12-18-2015

Thanks for your answer... I am trying to implement your suggestion but I have problems with the use of pointers. I have really tried to find information about this without success... so the question is how to pass a pointer as an argument to a subroutine without loosing its association. For example, as you say I have created an user defined type containing the state of the RNG:

      TYPE RANMAR_T
      integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_
     *optimizer,rng_seed,rng_array(128)
      real*4 twom24
      END TYPE RANMAR_T
      
      COMMON/RNG/ RNG_STATE
C$OMP0THREADPRIVATE(/RNG/)
      TYPE(RANMAR_T), TARGET :: RNG_STATE

[...]

      TYPE(RANMAR_T), POINTER ::  RNG_P

Then inside de parallel region I associate a private copy for each thread of the pointer RNG_P with RNG_STATE. I am able to define the seeds of the RNG through the pointer

C$OMP0PARALLEL PRIVATE(omp_iam,RNG_P) COPYIN(/EGS_VR/,/EPCONT/)
C$OMP*COPYIN(/ET_CONTROL/,/PHOTIN/,/RANDOMM/,/UPHIOT/,/USEFUL/,/SCORE/)
[...]      
      RNG_P => RNG_STATE
      RNG_P%ixx = 1802
      RNG_P%jxx = 9373 + omp_iam
      
      call init_ranmar_test(RNG_P)
[...]

when entering init_ranmar_test subroutine the pointer lose its association with RNG_STATE, triggering an error during execution. This subroutine looks like this:

      subroutine init_ranmar_test(RNG_P)
      implicit none

      TYPE RANMAR_T
      integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_
     *optimizer,rng_seed,rng_array(128)
      real*4 twom24
      END TYPE RANMAR_T

      TYPE(RANMAR_T), POINTER :: RNG_P

      integer*4 s,t
      integer*4 i,j,k,l,m,ii,jj
      WRITE(*,*) ASSOCIATED(RNG_P)
      WRITE(*,*) RNG_P%ixx, RNG_P%jxx
      IF((RNG_P%ixx .LE. 0 .OR. RNG_P%ixx .GT. 31328))RNG_P%ixx = 1802
      IF((RNG_P%jxx .LE. 0 .OR. RNG_P%jxx .GT. 30081))RNG_P%jxx = 9373
      i = mod(RNG_P%ixx/177,177) + 2
      j = mod(RNG_P%ixx, 177) + 2
      k = mod(RNG_P%jxx/169,178) + 1
      l = mod(RNG_P%jxx, 169)
[...]

So the use with ASSOCIATED returns false and finally the program crashes if it is tried to use the pointer. I have tried to find an answer to how to properly give a pointer as an argument to a subroutine in Fortran without success, it seems a quite obscure topic and from my point of view (or lack of expertise) it is quite difficult to understand... thanks for your help!

jimdempseyatthecove · ‎12-18-2015

module RANMAR_MOD      
TYPE RANMAR_T
      integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_
     *optimizer,rng_seed,rng_array(128)
      real*4 twom24
      END TYPE RANMAR_T
      
      COMMON/RNG/ RNG_STATE
C$OMP0THREADPRIVATE(/RNG/)
      TYPE(RANMAR_T), POINTER ::  RNG_P
end module RANMAR_MOD

[...]
! once_only.f90 startup once only code (place call near startup)
subroutine once_only
  use RANMAR_MOD
!$OMP PARALLEL
ALLOCATE(RNG_P)
!$OMP END PARALLEL
end subroutine once_only

Then use:

C$OMP0PARALLEL PRIVATE(omp_iam) COPYIN(/EGS_VR/,/EPCONT/)
C$OMP*COPYIN(/ET_CONTROL/,/PHOTIN/,/RANDOMM/,/UPHIOT/,/USEFUL/,/SCORE/)
[...]      
      RNG_P%ixx = 1802 ! you can access via TLS pointer
      RNG_P%jxx = 9373 + omp_iam ! you can access via TLS pointer
      ! and/or
      call init_ranmar_test(RNG_P) ! produce reference to the object of TLS pointer
      ! note, above call calls __kmp_get_global_thread_id_reg once to produce reference
[...]

Then elsewhere

      subroutine init_ranmar_test(RNG) ! note, no "_P"
      implicit none
      use RANMAR_MOD ! expose RANMAR_T (and unused RNG_P in TLS)

      TYPE(RANMAR_T) :: RNG ! reference not pointer

      integer*4 s,t
      integer*4 i,j,k,l,m,ii,jj
      WRITE(*,*) RNG%ixx, RNG%jxx  ! note, no "_P"
      IF((RNG%ixx .LE. 0 .OR. RNG%ixx .GT. 31328))RNG%ixx = 1802
      IF((RNG%jxx .LE. 0 .OR. RNG%jxx .GT. 30081))RNG%jxx = 9373
      i = mod(RNG%ixx/177,177) + 2
      j = mod(RNG%ixx, 177) + 2
      k = mod(RNG%jxx/169,178) + 1
      l = mod(RNG%jxx, 169)
[...]

The above subroutine has does not call __kmp_get_global_thread_id_reg.

Jim Dempsey

Edgardo_Doerner · ‎12-18-2015

Hi Jim,

I was looking at your code and shouldn't be RNG_P inside the module also threadprivate?

I will introduce the changes, thanks for the help!

jimdempseyatthecove · ‎12-18-2015

You are correct, my mistake

module RANMAR_MOD      
TYPE RANMAR_T
      integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_
     *optimizer,rng_seed,rng_array(128)
      real*4 twom24
      END TYPE RANMAR_T
      
      TYPE(RANMAR_T), POINTER ::  RNG_P
      COMMON /RNG/ RNG_P
C$OMP0THREADPRIVATE(/RNG/)
      
end module RANMAR_MOD

Too quick at copy and paste

Thanks for pointing out my error

Jim Dempsey

Edgardo_Doerner · ‎12-28-2015

Dear Jim, I have introduced your suggested changes without success... I defined the following type:

      MODULE RANMAR_MOD
            
        TYPE RANMAR_T
          integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx,  
     *fool_optimizer,rng_seed,rng_array(128)
          real*4 twom24
        END TYPE RANMAR_T

        TYPE(RANMAR_T), POINTER :: RNG_P 
          COMMON/RNG/ RNG_P 
C$OMP0THREADPRIVATE(/RNG/)

      END MODULE RANMAR_MOD

and allocated the pointer at the beginning of the program using the following subroutine:

      SUBROUTINE ONCE_ONLY
        USE RANMAR_MOD

C$OMP0PARALLEL
        ALLOCATE(RNG_P)
C$OMP0END PARALLEL

      END SUBROUTINE ONCE_ONLY

then initialise it with the following:

      SUBROUTINE init_ranmar_test(RNG)
      USE RANMAR_MOD
      implicit none
      TYPE(RANMAR_T) :: RNG
      integer*4 s,t
      integer*4 i,j,k,l,m,ii,jj

      IF((RNG%ixx .LE. 0 .OR. RNG%ixx .GT. 31328))RNG%ixx = 1802
      IF((RNG%jxx .LE. 0 .OR. RNG%jxx .GT. 30081))RNG%jxx = 9373
      i = mod(RNG%ixx/177,177) + 2
      j = mod(RNG%ixx, 177) + 2
      k = mod(RNG%jxx/169,178) + 1
      l = mod(RNG%jxx, 169)
      DO 2631 ii=1,97
        s = 0 
        t = 8388608
        DO 2641 jj=1,24
          m = mod(mod(i*j,179)*k,179)
          IF (( RNG%fool_optimizer .EQ. 999 )) THEN
            write(6,*) i,j,k,m,s,t
          END IF
          i = j 
          IF (( RNG%fool_optimizer .EQ. 999 )) THEN
            write(6,*) i,j,k,m,s,t
          END IF
          j = k 
          IF (( RNG%fool_optimizer .EQ. 999 )) THEN
            write(6,*) i,j,k,m,s,t
          END IF
          k = m 
          IF (( RNG%fool_optimizer .EQ. 999 )) THEN
            write(6,*) i,j,k,m,s,t
          END IF
          l = mod(53*l+1,169)
          IF (( RNG%fool_optimizer .EQ. 999 )) THEN
            write(6,*) i,j,k,m,s,t
          END IF
          IF((mod(l*m,64) .GE. 32))s = s + t
          IF (( RNG%fool_optimizer .EQ. 999 )) THEN
            write(6,*) i,j,k,m,s,t
          END IF
          t = t/2
          IF (( RNG%fool_optimizer .EQ. 999 )) THEN
            write(6,*) i,j,k,m,s,t
          END IF
2641    CONTINUE
2642    CONTINUE
        RNG%urndm(ii) = s
2631  CONTINUE
2632  CONTINUE

and finally, in order to generate the random numbers I defined this:

      SUBROUTINE ranmar_get_test(RNG)
      USE RANMAR_MOD
      implicit none
      TYPE(RANMAR_T) :: RNG
      integer*4 i,iopt

      IF((RNG%rng_seed .EQ. 999999))call init_ranmar_test(RNG)
      DO 2651 i=1,128
        iopt = RNG%urndm(RNG%ixx) - RNG%urndm(RNG%jxx)
        IF((iopt .LT. 0))iopt = iopt + 16777216
        RNG%urndm(RNG%ixx) = iopt
        RNG%ixx = RNG%ixx - 1
        RNG%jxx = RNG%jxx - 1
        IF ((RNG%ixx .EQ. 0)) THEN
          RNG%ixx = 97
        ELSE IF(( RNG%jxx .EQ. 0 )) THEN
          RNG%jxx = 97
        END IF
        RNG%crndm = RNG%crndm - RNG%cdrndm
        IF((RNG%crndm .LT. 0))RNG%crndm = RNG%crndm + RNG%cmrndm
        iopt = iopt - RNG%crndm
        IF((iopt .LT. 0))iopt = iopt + 16777216
        RNG%rng_array(i) = iopt
2651  CONTINUE
2652  CONTINUE
      RNG%rng_seed = 1

      RETURN
      END SUBROUTINE ranmar_get_test

For example, the subroutine spin_rejection which uses this RNG looks like:

      REAL*8 FUNCTION spin_rejection_test(qel,medium,elke,beta2,q1,cost,
     *spin_index,is_single,RNG)
      USE RANMAR_MOD
      implicit none
      TYPE(RANMAR_T) :: RNG
      real*8 elke,beta2,q1,cost
      integer*4 qel,medium
      logical spin_index,is_single
      common/spin_data/ spin_rej(5,0:1,0: 31,0:15,0:31), espin_min,espin
     *_max,espml,b2spin_min,b2spin_max, dbeta2,dbeta2i,dlener,dleneri,dq
     *q1,dqq1i, fool_intel_optimizer
      real*4 spin_rej,espin_min,espin_max,espml,b2spin_min,b2spin_max, d
     *beta2,dbeta2i,dlener,dleneri,dqq1,dqq1i
      logical fool_intel_optimizer
      real*8 rnno,ai,qq1,aj,xi,ak
      integer*4 i,j,k
      save i,j
C$OMP0THREADPRIVATE(i,j)
      IF (( spin_index )) THEN
        spin_index = .false.
        IF (( beta2 .GE. b2spin_min )) THEN
          ai = (beta2 - b2spin_min)*dbeta2i
          i = ai
          ai = ai - i
          i = i + 15 + 1
        ELSE IF(( elke .GT. espml )) THEN
          ai = (elke - espml)*dleneri
          i = ai
          ai = ai - i
        ELSE
          i = 0
          ai = -1
        END IF
        IF((RNG%rng_seed .GT. 128)) THEN
          call ranmar_get_test(RNG)
        END IF
        rnno = RNG%rng_array(RNG%rng_seed)*RNG%twom24
        RNG%rng_seed = RNG%rng_seed + 1
        IF((rnno .LT. ai))i = i + 1
        IF (( is_single )) THEN
          j = 0
        ELSE
          qq1 = 2*q1
          qq1 = qq1/(1 + qq1)
          aj = qq1*dqq1i
          j = aj
          IF (( j .GE. 15 )) THEN
            j = 15
          ELSE
            aj = aj - j
            IF((RNG%rng_seed .GT. 128)) THEN
              call ranmar_get_test(RNG)              
            END IF
            rnno = RNG%rng_array(RNG%rng_seed)*RNG%twom24
            RNG%rng_seed = RNG%rng_seed + 1
            IF((rnno .LT. aj))j = j + 1
          END IF
        END IF
      END IF
      xi = Sqrt(0.5*(1-cost))
      ak = xi*31
      k = ak
      ak = ak - k
      spin_rejection_test = (1-ak)*spin_rej(medium,qel,i,j,k) + ak*spin_
     *rej(medium,qel,i,j,k+1)

      RETURN
      END FUNCTION spin_rejection_test

and well, looking at the amplifier analysis and the assembly code the "call ranmar_get..." sentences continues to be the problematic line. As you can see in the code fragments above, I have tried to obtain the pointer at different places in the call sequence (even at the main program!) and the situation is the same. Here a extract of the assembly shown by Vtune Amplifier XE for the first call to ranmar_get_test:

Address	Source Line	Assembly	Source Line	Assembly
0x420ee4	149	test %ebx, %ebx		[Unknown]
0x420ee6	149	jle 0x420ef0 <Block 126>		[Unknown]
0x420ee8		Block 125:		[Unknown]
0x420ee8	149	cmp $0x7a60, %ebx		[Unknown]
0x420eee	149	jle 0x420eff <Block 127>		[Unknown]
0x420ef0		Block 126:		[Unknown]
0x420ef0	149	movl  $0x70a, 0x194(%rcx)		[Unknown]
0x420efa	149	mov $0x70a, %ebx		[Unknown]
0x420eff		Block 127:		[Unknown]
0x420eff	149	movl  0x198(%rcx), %ebp		[Unknown]
0x420f05	149	test %ebp, %ebp		[Unknown]
0x420f07	149	jle 0x420f11 <Block 129>		[Unknown]
0x420f09		Block 128:		[Unknown]
0x420f09	149	cmp $0x7581, %ebp		[Unknown]
0x420f0f	149	jle 0x420f20 <Block 130>		[Unknown]
0x420f11		Block 129:		[Unknown]
0x420f11	149	movl  $0x249d, 0x198(%rcx)		[Unknown]
0x420f1b	149	mov $0x249d, %ebp		[Unknown]
0x420f20		Block 130:		[Unknown]
0x420f20	149	mov $0xb92143fb, %eax		[Unknown]
0x420f25	149	mov %ebx, %edi		[Unknown]
0x420f27	149	imul %ebx		[Unknown]
0x420f29	149	mov %edx, %esi		[Unknown]
0x420f2b	149	mov $0xb92143fb, %eax		[Unknown]
0x420f30	149	add %ebx, %esi		[Unknown]
0x420f32	149	sar $0x7, %esi		[Unknown]
0x420f35	149	sar $0x1f, %edi		[Unknown]
0x420f38	149	sub %edi, %esi		[Unknown]
0x420f3a	149	imul %esi		[Unknown]
0x420f3c	149	mov %esi, %r11d		[Unknown]
0x420f3f	149	add %esi, %edx		[Unknown]
0x420f41	149	sar $0x7, %edx		[Unknown]
0x420f44	149	sar $0x1f, %r11d		[Unknown]
0x420f48	149	sub %r11d, %edx		[Unknown]
0x420f4b	149	imul $0xffffff4f, %edx, %eax		[Unknown]
0x420f51	149	imul $0xffffff4f, %esi, %r10d		[Unknown]
0x420f58	149	movsdq  %xmm0, (%rsp)		[Unknown]
0x420f5d	149	movq  %r8, 0x18(%rsp)		[Unknown]
0x420f62	149	lea 0x2(%rsi,%rax,1), %r9d		[Unknown]
0x420f67	149	mov $0x60f25deb, %eax		[Unknown]
0x420f6c	149	lea 0x2(%rbx,%r10,1), %r10d		[Unknown]
0x420f71	149	imul %ebp		[Unknown]
0x420f73	149	mov %edx, %ebx		[Unknown]
0x420f75	149	mov %ebp, %esi		[Unknown]
0x420f77	149	sar $0x6, %ebx		[Unknown]
0x420f7a	149	mov $0xb81702e1, %eax		[Unknown]
0x420f7f	149	sar $0x1f, %esi		[Unknown]
0x420f82	149	sub %esi, %ebx		[Unknown]
0x420f84	149	xor %esi, %esi		[Unknown]
0x420f86	149	imul %ebx		[Unknown]
0x420f88	149	mov %ebx, %eax		[Unknown]
0x420f8a	149	add %ebx, %edx		[Unknown]
0x420f8c	149	sar $0x7, %edx		[Unknown]
0x420f8f	149	sar $0x1f, %eax		[Unknown]
0x420f92	149	sub %eax, %edx		[Unknown]
0x420f94	149	imul $0xffffff57, %ebx, %edi		[Unknown]
0x420f9a	149	imul $0xffffff4e, %edx, %r11d		[Unknown]
0x420fa1	149	movq  %r12, 0x1a0(%rsp)		[Unknown]
0x420fa9	149	add %edi, %ebp		[Unknown]
0x420fab	149	movl  0x19c(%rcx), %edi		[Unknown]
0x420fb1	149	lea 0x1(%rbx,%r11,1), %ebx		[Unknown]
0x420fb6	149	movq  %r13, 0x1a8(%rsp)		[Unknown]
0x420fbe	149	mov %r9d, %r13d		[Unknown]
0x420fc1	149	movq  %r14, 0x1b0(%rsp)		[Unknown]
0x420fc9	149	mov %r10d, %r12d		[Unknown]
0x420fcc	149	movq  %r15, 0x1b8(%rsp)		[Unknown]

[for a couple of hundred lines ... ]

If I simply remove the calling to ranmar_get and reset the rng_seed counter in order to re-use the array of random numbers (what of course is an aberration in a mathematical/physical point of view) the overhead disappears. I have attached the entire assembly of the spin_rejection_test subroutine if it could be helpful..., thanks for your help.

jimdempseyatthecove · ‎12-28-2015

In spin_ranmar_rejection_test, remove i,j from being declared as threadprivate and save, and place into the tread private RNG_P, and initialize at initialization of other data members. The Dummy argument RNG will then be used to reference its enclosed i an j. The de-reference of the Dummy RNG should be faster than the locating of the tls data then dereferencing that. To avoid confusion, I suggest you make the thread private COMMON name (holding RNG_P) as tlsRNG.

      MODULE RANMAR_MOD
            
        TYPE RANMAR_T
          integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx,  
     *fool_optimizer,rng_seed,rng_array(128)
          integer*4 :: i,j
          real*4 twom24
        END TYPE RANMAR_T

        TYPE(RANMAR_T), POINTER :: RNG_P 
          COMMON/tlsRNG/ RNG_P 
C$OMP0THREADPRIVATE(/tlsRNG/)

      END MODULE RANMAR_MOD

and

      REAL*8 FUNCTION spin_rejection_test(qel,medium,elke,beta2,q1,cost,
     *spin_index,is_single,RNG)
      USE RANMAR_MOD
      implicit none
      TYPE(RANMAR_T) :: RNG
      ...
C      integer*4 i,j,k
C      save i,j
C C$OMP0THREADPRIVATE(i,j)
        integer*4 k
      IF (( spin_index )) THEN
        spin_index = .false.
        IF (( beta2 .GE. b2spin_min )) THEN
          ai = (beta2 - b2spin_min)*dbeta2i
          RNG%i = ai
          ai = ai - RNG%i
          RNG%i =  RNG%i + 15 + 1
        ELSE IF(( elke .GT. espml )) THEN
          ai = (elke - espml)*dleneri
          RNG%i = ai
          ai = ai - RNG%i
        ELSE
          RNG%i = 0
          ai = -1
        END IF
... continue changing references to i and j

Note, you can use ASSOCIATE i => RNG%I And same with j (remember to END ASSOCIATE).

Jim Dempsey

Edgardo_Doerner · ‎12-30-2015

Jim,

I introduced your last suggestion and it finally worked!, now spin_rejection is not listed under the __kmp_get_global_thread_id_reg function overhead in Vtune Amplifier XE. I introduced similar changes into mscat and sscat subroutines and I also was able to eliminate their contribution to the overhead due to the __kmp_get_global_thread_id_reg function.

So now the only subroutine with a significant overhead is ausgab. The good thing is that the changes made to the code until now are minimal, so it absolutely worths the effort. The overhead when introducing OpenMP is now around 2.5% of the serial code time, far more decent than the 6.5% of the original implementation. I will analyse which modifications I have to introduce to the ausgab subroutine, but with your suggestions I have a good idea of what I have to do.

I really really appreciate your help, patience and guidance with this problem. Have a nice new year celebration!

jimdempseyatthecove · ‎12-31-2015

Glad to be of assistance.

Happy New Year to you too.

Jim Dempsey