- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all,
maybe some of you remember my first topic with this issue https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/532179. I am working with a code written in Fortran 77 and I was able to introduce parallelism using OpenMP. However, the
overhead effects are very high, especially when the Intel Fortran Compiler is used. For example, for a Intel Core i7 with 8 Threads I have the following execution times (in seconds):
Serial code: 811.3
ifort OpenMP, 1 thread: 864.4
ifort OpenMP, 8 threads: 169.1
So, you can see that the overhead introduced to the program when OpenMP is enabled is around 6.5% of the serial time, that seems excessive.
After some time and following the advices given in the first post I decided to give a second try to this issue. I profiled my program using Amplifier XE and the report shows a couple of subroutines that share most of the overhead time. For example, consider the following subroutine:
REAL*8 FUNCTION spin_rejection(qel,medium,elke,beta2,q1,cost, spin *_index,is_single) IMPLICIT NONE REAL*8 elke,beta2,q1,cost INTEGER*4 qel,medium LOGICAL spin_index,is_single COMMON/SPIN_DATA/ spin_rej(1,0:1,0: 31,0:15,0:31), espin_min,espin *_max,espml,b2spin_min,b2spin_max, dbeta2,dbeta2i,dlener,dleneri,dq *q1,dqq1i, fool_intel_optimizer REAL*4 spin_rej,espin_min,espin_max,espml,b2spin_min,b2spin_max, d *beta2,dbeta2i,dlener,dleneri,dqq1,dqq1i LOGICAL fool_intel_optimizer COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm, *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed C$OMP0THREADPRIVATE(/RANDOMM/) INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti *mizer,rng_seed,rng_array REAL*4 twom24 REAL*8 rnno,ai,qq1,aj,xi,ak INTEGER*4 i,j,k SAVE i,j C$OMP0THREADPRIVATE(i,j) IF (( spin_index )) THEN spin_index = .false. IF (( beta2 .GE. b2spin_min )) THEN ai = (beta2 - b2spin_min)*dbeta2i i = ai ai = ai - i i = i + 15 + 1 ELSE IF(( elke .GT. espml )) THEN ai = (elke - espml)*dleneri i = ai ai = ai - i ELSE i = 0 ai = -1 END IF IF((rng_seed .GT. 128))CALL ranmar_get rnno = rng_array(rng_seed)*twom24 rng_seed = rng_seed + 1 IF((rnno .LT. ai))i = i + 1 IF (( is_single )) THEN j = 0 ELSE qq1 = 2*q1 qq1 = qq1/(1 + qq1) aj = qq1*dqq1i j = aj IF (( j .GE. 15 )) THEN j = 15 ELSE aj = aj - j IF((rng_seed .GT. 128))CALL ranmar_get rnno = rng_array(rng_seed)*twom24 rng_seed = rng_seed + 1 IF((rnno .LT. aj))j = j + 1 END IF END IF END IF xi = SQRT(0.5*(1-cost)) ak = xi*31 k = ak ak = ak - k spin_rejection = (1-ak)*spin_rej(medium,qel,i,j,k) + ak*spin_rej(m *edium,qel,i,j,k+1) RETURN END
The analysis shows that a lot of overhead occurs when the ranmar_get subroutine is called. It corresponds to:
SUBROUTINE ranmar_get IMPLICIT NONE COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm, *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed C$OMP0THREADPRIVATE(/RANDOMM/) INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti *mizer,rng_seed,rng_array REAL*4 twom24 INTEGER*4 i,iopt IF((rng_seed .EQ. 999999))CALL init_ranmar DO 2511 i=1,128 iopt = urndm(ixx) - urndm(jxx) IF((iopt .LT. 0))iopt = iopt + 16777216 urndm(ixx) = iopt ixx = ixx - 1 jxx = jxx - 1 IF ((ixx .EQ. 0)) THEN ixx = 97 ELSE IF(( jxx .EQ. 0 )) THEN jxx = 97 END IF crndm = crndm - cdrndm IF((crndm .LT. 0))crndm = crndm + cmrndm iopt = iopt - crndm IF((iopt .LT. 0))iopt = iopt + 16777216 rng_array(i) = iopt 2511 CONTINUE 2512 CONTINUE rng_seed = 1 RETURN END
Well, I decided to give a look into the assembly of the spin_rejection function and, for example, at the line when ranmar_get is first called I obtained the following:
0x4202fd Block 196:
0x4202f8 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202f0 9,019 lea 0x340(%rsp), %rdx
0x4202e9 9,019 movl %r13d, 0x2d0(%rdi)
0x4202e4 9,019 lea 0x70(%rsp), %rdi
0x4202df 9,019 mov $0x480b14, %esi
0x4202df Block 195:
0x4202da 9,019 callq 0x401bc0 <for_write_seq_lis_xmit>
0x4202d2 9,019 lea 0x320(%rsp), %rdx
0x4202cb 9,019 movl %r15d, 0x2b0(%rdi)
0x4202c6 9,019 lea 0x70(%rsp), %rdi
0x4202c1 9,019 mov $0x480b0c, %esi
[...]
that pattern repeats through 500 lines of the assembly code... At the exit of the spin_rejection function I have several calls to the functions __kmpc_threadprivate_cached and __kmpc_global_thread_num, with a huge overhead time for the latter.
For some reason and remembering that someone points out that maybe is something related to bounds checking of arrays so I decided to reformat the RNG to avoid any use of arrays. Essentially now the generator is called each time a random number is needed and not each a certain number of expended random numbers. So now I have the following version of the RNG:
SUBROUTINE ranmar_get IMPLICIT NONE COMMON/RANDOMM/ rng_array, urndm(97), crndm, cdrndm, cmrndm, i4opt *, ixx, jxx, fool_optimizer, twom24, rng_seed C$OMP0THREADPRIVATE(/RANDOMM/) INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti *mizer,rng_seed,rng_array REAL*4 twom24 INTEGER*4 iopt iopt = urndm(ixx) - urndm(jxx) IF((iopt .LT. 0))iopt = iopt + 16777216 urndm(ixx) = iopt ixx = ixx - 1 jxx = jxx - 1 IF ((ixx .EQ. 0)) THEN ixx = 97 ELSE IF(( jxx .EQ. 0 )) THEN jxx = 97 END IF crndm = crndm - cdrndm IF((crndm .LT. 0))crndm = crndm + cmrndm iopt = iopt - crndm IF((iopt .LT. 0))iopt = iopt + 16777216 rng_array = iopt RETURN END
the main change is that now rng_array is a single integer variable and not an array of a certain number of elements. So now the calling sequence in spin_rejection is simple CALL ranmar_get and I obtain the following assembly code for that section:
0x417ceb 8,959 sub %eax, %r8d 0x417ceb Block 25: 0x417ce4 8,959 addl 0x190(%r15), %eax 0x417ce4 Block 24: 0x417ce2 8,959 jns 0x417ceb <Block 25> 0x417ce0 8,959 sub %ebp, %eax 0x417cd9 8,959 movl 0x188(%r15), %eax 0x417cd2 8,959 movl 0x18c(%r15), %ebp 0x417cd2 Block 23: 0x417ccf 8,959 cmovz %eax, %ecx 0x417ccd 8,959 test %ecx, %ecx 0x417cc8 8,959 mov $0x61, %eax 0x417cc8 Block 22: 0x417cc6 8,959 jmp 0x417cd2 <Block 23> 0x417cc1 8,959 mov $0x61, %edx 0x417cc1 Block 21: 0x417cbf 8,959 jnz 0x417cc8 <Block 22> 0x417cbd 8,959 dec %edx 0x417cb9 8,959 movl %r8d, (%r15,%rdx,4) 0x417cb5 8,959 cmovs %eax, %r8d 0x417cae 8,959 lea 0x1000000(%r8), %eax 0x417cab 8,959 test %r8d, %r8d 0x417ca9 8,959 dec %ecx 0x417ca5 8,959 subl (%r15,%rcx,4), %r8d 0x417ca1 8,959 movl (%r15,%rdx,4), %r8d 0x417c9e 8,959 movsxd %ecx, %rcx 0x417c97 8,959 movl 0x19c(%r15), %ecx 0x417c94 8,959 movsxd %edx, %rdx 0x417c8d 8,959 movl 0x198(%r15), %edx
that is the entire assembly code for the first call of ranmar_get. Well, it is much simpler that the original version. The overhead cost of the call to ranmar_get inside spin_rejection also decreases heavily. __kmpc_threadprivate_cached and __kmpc_global_thread_num are also called at the end of the spin_rejection subroutine, but the overhead cost is much lower. I have starting believing that the use of arrays in the RNG is the source, in some way, of the overhead when OpenMP is enabled within the ifort compiler. Well, I have even tested other RNG with much simpler structures and I do not have any overhead cost pointed out by Amplifier XE. Even __kmpc_threadprivate_cached and __kmpc_global_thread_num are not called at all at the end of the spin_rejection subroutine.
I would like to know if something related with arrays could be the source of the overhead costs in my program, and if there is a way to alleviate this problem. I am not a programming expert so it is quite difficult to diagnose this problem and it limits the scalability of my program.
Thanks for your help!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In spin_ranmar_rejection_test, remove i,j from being declared as threadprivate and save, and place into the tread private RNG_P, and initialize at initialization of other data members. The Dummy argument RNG will then be used to reference its enclosed i an j. The de-reference of the Dummy RNG should be faster than the locating of the tls data then dereferencing that. To avoid confusion, I suggest you make the thread private COMMON name (holding RNG_P) as tlsRNG.
MODULE RANMAR_MOD TYPE RANMAR_T integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, *fool_optimizer,rng_seed,rng_array(128) integer*4 :: i,j real*4 twom24 END TYPE RANMAR_T TYPE(RANMAR_T), POINTER :: RNG_P COMMON/tlsRNG/ RNG_P C$OMP0THREADPRIVATE(/tlsRNG/) END MODULE RANMAR_MOD
and
REAL*8 FUNCTION spin_rejection_test(qel,medium,elke,beta2,q1,cost, *spin_index,is_single,RNG) USE RANMAR_MOD implicit none TYPE(RANMAR_T) :: RNG ... C integer*4 i,j,k C save i,j C C$OMP0THREADPRIVATE(i,j) integer*4 k IF (( spin_index )) THEN spin_index = .false. IF (( beta2 .GE. b2spin_min )) THEN ai = (beta2 - b2spin_min)*dbeta2i RNG%i = ai ai = ai - RNG%i RNG%i = RNG%i + 15 + 1 ELSE IF(( elke .GT. espml )) THEN ai = (elke - espml)*dleneri RNG%i = ai ai = ai - RNG%i ELSE RNG%i = 0 ai = -1 END IF ... continue changing references to i and j
Note, you can use ASSOCIATE i => RNG%I And same with j (remember to END ASSOCIATE).
Jim Dempsey
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I do not see anything in your provided code that is calling for_write_seq_lis_xmit.
It would be helpful if you can show your timed section (you can excise code unrelated to OpenMP region entry/exit/control)
If you have issues with overhead of functions __kmpc_threadprivate_cached and __kmpc_global_thread_num then I suggest you change your strategy from using Thread Local Storage for your random number generator to using an array of user defined types (containing what was in your COMMON TLS block), and then presumably you have a loop in your parallel region, separate the start of the parallel region from the parallel loop and obtain the OpenMP global thread number once outside the parallel loop. This will reduce the overhead to once per loop (per thread) from once per iteration of loop.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
This is the (only) parallel region of my code:
C$OMP0PARALLEL PRIVATE(omp_iam) COPYIN(/EGS_VR/,/EPCONT/) C$OMP*COPYIN(/ET_CONTROL/,/PHOTIN/,/RANDOMM/,/UPHIOT/,/USEFUL/,/SCORE/) C$OMP*NUM_THREADS(omp_tot) #ifdef _OPENMP omp_iam = OMP_GET_THREAD_NUM() #endif ixx = 1802 + omp_iam jxx = 9373 CALL init_ranmar C$OMP0DO SCHEDULE(static) DO 1051 omp_icase=1,ncase icase = omp_icase CALL shower(iqin,ein,xin,yin,zin,uin,vin,win,irin,wtin) C$OMP0CRITICAL (PROGRESS) ibatch = ibatch + 1 IF (( MOD(ibatch,nperbatch) .EQ. 0)) THEN WRITE(6,'(a,i2,a,i2)') '+ Finished batch ',ibatch/nperbatch, ' out * of ',nbatch END IF C$OMP0END CRITICAL (PROGRESS) 1051 CONTINUE 1052 CONTINUE C$OMP0END PARALLEL
So I have already separated the start of the parallel region from the parallel loop. The shower function then calls all the needed functions and subroutines, among them spin_rejection.
About the use of the TLS in the RNG, do you mean to create a structure like the following:
TYPE RANDOMM INTEGER urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_optimizer, rng_seed, rng_array(128) REAL twom24 END TYPE RANDOMM
and then declare an array, where each element is used by one of the executing threads?.
About the call to for_write_seq_lis_xmit, maybe I can put the code of the subroutine from where spin_rejection is called:
SUBROUTINE sscat(chia2,elke,beta2,qel,medium,spin_effects,cost,sin *t) IMPLICIT NONE REAL*8 chia2,elke,beta2,cost,sint INTEGER*4 qel,medium LOGICAL spin_effects COMMON/RANDOMM/ rng_array(128), urndm(97), crndm, cdrndm, cmrndm, *i4opt, ixx, jxx, fool_optimizer, twom24, rng_seed C$OMP0THREADPRIVATE(/RANDOMM/) INTEGER*4 urndm, crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_opti *mizer,rng_seed,rng_array REAL*4 twom24 REAL*8 xi,rnno,rejf,spin_rejection,qzero LOGICAL spin_index spin_index = .true. 5080 CONTINUE IF((rng_seed .GT. 128))CALL ranmar_get xi = rng_array(rng_seed)*twom24 rng_seed = rng_seed + 1 xi = 2*chia2*xi/(1 - xi + chia2) cost = 1 - xi IF (( spin_effects )) THEN qzero=0 rejf = spin_rejection(qel,medium,elke,beta2,qzero,cost,spin_index, *.true.) IF((rng_seed .GT. 128))CALL ranmar_get rnno = rng_array(rng_seed)*twom24 rng_seed = rng_seed + 1 IF((rnno .GT. rejf))GOTO 5080 END IF sint = SQRT(xi*(2 - xi)) RETURN END
I have also attached the entire code, if that could be of help. Thanks for your help!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What does Amplifier(VTune) now show as the leading cause for overhead?
If the OpenMP related overhead is insignificant then there is nothing more to do.
If significant overhead continues to be functions __kmpc_threadprivate_cached and __kmpc_global_thread_num then I suggest you change your strategy from using Thread Local Storage to using a user defined types (containing what was in your COMMON TLS blocks).
I suggest adding a module that defines a user defined type containing what was in your commons that were thread private. This will require making them not COMMON. You might also want to place into the module the shared variables (not as COMMON). As an example you could call the user defined type as Context_t Then adding to your main program a USE of that module and pointer (or allocatable) of that type.
type(Context_t), pointer :: Context => NULL() ! try pointer first, then switch over to ALLOCATABLE later
and, add that to the PRIVATE clause and COPYIN of the (only) parallel region.
and, inside the parallel region but before the parallel DO, add an ALLOCATE(Context) so each thread has its own Context.
Then add Context as the first argument to the subroutine call lists, and all routines beneath those that require the context.
You will have to modify the called routines that require the context to have the Context listed, and a USE for that context.
The member variables will now be accessible via the Context%YourMemberVariableHere.
Note, there is a fair amount of work to do this, so weigh the work against the potential performance gain.
For some operating systems on IA32 and Intel64, they tie the Thread Local Storage to the descriptor of the FS or GS Selector. On these systems accessing TLS has almost no overhead (requires a segment override prefix to the instruction op code referencing the memory in TLS). For other systems, they require a round about way of locating the TLS, and this is performed by a function call (possibly __kmpc_threadprivate_cached and __kmpc_global_thread_num ) .
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If/when you have multiple parallel regions, and you wish to pass the context from region to region you can place the pointer in TLS
*** but then use a local copy of the pointer in the parallel region ***
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
first sorry for the delay in the answer, but the last week was a "hell of a week" at my university (a.k.a. end of semester)
Well, first the original version of the code has the following Concurrency Analysis results in Amplifier XE:
So in that image it is clear that a huge portion of the overhead time is due to spin_rejection and ausgab subroutines. When I modified the RNG I obtained the following analysis at Amplifier XE:
I will try your approach of avoiding the TLS. By the way, this overhead does not happen with gfortran... for example, the overhead of adding OpenMP support for 1 thread is just about 1.2 %, that I suppose is quite acceptable. Gfortran handles TLS in a different way than IFORT?. Maybe I could just switch to gfortran, but I also want to use this code with Intel Xeon Phi accelerators, so I would like to be able to solve this issue... thanks for your help!.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>Gfortran handles TLS in a different way than IFORT?.
I suspect so.
In both screenshots, the call to __kmp_get_global_thread_id_reg is the predominant overhead.
Place what is currently in your thread local storage into a user defined type. Then create a thread local instance of this type.
type TypeThreadContext ... ! your declarations here end type TypeThreadContext type(TypeThreadContext) :: ThreadContext COMMON /tlsCONTEXT/ ThreadContext !$OMP THREADPRIVATE(/tlsCONTEXT/)
Then in the routines that use the context multiple times, declare a pointer to the thread context
type(TypeThreadContext), pointer :: Context ... Context => ThreadContext
Then use Context to reference the member variables.
If you still have significant number of calls to __kmp_get_global_thread_id_reg, then consider obtaining the pointer to the thread context higher up in the call level and pass the pointer (reference) to the context object as part of the argument list on the calls.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your answer... I am trying to implement your suggestion but I have problems with the use of pointers. I have really tried to find information about this without success... so the question is how to pass a pointer as an argument to a subroutine without loosing its association. For example, as you say I have created an user defined type containing the state of the RNG:
TYPE RANMAR_T integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_ *optimizer,rng_seed,rng_array(128) real*4 twom24 END TYPE RANMAR_T COMMON/RNG/ RNG_STATE C$OMP0THREADPRIVATE(/RNG/) TYPE(RANMAR_T), TARGET :: RNG_STATE [...] TYPE(RANMAR_T), POINTER :: RNG_P
Then inside de parallel region I associate a private copy for each thread of the pointer RNG_P with RNG_STATE. I am able to define the seeds of the RNG through the pointer
C$OMP0PARALLEL PRIVATE(omp_iam,RNG_P) COPYIN(/EGS_VR/,/EPCONT/) C$OMP*COPYIN(/ET_CONTROL/,/PHOTIN/,/RANDOMM/,/UPHIOT/,/USEFUL/,/SCORE/) [...] RNG_P => RNG_STATE RNG_P%ixx = 1802 RNG_P%jxx = 9373 + omp_iam call init_ranmar_test(RNG_P) [...]
when entering init_ranmar_test subroutine the pointer lose its association with RNG_STATE, triggering an error during execution. This subroutine looks like this:
subroutine init_ranmar_test(RNG_P) implicit none TYPE RANMAR_T integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_ *optimizer,rng_seed,rng_array(128) real*4 twom24 END TYPE RANMAR_T TYPE(RANMAR_T), POINTER :: RNG_P integer*4 s,t integer*4 i,j,k,l,m,ii,jj WRITE(*,*) ASSOCIATED(RNG_P) WRITE(*,*) RNG_P%ixx, RNG_P%jxx IF((RNG_P%ixx .LE. 0 .OR. RNG_P%ixx .GT. 31328))RNG_P%ixx = 1802 IF((RNG_P%jxx .LE. 0 .OR. RNG_P%jxx .GT. 30081))RNG_P%jxx = 9373 i = mod(RNG_P%ixx/177,177) + 2 j = mod(RNG_P%ixx, 177) + 2 k = mod(RNG_P%jxx/169,178) + 1 l = mod(RNG_P%jxx, 169) [...]
So the use with ASSOCIATED returns false and finally the program crashes if it is tried to use the pointer. I have tried to find an answer to how to properly give a pointer as an argument to a subroutine in Fortran without success, it seems a quite obscure topic and from my point of view (or lack of expertise) it is quite difficult to understand... thanks for your help!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
module RANMAR_MOD TYPE RANMAR_T integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_ *optimizer,rng_seed,rng_array(128) real*4 twom24 END TYPE RANMAR_T COMMON/RNG/ RNG_STATE C$OMP0THREADPRIVATE(/RNG/) TYPE(RANMAR_T), POINTER :: RNG_P end module RANMAR_MOD [...] ! once_only.f90 startup once only code (place call near startup) subroutine once_only use RANMAR_MOD !$OMP PARALLEL ALLOCATE(RNG_P) !$OMP END PARALLEL end subroutine once_only
Then use:
C$OMP0PARALLEL PRIVATE(omp_iam) COPYIN(/EGS_VR/,/EPCONT/) C$OMP*COPYIN(/ET_CONTROL/,/PHOTIN/,/RANDOMM/,/UPHIOT/,/USEFUL/,/SCORE/) [...] RNG_P%ixx = 1802 ! you can access via TLS pointer RNG_P%jxx = 9373 + omp_iam ! you can access via TLS pointer ! and/or call init_ranmar_test(RNG_P) ! produce reference to the object of TLS pointer ! note, above call calls __kmp_get_global_thread_id_reg once to produce reference [...]
Then elsewhere
subroutine init_ranmar_test(RNG) ! note, no "_P" implicit none use RANMAR_MOD ! expose RANMAR_T (and unused RNG_P in TLS) TYPE(RANMAR_T) :: RNG ! reference not pointer integer*4 s,t integer*4 i,j,k,l,m,ii,jj WRITE(*,*) RNG%ixx, RNG%jxx ! note, no "_P" IF((RNG%ixx .LE. 0 .OR. RNG%ixx .GT. 31328))RNG%ixx = 1802 IF((RNG%jxx .LE. 0 .OR. RNG%jxx .GT. 30081))RNG%jxx = 9373 i = mod(RNG%ixx/177,177) + 2 j = mod(RNG%ixx, 177) + 2 k = mod(RNG%jxx/169,178) + 1 l = mod(RNG%jxx, 169) [...]
The above subroutine has does not call __kmp_get_global_thread_id_reg.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
I was looking at your code and shouldn't be RNG_P inside the module also threadprivate?
I will introduce the changes, thanks for the help!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You are correct, my mistake
module RANMAR_MOD TYPE RANMAR_T integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, fool_ *optimizer,rng_seed,rng_array(128) real*4 twom24 END TYPE RANMAR_T TYPE(RANMAR_T), POINTER :: RNG_P COMMON /RNG/ RNG_P C$OMP0THREADPRIVATE(/RNG/) end module RANMAR_MOD
Too quick at copy and paste
Thanks for pointing out my error
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear Jim, I have introduced your suggested changes without success... I defined the following type:
MODULE RANMAR_MOD TYPE RANMAR_T integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, *fool_optimizer,rng_seed,rng_array(128) real*4 twom24 END TYPE RANMAR_T TYPE(RANMAR_T), POINTER :: RNG_P COMMON/RNG/ RNG_P C$OMP0THREADPRIVATE(/RNG/) END MODULE RANMAR_MOD
and allocated the pointer at the beginning of the program using the following subroutine:
SUBROUTINE ONCE_ONLY USE RANMAR_MOD C$OMP0PARALLEL ALLOCATE(RNG_P) C$OMP0END PARALLEL END SUBROUTINE ONCE_ONLY
then initialise it with the following:
SUBROUTINE init_ranmar_test(RNG) USE RANMAR_MOD implicit none TYPE(RANMAR_T) :: RNG integer*4 s,t integer*4 i,j,k,l,m,ii,jj IF((RNG%ixx .LE. 0 .OR. RNG%ixx .GT. 31328))RNG%ixx = 1802 IF((RNG%jxx .LE. 0 .OR. RNG%jxx .GT. 30081))RNG%jxx = 9373 i = mod(RNG%ixx/177,177) + 2 j = mod(RNG%ixx, 177) + 2 k = mod(RNG%jxx/169,178) + 1 l = mod(RNG%jxx, 169) DO 2631 ii=1,97 s = 0 t = 8388608 DO 2641 jj=1,24 m = mod(mod(i*j,179)*k,179) IF (( RNG%fool_optimizer .EQ. 999 )) THEN write(6,*) i,j,k,m,s,t END IF i = j IF (( RNG%fool_optimizer .EQ. 999 )) THEN write(6,*) i,j,k,m,s,t END IF j = k IF (( RNG%fool_optimizer .EQ. 999 )) THEN write(6,*) i,j,k,m,s,t END IF k = m IF (( RNG%fool_optimizer .EQ. 999 )) THEN write(6,*) i,j,k,m,s,t END IF l = mod(53*l+1,169) IF (( RNG%fool_optimizer .EQ. 999 )) THEN write(6,*) i,j,k,m,s,t END IF IF((mod(l*m,64) .GE. 32))s = s + t IF (( RNG%fool_optimizer .EQ. 999 )) THEN write(6,*) i,j,k,m,s,t END IF t = t/2 IF (( RNG%fool_optimizer .EQ. 999 )) THEN write(6,*) i,j,k,m,s,t END IF 2641 CONTINUE 2642 CONTINUE RNG%urndm(ii) = s 2631 CONTINUE 2632 CONTINUE
and finally, in order to generate the random numbers I defined this:
SUBROUTINE ranmar_get_test(RNG) USE RANMAR_MOD implicit none TYPE(RANMAR_T) :: RNG integer*4 i,iopt IF((RNG%rng_seed .EQ. 999999))call init_ranmar_test(RNG) DO 2651 i=1,128 iopt = RNG%urndm(RNG%ixx) - RNG%urndm(RNG%jxx) IF((iopt .LT. 0))iopt = iopt + 16777216 RNG%urndm(RNG%ixx) = iopt RNG%ixx = RNG%ixx - 1 RNG%jxx = RNG%jxx - 1 IF ((RNG%ixx .EQ. 0)) THEN RNG%ixx = 97 ELSE IF(( RNG%jxx .EQ. 0 )) THEN RNG%jxx = 97 END IF RNG%crndm = RNG%crndm - RNG%cdrndm IF((RNG%crndm .LT. 0))RNG%crndm = RNG%crndm + RNG%cmrndm iopt = iopt - RNG%crndm IF((iopt .LT. 0))iopt = iopt + 16777216 RNG%rng_array(i) = iopt 2651 CONTINUE 2652 CONTINUE RNG%rng_seed = 1 RETURN END SUBROUTINE ranmar_get_test
For example, the subroutine spin_rejection which uses this RNG looks like:
REAL*8 FUNCTION spin_rejection_test(qel,medium,elke,beta2,q1,cost, *spin_index,is_single,RNG) USE RANMAR_MOD implicit none TYPE(RANMAR_T) :: RNG real*8 elke,beta2,q1,cost integer*4 qel,medium logical spin_index,is_single common/spin_data/ spin_rej(5,0:1,0: 31,0:15,0:31), espin_min,espin *_max,espml,b2spin_min,b2spin_max, dbeta2,dbeta2i,dlener,dleneri,dq *q1,dqq1i, fool_intel_optimizer real*4 spin_rej,espin_min,espin_max,espml,b2spin_min,b2spin_max, d *beta2,dbeta2i,dlener,dleneri,dqq1,dqq1i logical fool_intel_optimizer real*8 rnno,ai,qq1,aj,xi,ak integer*4 i,j,k save i,j C$OMP0THREADPRIVATE(i,j) IF (( spin_index )) THEN spin_index = .false. IF (( beta2 .GE. b2spin_min )) THEN ai = (beta2 - b2spin_min)*dbeta2i i = ai ai = ai - i i = i + 15 + 1 ELSE IF(( elke .GT. espml )) THEN ai = (elke - espml)*dleneri i = ai ai = ai - i ELSE i = 0 ai = -1 END IF IF((RNG%rng_seed .GT. 128)) THEN call ranmar_get_test(RNG) END IF rnno = RNG%rng_array(RNG%rng_seed)*RNG%twom24 RNG%rng_seed = RNG%rng_seed + 1 IF((rnno .LT. ai))i = i + 1 IF (( is_single )) THEN j = 0 ELSE qq1 = 2*q1 qq1 = qq1/(1 + qq1) aj = qq1*dqq1i j = aj IF (( j .GE. 15 )) THEN j = 15 ELSE aj = aj - j IF((RNG%rng_seed .GT. 128)) THEN call ranmar_get_test(RNG) END IF rnno = RNG%rng_array(RNG%rng_seed)*RNG%twom24 RNG%rng_seed = RNG%rng_seed + 1 IF((rnno .LT. aj))j = j + 1 END IF END IF END IF xi = Sqrt(0.5*(1-cost)) ak = xi*31 k = ak ak = ak - k spin_rejection_test = (1-ak)*spin_rej(medium,qel,i,j,k) + ak*spin_ *rej(medium,qel,i,j,k+1) RETURN END FUNCTION spin_rejection_test
and well, looking at the amplifier analysis and the assembly code the "call ranmar_get..." sentences continues to be the problematic line. As you can see in the code fragments above, I have tried to obtain the pointer at different places in the call sequence (even at the main program!) and the situation is the same. Here a extract of the assembly shown by Vtune Amplifier XE for the first call to ranmar_get_test:
Address Source Line Assembly Source Line Assembly 0x420ee4 149 test %ebx, %ebx [Unknown] 0x420ee6 149 jle 0x420ef0 <Block 126> [Unknown] 0x420ee8 Block 125: [Unknown] 0x420ee8 149 cmp $0x7a60, %ebx [Unknown] 0x420eee 149 jle 0x420eff <Block 127> [Unknown] 0x420ef0 Block 126: [Unknown] 0x420ef0 149 movl $0x70a, 0x194(%rcx) [Unknown] 0x420efa 149 mov $0x70a, %ebx [Unknown] 0x420eff Block 127: [Unknown] 0x420eff 149 movl 0x198(%rcx), %ebp [Unknown] 0x420f05 149 test %ebp, %ebp [Unknown] 0x420f07 149 jle 0x420f11 <Block 129> [Unknown] 0x420f09 Block 128: [Unknown] 0x420f09 149 cmp $0x7581, %ebp [Unknown] 0x420f0f 149 jle 0x420f20 <Block 130> [Unknown] 0x420f11 Block 129: [Unknown] 0x420f11 149 movl $0x249d, 0x198(%rcx) [Unknown] 0x420f1b 149 mov $0x249d, %ebp [Unknown] 0x420f20 Block 130: [Unknown] 0x420f20 149 mov $0xb92143fb, %eax [Unknown] 0x420f25 149 mov %ebx, %edi [Unknown] 0x420f27 149 imul %ebx [Unknown] 0x420f29 149 mov %edx, %esi [Unknown] 0x420f2b 149 mov $0xb92143fb, %eax [Unknown] 0x420f30 149 add %ebx, %esi [Unknown] 0x420f32 149 sar $0x7, %esi [Unknown] 0x420f35 149 sar $0x1f, %edi [Unknown] 0x420f38 149 sub %edi, %esi [Unknown] 0x420f3a 149 imul %esi [Unknown] 0x420f3c 149 mov %esi, %r11d [Unknown] 0x420f3f 149 add %esi, %edx [Unknown] 0x420f41 149 sar $0x7, %edx [Unknown] 0x420f44 149 sar $0x1f, %r11d [Unknown] 0x420f48 149 sub %r11d, %edx [Unknown] 0x420f4b 149 imul $0xffffff4f, %edx, %eax [Unknown] 0x420f51 149 imul $0xffffff4f, %esi, %r10d [Unknown] 0x420f58 149 movsdq %xmm0, (%rsp) [Unknown] 0x420f5d 149 movq %r8, 0x18(%rsp) [Unknown] 0x420f62 149 lea 0x2(%rsi,%rax,1), %r9d [Unknown] 0x420f67 149 mov $0x60f25deb, %eax [Unknown] 0x420f6c 149 lea 0x2(%rbx,%r10,1), %r10d [Unknown] 0x420f71 149 imul %ebp [Unknown] 0x420f73 149 mov %edx, %ebx [Unknown] 0x420f75 149 mov %ebp, %esi [Unknown] 0x420f77 149 sar $0x6, %ebx [Unknown] 0x420f7a 149 mov $0xb81702e1, %eax [Unknown] 0x420f7f 149 sar $0x1f, %esi [Unknown] 0x420f82 149 sub %esi, %ebx [Unknown] 0x420f84 149 xor %esi, %esi [Unknown] 0x420f86 149 imul %ebx [Unknown] 0x420f88 149 mov %ebx, %eax [Unknown] 0x420f8a 149 add %ebx, %edx [Unknown] 0x420f8c 149 sar $0x7, %edx [Unknown] 0x420f8f 149 sar $0x1f, %eax [Unknown] 0x420f92 149 sub %eax, %edx [Unknown] 0x420f94 149 imul $0xffffff57, %ebx, %edi [Unknown] 0x420f9a 149 imul $0xffffff4e, %edx, %r11d [Unknown] 0x420fa1 149 movq %r12, 0x1a0(%rsp) [Unknown] 0x420fa9 149 add %edi, %ebp [Unknown] 0x420fab 149 movl 0x19c(%rcx), %edi [Unknown] 0x420fb1 149 lea 0x1(%rbx,%r11,1), %ebx [Unknown] 0x420fb6 149 movq %r13, 0x1a8(%rsp) [Unknown] 0x420fbe 149 mov %r9d, %r13d [Unknown] 0x420fc1 149 movq %r14, 0x1b0(%rsp) [Unknown] 0x420fc9 149 mov %r10d, %r12d [Unknown] 0x420fcc 149 movq %r15, 0x1b8(%rsp) [Unknown] [for a couple of hundred lines ... ]
If I simply remove the calling to ranmar_get and reset the rng_seed counter in order to re-use the array of random numbers (what of course is an aberration in a mathematical/physical point of view) the overhead disappears. I have attached the entire assembly of the spin_rejection_test subroutine if it could be helpful..., thanks for your help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In spin_ranmar_rejection_test, remove i,j from being declared as threadprivate and save, and place into the tread private RNG_P, and initialize at initialization of other data members. The Dummy argument RNG will then be used to reference its enclosed i an j. The de-reference of the Dummy RNG should be faster than the locating of the tls data then dereferencing that. To avoid confusion, I suggest you make the thread private COMMON name (holding RNG_P) as tlsRNG.
MODULE RANMAR_MOD TYPE RANMAR_T integer*4 urndm(97), crndm, cdrndm, cmrndm, i4opt, ixx, jxx, *fool_optimizer,rng_seed,rng_array(128) integer*4 :: i,j real*4 twom24 END TYPE RANMAR_T TYPE(RANMAR_T), POINTER :: RNG_P COMMON/tlsRNG/ RNG_P C$OMP0THREADPRIVATE(/tlsRNG/) END MODULE RANMAR_MOD
and
REAL*8 FUNCTION spin_rejection_test(qel,medium,elke,beta2,q1,cost, *spin_index,is_single,RNG) USE RANMAR_MOD implicit none TYPE(RANMAR_T) :: RNG ... C integer*4 i,j,k C save i,j C C$OMP0THREADPRIVATE(i,j) integer*4 k IF (( spin_index )) THEN spin_index = .false. IF (( beta2 .GE. b2spin_min )) THEN ai = (beta2 - b2spin_min)*dbeta2i RNG%i = ai ai = ai - RNG%i RNG%i = RNG%i + 15 + 1 ELSE IF(( elke .GT. espml )) THEN ai = (elke - espml)*dleneri RNG%i = ai ai = ai - RNG%i ELSE RNG%i = 0 ai = -1 END IF ... continue changing references to i and j
Note, you can use ASSOCIATE i => RNG%I And same with j (remember to END ASSOCIATE).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
I introduced your last suggestion and it finally worked!, now spin_rejection is not listed under the __kmp_get_global_thread_id_reg function overhead in Vtune Amplifier XE. I introduced similar changes into mscat and sscat subroutines and I also was able to eliminate their contribution to the overhead due to the __kmp_get_global_thread_id_reg function.
So now the only subroutine with a significant overhead is ausgab. The good thing is that the changes made to the code until now are minimal, so it absolutely worths the effort. The overhead when introducing OpenMP is now around 2.5% of the serial code time, far more decent than the 6.5% of the original implementation. I will analyse which modifications I have to introduce to the ausgab subroutine, but with your suggestions I have a good idea of what I have to do.
I really really appreciate your help, patience and guidance with this problem. Have a nice new year celebration!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Glad to be of assistance.
Happy New Year to you too.
Jim Dempsey

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page