Dear Intel forumers,
I have recently introduced THREADPRIVATE statements in some of my Fortran commons in order to make the variables contained in the commons private for each thread. These commons contain quite large variables. Dynamic threading is set to OFF, and I use the same number of threads in all parallel regions.
I noticed that, when I set the number of threads to 1, the performance of the program with and without the THREADPRIVATE attribute is very different (30-40% slower with THREADPRIVATE attributes when running the optimized (O2) version of the code).
I have carefully read the documentation, and I don't really understand why. Under these conditions, I thought the memory was allocated once, when the threadprivate common is first used in the first parallel region, and stayed "alive" for the rest of the program's execution. So the "cost" of THREADPRIVATE should be a one-shot cost at the beginning. That does not seem to be the case. Could you tell me more about it? Are there options to optimize the use of THREADPRIVATE statements?
I use Intel Fortran Windows Compiler 10.0.0.27.
Thank you for your help,
F-Xavier Bouchez
15 Replies
When you compile with OpenMP enabled .AND. specify one thread, the generated code will still use the threadprivate access methods for dereferencing threadprivate variables (a few extra instructions per dereference).
Therefore any slowdown due to threadprivate will be visible even in the one-thread OpenMP build of the application. If you observe a slowdown when going from one thread (in the multithreaded build) to multiple threads, then suspect adverse memory interactions (cache evictions), or perhaps a parallel version of a short loop that cannot amortize the thread start/stop overhead cost-effectively.
Can you supply a small code example?
You would expect to pay the cost of copying data in and out of threadprivate storage at the beginning and end of each parallel region, which could be substantially more than the bare allocation cost. I don't know how the outdated compiler would affect this, except that it doesn't include the current version of the OpenMP library.
Quoting - tim18
You would expect to pay the cost of copying data in and out of threadprivate at the beginning and end of each parallel region, which could be substantially more than the bare allocation cost. I don't know how the out-dated compiler would affect this, except that it doesn't include the current version of OpenMP library.
Dear Tim, thank you for your response. I haven't set any "options" to activate copying of threadprivate variables at the end of a parallel region. In fact, the threadprivate variables are just a set of "temporary variables" shared across some subroutines, and they don't need to be used after the end of the parallel region. In this situation, is there a default copy at the end of a parallel region?
This way of coding is debatable; in fact, I am trying to parallelize some big, old Fortran codes with a minimum of changes to their structure.
Quoting - tim18
You would expect to pay the cost of copying data in and out of threadprivate at the beginning and end of each parallel region, which could be substantially more than the bare allocation cost. I don't know how the out-dated compiler would affect this, except that it doesn't include the current version of OpenMP library.
Tim,
The shared (local) data (scalars or descriptor) would be copied (as hidden dummy arguments) but the threadprivate data persists (no copying required).
Jim
Quoting - jimdempseyatthecove
When you compile with OpenMP enabled .AND. specify one thread, the code generated will still use the threadprivate access methods for dereferencing threadprivate variables (a few extra instructions per dereference).
Therefore any slowdown due to threadprivate will be visible in the one thread OpenMP produced application using threadprivates. When you observe a slowdown when going from one thread (in mt generated program) to multiple threads then suspect adverse memory interactions (cache evictions). Or, perhaps, making a parallel version of a short loop that cannot ammortize the thread start/stop overhead in a cost effective manner.
Can you supply a small code example?
Thank you for your response.
This is the idea of the global structure:
[cpp]toto.h:
      COMMON /TEST/ TMPVAR(A_BIG_NUMBER)
!$OMP THREADPRIVATE (/TEST/)

first_routine.f:
      INCLUDE 'toto.h'
      ! ... work on TMPVAR in subroutines ...

second_routine.f:
!$OMP PARALLEL
      CALL first_routine(...)
!$OMP END PARALLEL

third_routine.f:
      CALL first_routine(...)[/cpp]
In this structure, first_routine can be called inside or outside a parallel region. Since first_routine uses a common declared threadprivate, how will this common behave when used outside a parallel region? Can I manage the code in another way?
The results are correct, but the performance is not. I noticed (to be confirmed) that when I suppress my parallel regions but leave the threadprivate commons in place, the code remains slow.
Since I use OpenMP 2, maybe some improvements have been made in OpenMP 3 (I will receive IFORT 11 soon).
Sincerely yours
[cpp]toto.h:
      COMMON /TEST/ TLSVAR(A_BIG_NUMBER)
!$OMP THREADPRIVATE (/TEST/)
      COMMON /TESTTWO/ STATICVAR(A_BIG_NUMBER)

first_routine.f:
      SUBROUTINE FIRST_ROUTINE(FOO)
      DIMENSION FOO(A_BIG_NUMBER)
      INCLUDE 'toto.h'
      FOO(INDEX) = expression       ! modifies cell in array passed as dummy FOO
      TLSVAR(INDEX) = expression    ! modifies current thread's (may be main) copy
      STATICVAR(INDEX) = expression ! modifies cell in static array
      END

second_routine.f:
      DIMENSION LOCALVAR(A_BIG_NUMBER) ! assume it fits on stack
!$OMP PARALLEL SHARED(LOCALVAR)
      CALL first_routine(TLSVAR)    ! FOO becomes current thread's TLSVAR
      CALL first_routine(LOCALVAR)  ! FOO becomes the shared LOCALVAR
      CALL first_routine(STATICVAR) ! FOO becomes all threads' STATICVAR
!$OMP END PARALLEL

third_routine.f: ! (not in a parallel region)
      DIMENSION LOCALVAR(A_BIG_NUMBER) ! assume it fits on stack
      CALL first_routine(TLSVAR)    ! FOO becomes current thread's TLSVAR
                                    ! (same as main thread's TLSVAR)
      CALL first_routine(LOCALVAR)  ! FOO becomes LOCALVAR
      CALL first_routine(STATICVAR) ! FOO becomes all threads' STATICVAR[/cpp]
If you can send a small working code section illustrating your problem, we can better assist you in determining the cause (usually a coding error due to unfamiliarity with the programming model).
Jim
Quoting - jimdempseyatthecove
If you can send a working (small) code section illustrating your problem we can better assist you in determining the problem (usualy a coding error on your part due to unfamiliarity of programming model).
Jim
Dear Jim, I have created a simple sample code that reproduces my performance problem. You'll see that there is no parallel section, only THREADPRIVATE statements. (PS: I know that with local variables this example would be far faster, but that is not the point of this topic.)
When I execute this code (Release build), I obtain:
NON-THREADPRIVATE LOOP TIME = 17.9481196632551
THREADPRIVATE LOOP TIME = 19.7407659353339
Press any key to continue . . .
I would like to know more about this performance gap. I apologize for the presentation of the code; I am quite new to this interface.
[cpp]Threadprivate.h:
      integer*4 i1, i2, i3, i4, i5, i6
      COMMON /TEST / i1, i2, i3
      COMMON /TEST2/ i4, i5, i6
!$OMP THREADPRIVATE (/TEST/)

My_main.f:
      program My_threadprivate_example
      use ifport
      implicit none
      include 'omp_lib.h'
      integer*8 i     ! integer*8: the loop count exceeds the integer*4 range
      integer*4 istat
      double precision ytim1, ytim2

      ytim1 = OMP_GET_WTIME()
      do i = 1, 10000000000_8
         call My_second_routine()
      enddo
      ytim2 = OMP_GET_WTIME()
      write(*,*) 'NON-THREADPRIVATE LOOP TIME = ', ytim2 - ytim1

      ytim1 = OMP_GET_WTIME()
      do i = 1, 10000000000_8
         call My_first_routine()
      enddo
      ytim2 = OMP_GET_WTIME()
      write(*,*) 'THREADPRIVATE LOOP TIME = ', ytim2 - ytim1

      istat = SYSTEM("PAUSE")
      end program My_threadprivate_example

My_first_routine.f:
      subroutine My_first_routine
      include 'Threadprivate.h'
      i1 = 1
      i2 = 2
      i3 = 3
      end subroutine

My_second_routine.f:
      subroutine My_second_routine
      include 'Threadprivate.h'
      i4 = 1
      i5 = 2
      i6 = 3
      end subroutine[/cpp]
What you observe is correct. Access to a threadprivate variable induces some additional overhead. The overhead per access is small but noticeable. In a simple routine such as the one in your example, the overhead is significant; however, it may be insignificant in practice in your real application.
I suggest you set a breakpoint in each routine and open the disassembly window. Examine both the TLS stores and the non-TLS stores. You will notice a few extra assembly instructions to accomplish the TLS stores (some of these instructions can be optimized out in Release mode).
Example:
Assume you have a small vector POS(3) containing the X, Y, and Z components, and you wish to rotate this vector. When this vector is in thread-local storage, the CALL ROTATE(POS, ROT) would incur the small overhead only once, in constructing the address of POS (or the array descriptor for POS), but the routine ROTATE itself would contain no such overhead.
Your alternative to using thread-local storage is to pass a thread context pointer in all subroutine and function calls that require thread-context information (a big programming effort). The overhead of doing that generally far exceeds the cost of letting the compiler do it for you. Essentially, a thread-local storage access becomes something similar to THREADCONTEXT%ASSEMBLYCONTEXT%yourTLSVariableHere
where THREADCONTEXT is a vendor method for obtaining a per-thread context area, ASSEMBLYCONTEXT is a vendor method for obtaining an assembly (compile-time object) specific thread context variable, and yourTLSVariableHere is your thread-local storage variable name. The compiler does this automagically, hiding the THREADCONTEXT%ASSEMBLYCONTEXT% part and providing a portable programming model.
Jim Dempsey
I totally agree with your explanation.
In my case, I notice that the overhead becomes very significant for commons that contain lots of very small variables and are used in functions called millions of times.
F-Xavier Bouchez
For the variables in threadprivate areas, you can promote them to stack locals at subroutine entry.
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
... ! use modifiableVar, copyConstantInt, transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
or
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
#define tlsVar modifiableVar
#define tlsInt copyConstantInt
... ! use tlsVar tlsInt transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
Note: a #define is active from the line where it appears through the end of the compilation unit (not just the end of the subroutine), so this form requires the preprocessor (fpp).
Jim Dempsey
Quoting - jimdempseyatthecove
For the variables in thread private areas you can boost them to stack local at entry to subroutine.
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
... ! use modifiableVar, copyConstantInt, transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
or
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
#define tlsVar modifiableVar
#define tlsInt copyConstantInt
... ! use tlsVar tlsInt transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
Note, #define is active from line it is located on through end of compilation unit (not end of subroutine)
Jim Dempsey
I am awfully sorry, but this time I don't really understand the piece of code or the purpose of each variable.
How could it be applied to the sample code I provided? Will it really help?
Sincerely yours,
Thank you for your help,
F-Xavier Bouchez
common variables: i, j, k
thread-private variables: tp_i, tp_j, tp_k
When compiled for multithreaded use, this creates multiple context areas:
[cpp]  thread 0 context   thread 1 context   ...   thread n context
  tp_i               tp_i                     tp_i
  tp_j               tp_j                     tp_j
  tp_k               tp_k                     tp_k[/cpp]
The compiler will be able to generate code that knows where i, j, k will be located at runtime. The compiler will NOT be able to generate code that knows where tp_i, tp_j, tp_k will be located at runtime, but it CAN generate code to determine where tp_i, tp_j, tp_k will be located at runtime. This generates additional overhead, which may or may not be significant in your program. In cases where it is significant, use transitional variables.
Example: if tp_i, tp_j, and tp_k are index bases within an array, and each thread is to manipulate portions of the array relative to those bases:
do i=0,count-1
do j=0,count-1
do k=0,count-1
array(tp_i+i, tp_j+j, tp_k+k) = expression
end do
end do
end do
then you are programming with unnecessary overhead. Using transitional variables:
i_base=tp_i
j_base=tp_j
k_base=tp_k
do i=0,count-1
do j=0,count-1
do k=0,count-1
array(i_base+i, j_base+j, k_base+k) = expression
end do
end do
end do
and this reduces the thread-private access overhead to a single occurrence.
*** The above assumes tp_i, tp_j, tp_k do not vary during execution of the loop.
Dear all,
Sorry for this late response (I was abroad). Thank you for your answer; I will test your solution and post the results.
I consider this post answered, since the problem is understood.
I thank all the people who responded.
Sincerely yours,
F-Xavier
F-Xavier,
I am sure you are aware that the loop I illustrated can be rewritten
[cpp]! convert
      i_base = tp_i
      j_base = tp_j
      k_base = tp_k
      do i=0,count-1
        do j=0,count-1
          do k=0,count-1
            array(i_base+i, j_base+j, k_base+k) = expression
          end do
        end do
      end do

! rewritten as
      do i=tp_i,tp_i+count-1
        do j=tp_j,tp_j+count-1
          do k=tp_k,tp_k+count-1
            array(i, j, k) = expression
          end do
        end do
      end do[/cpp]
Jim
