Dear Intel forumers,
I have recently introduced THREADPRIVATE statements in some of my Fortran commons in order to make the variables contained in the commons private for each thread. These commons contain quite large variables. Dynamic threading is set to OFF, and I use the same number of threads in all parallel regions.
I noticed that, when I set the number of threads to 1, the performance of the program with and without the THREADPRIVATE attribute is very different (30-40% slower with THREADPRIVATE attributes when running the optimized (O2) version of the code).
I have carefully read the documentation, and I don't really understand why. Under these conditions, I thought the memory was allocated once, when the threadprivate common is first used in the first parallel region, and stayed "alive" for the rest of the program's execution. So the "cost" of THREADPRIVATE should be a one-shot cost at the beginning. That does not seem to be the case. Could you tell me more about it? Are there options to optimize the use of THREADPRIVATE statements?
I use Intel Fortran Windows Compiler 10.0.0.27.
Thank you for your help,
F-Xavier Bouchez
15 Replies
When you compile with OpenMP enabled .AND. specify one thread, the generated code will still use the threadprivate access methods for dereferencing threadprivate variables (a few extra instructions per dereference).
Therefore any slowdown due to threadprivate will be visible even in the one-thread OpenMP build of the application. If you observe a slowdown when going from one thread (in the multithreaded build) to multiple threads, then suspect adverse memory interactions (cache evictions), or perhaps a parallel version of a short loop that cannot amortize the thread start/stop overhead cost-effectively.
Can you supply a small code example?
You would expect to pay the cost of copying data in and out of threadprivate storage at the beginning and end of each parallel region, which could be substantially more than the bare allocation cost. I don't know how the outdated compiler would affect this, except that it doesn't include the current version of the OpenMP library.
Quoting - tim18
You would expect to pay the cost of copying data in and out of threadprivate at the beginning and end of each parallel region, which could be substantially more than the bare allocation cost. I don't know how the out-dated compiler would affect this, except that it doesn't include the current version of OpenMP library.
Dear Tim, thank you for your response. I haven't set any "options" to activate copying of threadprivate variables at the end of a parallel region. In fact, the threadprivate variables are just a set of "temporary variables" shared across some subroutines, and they don't need to be used after the end of the parallel region. In this situation, is there a default copy at the end of a parallel region?
This way of coding is debatable; in fact, I am trying to parallelize some big, old Fortran codes with a minimum of changes to their structure.
Quoting - tim18
You would expect to pay the cost of copying data in and out of threadprivate at the beginning and end of each parallel region, which could be substantially more than the bare allocation cost. I don't know how the out-dated compiler would affect this, except that it doesn't include the current version of OpenMP library.
Tim,
The shared (local) data (scalars or descriptor) would be copied (as hidden dummy arguments) but the threadprivate data persists (no copying required).
Jim
Quoting - jimdempseyatthecove
When you compile with OpenMP enabled .AND. specify one thread, the code generated will still use the threadprivate access methods for dereferencing threadprivate variables (a few extra instructions per dereference).
Therefore any slowdown due to threadprivate will be visible in the one thread OpenMP produced application using threadprivates. When you observe a slowdown when going from one thread (in mt generated program) to multiple threads then suspect adverse memory interactions (cache evictions). Or, perhaps, making a parallel version of a short loop that cannot ammortize the thread start/stop overhead in a cost effective manner.
Can you supply a small code example?
Thank you for your response.
This is the idea of the global structure:
[cpp]toto.h:
      COMMON /TEST/ TMPVAR(A_BIG_NUMBER)
!$OMP THREADPRIVATE (/TEST/)

first_routine.f:
      INCLUDE 'toto.h'
      ! ... work on TMPVAR in subroutines ...

second_routine.f:
!$OMP PARALLEL
      CALL first_routine(...)
!$OMP END PARALLEL

third_routine.f:
      CALL first_routine(...)[/cpp]
In this structure, first_routine can be called inside or outside a parallel region. Since first_routine uses a common declared threadprivate, how will this common behave when used outside a parallel region? Can I manage the code in another way?
The results are correct, but the performance is not. I noticed (to be confirmed) that when I suppress my parallel regions but leave the threadprivate commons in place, the code remains slow.
Since I use OpenMP 2, maybe some improvements have been made in OpenMP 3 (I will receive IFORT 11 soon).
Sincerely yours
[cpp]toto.h:
      COMMON /TEST/ TLSVAR(A_BIG_NUMBER)
!$OMP THREADPRIVATE (/TEST/)
      COMMON /TESTTWO/ STATICVAR(A_BIG_NUMBER)

first_routine.f:
      SUBROUTINE FIRST_ROUTINE(FOO)
      DIMENSION FOO(A_BIG_NUMBER)
      INCLUDE 'toto.h'
      FOO(INDEX) = expression       ! modifies cell in array passed as dummy FOO
      TLSVAR(INDEX) = expression    ! modifies current thread's (may be main) copy
      STATICVAR(INDEX) = expression ! modifies cell in static array
      END

second_routine.f:
      DIMENSION LOCALVAR(A_BIG_NUMBER) ! assume it fits on stack
!$OMP PARALLEL SHARED(LOCALVAR)
      CALL first_routine(TLSVAR)    ! FOO becomes current thread's TLSVAR
      CALL first_routine(LOCALVAR)  ! FOO becomes the shared LOCALVAR
      CALL first_routine(STATICVAR) ! FOO becomes all threads' STATICVAR
!$OMP END PARALLEL

third_routine.f: ! (not in a parallel region)
      DIMENSION LOCALVAR(A_BIG_NUMBER) ! assume it fits on stack
      CALL first_routine(TLSVAR)    ! FOO becomes current thread's TLSVAR
                                    ! (same as main thread's TLSVAR)
      CALL first_routine(LOCALVAR)  ! FOO becomes LOCALVAR
      CALL first_routine(STATICVAR) ! FOO becomes all threads' STATICVAR[/cpp]
If you can send a small working code section illustrating your problem, we can better assist you in determining the cause (usually a coding error due to unfamiliarity with the programming model).
Jim
Quoting - jimdempseyatthecove
If you can send a working (small) code section illustrating your problem we can better assist you in determining the problem (usualy a coding error on your part due to unfamiliarity of programming model).
Jim
Dear Jim, I have created a simple sample code that reproduces my performance problem. You'll see that there is no parallel section, only THREADPRIVATE statements. (PS: I know that with local variables this example would be far faster, but that is not the point of this topic.)
When I execute this code (Release build), I obtain:
NON-THREADPRIVATE LOOP TIME = 17.9481196632551
THREADPRIVATE LOOP TIME = 19.7407659353339
Press any key to continue . . .
I would like to know more about this performance gap. I apologize for the presentation of the code; I am quite new to this interface.
[cpp]Threadprivate.h:
      integer*4 i1, i2, i3, i4, i5, i6
      COMMON /TEST / i1, i2, i3
      COMMON /TEST2/ i4, i5, i6
!$OMP THREADPRIVATE (/TEST/)

My_main.f:
      program My_threadprivate_example
      use ifport
      implicit none
      include 'omp_lib.h'
      integer*8 i     ! integer*8: the loop count exceeds the integer*4 range
      integer*4 istat
      double precision ytim1, ytim2

      ytim1 = OMP_GET_WTIME()
      do i = 1, 10000000000_8
         call My_second_routine()
      enddo
      ytim2 = OMP_GET_WTIME()
      write(*,*) 'NON-THREADPRIVATE LOOP TIME = ', ytim2 - ytim1

      ytim1 = OMP_GET_WTIME()
      do i = 1, 10000000000_8
         call My_first_routine()
      enddo
      ytim2 = OMP_GET_WTIME()
      write(*,*) 'THREADPRIVATE LOOP TIME = ', ytim2 - ytim1

      istat = SYSTEM("PAUSE")
      end program My_threadprivate_example

My_first_routine.f:
      subroutine My_first_routine
      include 'Threadprivate.h'
      i1 = 1
      i2 = 2
      i3 = 3
      end subroutine

My_second_routine.f:
      subroutine My_second_routine
      include 'Threadprivate.h'
      i4 = 1
      i5 = 2
      i6 = 3
      end subroutine[/cpp]
What you observe is correct. Access to a threadprivate variable induces some additional overhead. The overhead per access is small but noticeable. In a simple routine such as the one in your example, the overhead is significant; however, it may be insignificant in practice in your real application.
I suggest you set a breakpoint in each routine and open the disassembly window. Examine both the TLS stores and the non-TLS stores. You will notice a few extra assembly instructions to accomplish the TLS stores (some of these instructions can be optimized out in Release mode).
Example:
Assume you have a small vector POS(3) containing the X, Y, and Z components, and you wish to rotate this vector. When this vector is in thread-local storage, the CALL ROTATE(POS, ROT) would incur the small overhead only once, in constructing the address of POS (or the array descriptor for POS), but the routine ROTATE itself would contain no such overhead.
Your alternative to using thread-local storage is to pass a thread context pointer in all subroutine and function calls that require thread-context information (a big programming effort). The overhead of doing that generally far exceeds the cost of letting the compiler do it for you. Essentially, a thread-local storage access becomes something similar to THREADCONTEXT%ASSEMBLYCONTEXT%yourTLSVariableHere
where THREADCONTEXT is a vendor method for obtaining a per-thread context area, ASSEMBLYCONTEXT is a vendor method for obtaining an assembly (compile-time object) specific thread context variable, and yourTLSVariableHere is your thread-local storage variable name. The compiler does this automagically, hiding the THREADCONTEXT%ASSEMBLYCONTEXT% part and providing a portable programming model.
Jim Dempsey
I totally agree with your explanation.
In my case, I notice that the overhead becomes very significant for commons that contain lots of very small variables and are used in functions called millions of times.
F-Xavier Bouchez
For the variables in threadprivate areas, you can promote them to stack locals at subroutine entry.
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
... ! use modifiableVar, copyConstantInt, transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
or
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
#define tlsVar modifiableVar
#define tlsInt copyConstantInt
... ! use tlsVar tlsInt transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
Note: a #define is active from the line where it appears through the end of the compilation unit (not just the end of the subroutine), so this form requires the preprocessor (fpp).
Jim Dempsey
Quoting - jimdempseyatthecove
For the variables in thread private areas you can boost them to stack local at entry to subroutine.
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
... ! use modifiableVar, copyConstantInt, transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
or
subroutine foo()
use xxx ! or include 'yourthreadprivate.h'
real, pointer :: modifiableVar
integer :: copyConstantInt
real :: transitionalVar2
...
modifiableVar => tlsVar
copyConstantInt = tlsInt
transitionalVar2 = tlsVar2
#define tlsVar modifiableVar
#define tlsInt copyConstantInt
... ! use tlsVar tlsInt transitionalVar2
tlsVar2 = transitionalVar2 ! restore modified local transitional copy
end subroutine foo
Note, #define is active from line it is located on through end of compilation unit (not end of subroutine)
Jim Dempsey
I am awfully sorry, but this time I don't really understand the piece of code or the purpose of each variable.
How could it be applied to the sample code I provided? Will it really help?
Sincerely yours,
Thank you for your help,
F-Xavier Bouchez
common variables: i, j, k
thread-private variables: tp_i, tp_j, tp_k
When compiled for multithreaded use, this creates multiple context areas:
[cpp]  thread 0 context   thread 1 context   ...   thread n context
  tp_i               tp_i                     tp_i
  tp_j               tp_j                     tp_j
  tp_k               tp_k                     tp_k[/cpp]
The compiler will be able to generate code that knows where i, j, k will be located at runtime. The compiler will NOT be able to generate code that knows where tp_i, tp_j, tp_k will be located at runtime, but it CAN generate code to determine where tp_i, tp_j, tp_k will be located at runtime. This generates additional overhead, which may or may not be significant in your program. In cases where it is significant, use transitional variables.
Example: if tp_i, tp_j, and tp_k are index bases within an array, and each thread is to manipulate portions of the array relative to those bases:
do i=0,count-1
do j=0,count-1
do k=0,count-1
array(tp_i+i, tp_j+j, tp_k+k) = expression
end do
end do
end do
then you are programming with unnecessary overhead. Using transitional variables:
i_base=tp_i
j_base=tp_j
k_base=tp_k
do i=0,count-1
do j=0,count-1
do k=0,count-1
array(i_base+i, j_base+j, k_base+k) = expression
end do
end do
end do
and this reduces the thread-private access overhead to a single occurrence.
*** The above assumes tp_i, tp_j, tp_k do not vary during execution of the loop.
Dear all,
Sorry for this late response (I was abroad). Thank you for your answer; I will test your solution and post the results.
I consider this post answered, since the problem is understood.
I thank all the people who responded.
Sincerely yours,
F-Xavier
F-Xavier,
I am sure you are aware that the loop I illustrated can be rewritten
[cpp]! convert
      i_base = tp_i
      j_base = tp_j
      k_base = tp_k
      do i=0,count-1
        do j=0,count-1
          do k=0,count-1
            array(i_base+i, j_base+j, k_base+k) = expression
          end do
        end do
      end do

! rewritten as
      do i=tp_i,tp_i+count-1
        do j=tp_j,tp_j+count-1
          do k=tp_k,tp_k+count-1
            array(i, j, k) = expression
          end do
        end do
      end do[/cpp]
Jim
