OpenMP - passing private values within a dynamic extent

YertleTheTurtle · ‎07-18-2010

Hi:

I have been experimenting with OpenMP, trying to pass a module variable into a called subroutine within the dynamic extent of a parallel block.

So, I made a short program to experiment.

If I understand correctly, when a subroutine is called from within a parallel block, each thread gets its own copy of the subroutine, and corresponding copies of any "private" variables are created - one for each thread.

The program and output given below do just that and suggest that my understanding is potentially correct. I realize that the value of "jj" inside "subtest" isn't always going to be the same as the thread number obtained from inside that subroutine (depends on timing), but as long as "jj.ne.0", the point has been made. The compilation directive is:

ifort /Qopenmp omptest.f90

But when I incorporate the same thing in my real program, the same structure doesn't work.

The only difference is that the "real program" is split into many files, one file for each subroutine and module, rather than one single file for the entire program, as in the example program.

If you split the demonstration program given below into 3 files (main program, subtest.f90 and jvar.f90), it no longer works - the variable "jj" always equals zero, the value associated with the master thread, rather than a unique value for each thread. See second output below.

The compilation directive (3 files in the directory) is:

ifort /Qopenmp *.f90

So, is there a compiler/link problem here, or am I missing something?
Yert

PS. I also tried to use "threadprivate" rather than "private". It does the same thing. BUT - the documentation for "threadprivate" in the Intel Fortran "help" file says that "threadprivate" is only supposed to be used for "named common blocks inside a module". So why does the example below not give a compilation error when "threadprivate" is used with the variable name "jj"?
The OpenMP documentation doesn't limit "threadprivate" to common blocks only.

EXAMPLE PROGRAM

module jvar
integer :: jj
end module jvar

program omptest
use jvar, only :jj
use omp_lib

integer :: j

!!$OMP threadprivate (jj)

!$omp parallel private(jj)
!$omp critical (test)
jj = OMP_Get_Thread_num()
write (10,*) ' calling with jj= ', jj
!$omp end critical (test)

call subtest()
!$OMP end parallel

stop
end

subroutine subtest()
use jvar, only : jj
use omp_lib
write (10,*) ' inside with jj= ', jj, 'threadnumber=',OMP_get_thread_num()
return
end subroutine subtest

CORRECT OUTPUT

_____________________________

calling with jj= 1
inside with jj= 1 threadnumber= 1

calling with jj= 4
inside with jj= 4 threadnumber= 4

calling with jj= 2
inside with jj= 2 threadnumber= 2

calling with jj= 5
inside with jj= 5 threadnumber= 5

calling with jj= 0
inside with jj= 0 threadnumber= 0

calling with jj= 6
inside with jj= 6 threadnumber= 6

calling with jj= 7
inside with jj= 7 threadnumber= 7

calling with jj= 3
inside with jj= 3 threadnumber= 3

INCORRECT OUTPUT - CREATED BY SPLITTING THE ABOVE PROGRAM INTO THREE SEPARATE FILES
___________________________________________________________________

calling with jj= 2
inside with jj= 0 threadnumber= 2

calling with jj= 0
inside with jj= 0 threadnumber= 0

calling with jj= 1
inside with jj= 0 threadnumber= 1

calling with jj= 5
inside with jj= 0 threadnumber= 5

calling with jj= 3
inside with jj= 0 threadnumber= 3

calling with jj= 6
inside with jj= 0 threadnumber= 6

calling with jj= 4
inside with jj= 0 threadnumber= 4

calling with jj= 7
inside with jj= 0 threadnumber= 7

Arjen_Markus · ‎07-18-2010

If the program is contained in a single file, then the compiler can analyse it in its entirety. It will probably see that subroutine subtest is used in a threaded context and then take care of it.

However, if you split off subtest, then the compiler cannot see that. You will probably have to declare jj to be threadprivate in the module jvar, rather than in the main program.

It is this sort of subtlety that makes using OpenMP less than trivial.

Regards,

Arjen

TimP · ‎07-19-2010

Intel OpenMP doesn't automatically analyse or correct threading problems simply on account of visibility of all functions at compile time. It's always possible that optimization will correct a failure to specify private (somewhat by accident), but it's not acceptable to count on this.
Better diagnosis of threading problems is promised for the next release, but it still won't happen unless you ask for it, and will continue to require data sets which exercise the necessary execution paths.

Xiaoping_D_Intel · ‎07-19-2010

According to the OpenMP standard, the module variable 'jj' referenced in subroutine 'subtest' should be shared:

2.9.1.2 Data-sharing Attribute Rules for Variables Referenced in a Region but not in a Construct

...

Variables belonging to common blocks, or declared in modules, and referenced in called routines in the region are shared unless they appear in a threadprivate directive.
...

So the output labelled "correct" in the post is actually incorrect, and the output labelled "incorrect" is the correct behaviour.

If you disable the single-file IP optimization with the "/Qip-" option, the results of the single-file and multiple-file builds will be the same. This looks like an ifort bug. I will open a bug report for it and post its status here.

The reason there was no error for the 'threadprivate' in the code is that "!!$OMP threadprivate (jj)" was taken as a comment.

jimdempseyatthecove · ‎07-19-2010

Your subroutine subtest is not receiving the thread-private jj as a dummy argument. Rather, it is pulling in a global value of jj from MODULE jvar. You have two methods to correct your programming problem:

1) pass a reference or value of jj to subtest
2) make jj "threadprivate" in the module
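
The two methods above can be sketched in one small program. This is only an illustration, not the original poster's code: the module "checks", its two functions and the counter "nbad" are names invented here, and the thread number is used as a convenient per-thread value. Each thread stores its thread number, then verifies it once through a dummy argument (method 1) and once through the threadprivate module variable (method 2).

```fortran
module jvar
  implicit none
  integer, save :: jj
  ! One copy of jj per thread, visible in every routine that uses jvar.
  !$omp threadprivate(jj)
end module jvar

module checks   ! hypothetical helper module, for illustration only
  implicit none
contains
  ! Method 1: the per-thread value arrives as a dummy argument.
  integer function mismatch_arg(j) result(bad)
    use omp_lib, only: omp_get_thread_num
    integer, intent(in) :: j
    bad = merge(0, 1, j == omp_get_thread_num())
  end function mismatch_arg

  ! Method 2: the routine reads the threadprivate module variable.
  integer function mismatch_module() result(bad)
    use jvar, only: jj
    use omp_lib, only: omp_get_thread_num
    bad = merge(0, 1, jj == omp_get_thread_num())
  end function mismatch_module
end module checks

program omptest_fixed
  use jvar, only: jj
  use checks
  use omp_lib
  implicit none
  integer :: j, nbad

  nbad = 0
  !$omp parallel private(j) reduction(+:nbad)
  j  = omp_get_thread_num()   ! private local, passed explicitly
  jj = j                      ! this thread's copy of the module variable
  nbad = nbad + mismatch_arg(j) + mismatch_module()
  !$omp end parallel

  ! Every thread should have seen its own value both ways.
  print '(a,i0)', 'mismatches = ', nbad
end program omptest_fixed
```

Because the threadprivate directive sits in the module itself, the behaviour should not depend on whether the routines are compiled in one file or several.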

Also note an additional potential error

OMP_get_thread_num() does NOT return a global thread number. Rather, it returns the team member number within the current parallel region. Only when nested parallelism is OFF, or if you never use nested parallel regions, will the assumption of OMP_get_thread_num() being an application-wide thread number be true. Therefore do not ASSUME 0 means the main-level thread. While this assumption may hold true in your simplified tests, it may very well be false within your actual program.
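
A minimal sketch of that point, assuming a runtime that supports OpenMP 3.0 nesting via omp_set_max_active_levels (the program and variable names are made up for illustration). Every inner team has its own member number 0, so the number of threads reporting 0 matches the number of inner teams rather than being 1.

```fortran
program team_numbers
  use omp_lib
  implicit none
  integer :: nteams, nzero

  nteams = 0
  nzero  = 0
  call omp_set_max_active_levels(2)   ! permit one level of nesting

  !$omp parallel num_threads(2) reduction(+:nteams, nzero)
  nteams = nteams + 1                 ! each outer thread starts one inner team
  !$omp parallel num_threads(2) reduction(+:nzero)
  ! omp_get_thread_num() is relative to the innermost team only.
  if (omp_get_thread_num() == 0) nzero = nzero + 1
  !$omp end parallel
  !$omp end parallel

  print '(a,i0,a,i0)', 'inner teams = ', nteams, ', threads reporting 0 = ', nzero
end program team_numbers
```

The actual counts depend on how many threads the runtime grants, but the two numbers printed should agree with each other.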

Jim Dempsey

YertleTheTurtle · ‎07-19-2010

Thanks to all who answered.

Making "jj" threadprivate either inside the module (obviously preferred) or in each of the subroutines where it is used solves the immediate problem.... BUT

What if the variable is an allocatable array - see the example below.

In that case, both the main program and the called subroutine "subtest" get the lower bound of the array correctly, but the main program and subroutine in every thread get the upper bound incorrectly (see the first few lines of output below).

However, when the "allocate" statement is placed inside the "Parallel" block, the program works as expected.

Is this a compiler or programming issue ?

Comments on other replies: yes, module variables are "shared" by default, but what the program needs is "private" variables from a module. The big program has about 30 or so such variables, and passing them as subroutine arguments is inefficient, although it will work, as was suggested. I tried /Qip, to no effect.

Thanks again.
Yert

________________Sample program with allocatable threadprivate array_____________

module jvar
integer :: jj
integer, allocatable, dimension(:) :: itest
!$OMP threadprivate (jj, itest)
end module jvar

program omptest
use jvar, only :jj, itest
use omp_lib
integer :: j
allocate (itest(1:5))
!$omp parallel ! private(jj)
!$omp critical (test)
jj = OMP_Get_Thread_num()
write (10,*) ' calling with jj= ', jj,' u=',ubound(itest),' l=',lbound(itest)
!$omp end critical (test)
call subtest()
!$OMP end parallel
stop
end

subroutine subtest()
use jvar, only : jj, itest
use omp_lib
write (10,*) ' inside with jj= ', jj, 'threadnumber=',OMP_get_thread_num(), &
' u=',ubound(itest),' l=',lbound(itest)
return
end subroutine subtest

__________FIRST FEW LINES OF OUTPUT______________

calling with jj= 0 u= 5 l= 1

inside with jj= 0 threadnumber= 0 u= 5 l=1

calling with jj= 5 u= 0 l= 1

inside with jj= 5 threadnumber= 5 u= 0 l=1

jimdempseyatthecove · ‎07-20-2010

You have a programming problem.

Only the main thread allocated its private copy of itest.

When each thread is to have a different itest array, as programmed in your module with the threadprivate attribute, then each thread must allocate its copy of the array.

When all threads are to share the same itest, then do not make it threadprivate, and allocate it once (by any thread) prior to use.
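
A minimal sketch of the first case, where each thread owns a separate itest. The internal subroutine "check" and the counter "nbad" are invented for this illustration; the array size of 5 comes from the sample program above. The allocate and deallocate statements sit inside the parallel region, so every thread runs them for its own copy.

```fortran
module jvar
  implicit none
  integer, allocatable, save :: itest(:)
  ! Each thread gets its own (initially unallocated) itest.
  !$omp threadprivate(itest)
end module jvar

program alloc_per_thread
  use jvar, only: itest
  use omp_lib
  implicit none
  integer :: nbad

  nbad = 0
  !$omp parallel reduction(+:nbad)
  allocate (itest(1:5))            ! executed once by every thread
  itest = omp_get_thread_num()     ! fill this thread's copy
  call check(nbad)
  deallocate (itest)               ! each thread releases its own copy
  !$omp end parallel

  print '(a,i0)', 'threads with wrong bounds or contents = ', nbad

contains

  ! Sees the calling thread's copy of itest through the module.
  subroutine check(bad)
    integer, intent(inout) :: bad
    if (lbound(itest, 1) /= 1 .or. ubound(itest, 1) /= 5 &
        .or. any(itest /= omp_get_thread_num())) bad = bad + 1
  end subroutine check

end program alloc_per_thread
```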

Jim Dempsey

YertleTheTurtle · ‎07-20-2010

Thank you.

Where is such information written down?

Can you recommend a book or report?

jimdempseyatthecove · ‎07-21-2010

I am sorry to say I do not have a good book to point you at for OpenMP programming in Fortran. Forum member tim18 might be a good source for a reference.

The suggestions I made are (to me) quite obvious.

Prior to using an array, you must allocate it.

When you have one instance of the array, you need to allocate it at least once before use.

When you have N instances of the array, you need to allocate it at least once for each instance in which it is being used.

I caution against looking for code samples you can cut and paste. Without a fundamental understanding of what you are doing, you are in for a frustrating experience. Parallel programming is not all that hard; it just takes some common sense to become familiar with the basic principles and requirements.

Jim Dempsey

YertleTheTurtle · ‎07-21-2010

I guess I am looking for a fundamental understanding, because obviously something subtle is going on as shown by the two examples.

In the first example, a module variable "jj" is "cloned" (if I may use the word) N times, once for each thread, by the OpenMP compiler. The programmer tells the compiler to do this by declaring "jj" to have the "threadprivate" attribute.

In the second case, once the array "itest" is allocated by the program and declared "threadprivate", you'd expect that the compiler would "clone" it N times in exactly the same manner. But it doesn't. The allocation operation itself needs to be cloned "N" times, and the programmer has to know this. So, the underlying question is what is the difference between cloning an allocated array versus cloning a declared variable and why do they have to be cloned differently? Better yet, where are the rules written down so that one doesn't have to guess about such subtleties. Your interpretation makes sense, but so does mine.

I have found discrepancies between the "help" file summaries for OpenMP routines that come with the Intel compiler and OpenMP documentation available elsewhere (some of which applies to Fortran). For example, the description of "threadprivate" in the "help" file is inconsistent with OpenMP tutorials (see my first post in this thread). Anyway, I just ordered an OpenMP book (sight unseen), so maybe some of these questions will be answered. If the book turns out to be any good, I'll post a comment here in a few weeks.

Thanks again to all who answered.

jimdempseyatthecove · ‎07-22-2010

>>In the first example, a module variable "jj" is "cloned" (if I may use the word) N times, once for each thread

jj is a symbolic name of a memory location known by the program. The symbolic name is "cloned" N times to N different locations (for threadprivate variables). The contents are not "cloned" except when you use COPYIN on your OpenMP statement, which copies the master thread's value of a threadprivate variable into every thread's copy. For local variables declared PRIVATE, the corresponding clause is FIRSTPRIVATE: effectively the local variable transparently becomes a dummy value argument (on stack) in a hidden call to the parallel construct, so each thread has a different copy of the variable (initialized from the original when FIRSTPRIVATE is applied to the !$OMP statement for that variable).

>>In the second case, once the array "itest" is allocated by the program and declared "threadprivate", you'd expect that the compiler would "clone" it N times in exactly the same manner. But it doesn't

itest is an array descriptor (not the contents of the array). The array descriptor is declared at compile time as threadprivate (not after allocation). The array descriptor is "cloned" N times to N different locations (for threadprivate variables), and this cloning of the array descriptor occurs at thread creation time (prior to allocation by the main thread). The contents are not "cloned" except when you use COPYIN on your OpenMP statements. COPYIN applies to the threadprivate variable directly; the variable is not also listed as one of the PRIVATE variables of the !$OMP statement.

Note, do not try to declare a threadprivate variable as PRIVATE on the !$OMP statement as well. The OpenMP specification does not allow threadprivate variables in that clause; of the data-sharing clauses, COPYIN and COPYPRIVATE are the ones that accept them.

In your example program (itest is threadprivate), each thread has the responsibility (the programmer has the responsibility) to allocate itest .AND. to fill it with whatever you want (maybe the same stuff, ordinarily different stuff).

If your intention is for all threads to share the same itest array data, THEN do not make itest threadprivate, and allocate it only once and populate it once prior to entering the parallel region.
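
A minimal sketch of when contents are copied, using a scalar to keep it simple. The local variable "k", the values 42 and 7, and the counter "nbad" are invented for illustration. COPYIN is applied to the threadprivate module variable jj, and FIRSTPRIVATE to the ordinary local k.

```fortran
module jvar
  implicit none
  integer, save :: jj = 0
  !$omp threadprivate(jj)
end module jvar

program copyin_demo
  use jvar, only: jj
  implicit none
  integer :: k, nbad

  jj = 42      ! outside a parallel region this sets only the master copy
  k  = 7       ! ordinary local variable
  nbad = 0

  ! copyin:       each thread's jj starts with the master thread's value
  ! firstprivate: each thread's private k starts with the original value
  !$omp parallel copyin(jj) firstprivate(k) reduction(+:nbad)
  if (jj /= 42 .or. k /= 7) nbad = nbad + 1
  !$omp end parallel

  print '(a,i0)', 'threads missing the copied values = ', nbad
end program copyin_demo
```

Without the copyin clause, the non-master copies of jj would keep whatever value they had before (here the initial 0), which is the behaviour described above.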

Jim Dempsey

YertleTheTurtle · ‎07-23-2010

Thank you for taking the time to write all that. I've read your reply several times, and it is (slowly) sinking in.

Supplemental question:

Since each thread must ALLOCATE an allocatable, threadprivate "array descriptor" as you call it, is it necessary for each thread to DEALLOCATE the same array prior to thread termination?

If it is not deallocated, will a memory leak result, or does the compiler automatically take care of this?

jimdempseyatthecove · ‎07-25-2010

The thread should deallocate any thread private allocatable.

Note, in OpenMP the application does not have control over thread termination. Well, I should rephrase this: in OpenMP, the application cannot explicitly terminate OpenMP threads and maintain orderly execution/shutdown. An orderly shutdown has the main thread return to main level, then return from the main program. While the main thread could issue STOP at any level of parallel regions, doing so at any level other than the main level would not be orderly. Should such a disorderly STOP (or call to EXIT, etc.) terminate the main thread, the other threads will have a disorderly termination. Memory will get returned, but file handles (files) may not necessarily be closed in an orderly fashion.

Rephrase of above:

Should you begin a parallel region, and while within the parallel region a thread detects an abnormal condition (or executes an instruction producing a fault), and should this thread then terminate, then OpenMP, and thus the application in general, is crippled and will soon die. Do not issue STOP, CALL EXIT(n), etc. by a thread in an application unless you wish to stop the application. STOP etc. is not to be used as a means of an early exit from a parallel region.
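
One possible pattern, sketched here with made-up data and names ("x", "nfail"): a thread that detects a bad condition records it in a reduction variable and carries on, and the decision to give up is taken at main level after the parallel region has ended.

```fortran
program orderly_exit
  implicit none
  integer :: i, nfail
  real :: x(8)

  x = [1.0, 4.0, -1.0, 9.0, 16.0, -4.0, 25.0, 36.0]
  nfail = 0

  !$omp parallel do reduction(+:nfail)
  do i = 1, size(x)
    if (x(i) < 0.0) then
      nfail = nfail + 1        ! record the problem; no STOP inside the region
    else
      x(i) = sqrt(x(i))
    end if
  end do
  !$omp end parallel do

  ! Back at main level: react to the error here, then return normally.
  if (nfail > 0) then
    print '(a,i0)', 'invalid values found: ', nfail
  else
    print '(a)', 'all values processed'
  end if
end program orderly_exit
```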

Jim Dempsey