I have a problem with porting my serial fortran (parallel studio XE 2018 update3, on Win7) program to OpenMP.
Problem: the Visual Studio debugger cannot see the threadprivate variables.
The original program is a serial one which I have long been using.
calculation depending on the loop index is here
many calls are included
I started an adaptation of this to a parallel one, by encapsulating all the codes within the outermost loop in MyProg into a subroutine MyProg_core.
The specific behavior of MyProg_core is solely determined by the i-th condition.
All the related routines are in the dynamic extents.
The parallel version becomes:
other ordinary variables var1, var,2,..
some serial codes
!$OMP parallel shared(all vars in mod_global_vars)
!$OMP do schedule(static,1)
do i=1, num_jobs
call MyProg_core(i, var1, var2,..) ! all calls are in the dynamic extent
!$OMP end do
!$OMP end parallel
All common and read-only variables shared by the threads are gathered in the module below:
integer*4 G_var1, G_var2,.., ...
real*8, allocatable::G_array1(:), G_array2(:,:), ..
All private variables that can be modified in each thread are in the following module, together with the threadprivate list:
real*8, allocatable,save:: P_var1, P_var2,...
integer*4, save:: P_array1(:), P_array(:,:),...
!$OMP threadprivate(list of all vars in this module)
"use PrivateVars" is placed only in MyProg_core and deeper routines inside using the corresponding variables, i.e., in the dynamic extent only.
The PROBLEM is:
When this code is run in the visual studio debugger, all the threadprivate variables are not visible but displaying "undefined address", in the main window or parallel watch window. Shared variables are on the other hand always visible.
The invisible variables, however, can be printed on the screen using "write" statement. Also, when an invisible threadprivate P_var is copied to ordinary var as var=P_var, var shows the value of P_var in the debugger. The calculation runs normally and outputs the final results. Therefore, only the debugger fails to see their values, which are correctly kept in each thread during the execution.
Can you suggest the possible reason for this phenomena?
If I understand it right is the code doing something like:
module testmod implicit none integer,parameter:: n=5 integer,allocatable,save:: iprivate(:) !$omp threadprivate (iprivate) contains subroutine initiate_iprivate() allocate(iprivate(n)) iprivate(:)=-1 end subroutine subroutine sub(ii) integer,intent(in)::ii iprivate(1:n)=ii continue end subroutine sub end module program test use testmod use omp_lib implicit none integer:: i,ii call initiate_iprivate() call OMP_SET_NUM_THREADS(2) !$omp parallel do private(ii) do i=1,10 ii=i call sub(ii) write(*,*) i enddo !$omp end parallel do end
This code gives different kinds of errors as undefined allocation or allocated to strange parts of memory space (IVF 18 + VS 2013) causing VS to crash.
Re-wirting it to a more "pure" form solves the problem:
module testmod implicit none integer,parameter:: n=5 contains pure subroutine sub(ii,initiated_iprivate) integer,intent(in)::ii,initiated_iprivate(n) !dir$ define alt_alloc !dir$ if defined(alt_alloc) integer,allocatable:: iprivate(:) ! use allocaded memory if n is large allocate(iprivate(n)) !dir$ else integer:: iprivate(n) ! puts the vector in the stack !dir$ endif iprivate(:)=initiated_iprivate(:) ! copy the common initiation values to the threadprivate work space iprivate(1:n)=ii ! do the work continue !dir$ if defined(alt_alloc) deallocate(iprivate) !dir$endif end subroutine sub end program test use testmod use omp_lib implicit none integer:: i,ii integer,allocatable:: initiated_iprivate(:) allocate(initiated_iprivate(n)) initiated_iprivate(:)=-1 call OMP_SET_NUM_THREADS(2) !$omp parallel do private(ii) do i=1,10 ii=i call sub(ii,initiated_iprivate) write(*,*) i enddo !$omp end parallel do end
Post #2, first example is in error. The call to init_threadprivate is (was) made from the serial region (also prior to thread pool creation), thus initializing only the threadprivate array of the master thread. To correct for this, enclose the call to init_threadprivate within a (the first) parallel region (being careful to assure thread count is what you desire).
Also, from the IVF documentation:
On entry to the first parallel region, data in the THREADPRIVATE common blocks or variables should be assumed to be undefined unless a COPYIN clause is specified in the PARALLEL directive.
However, the above statement is not definitive as to what happens with allocatable arrays declared threadprivate. IOW does the array descriptor get copied, thus producing separate array descriptors with same content (pointing to the same pre-allocated data and thus producing heap corruption on first deallocate/reallocate). A copyin'd empty array descriptor is safe.
Thank you, LeonardB and Jim,
Though I have not understood your comments completely, May I make a further explanation about the initialization of variables?
At the beginning of MyProg_core (inside the parallel loop, in the dynamic extent), both shared and threadprivate arrays are initialized as follows.
call AllocSharedArrays ! for shared arrays
allocate(P_array1(..), .. ) ! for threadprivate arrays
*** end codes
AllocSharedArrays allocates shared array only once when the first thread arrives at this point. It checks whether these are already allocated or not as follows, IF( ALLOCATED(G_array1)) return
Because all shared arrays are simultaneously allocated or deallocated, only the first array in the argument list is checked.
For threadprivate arrays, they are deallocted at the exit of MyProg_core and allocated again at the next entry.
After these allocations, initial data are loaded.
CALL InputParallel ! get shared values
!$OMP end critical
*** end codes
This call sets the shared variables from the input file. Therefore, calls in the second or later threads are redundant. However, I leave it for simplicity. The time for overwriting the shared variables in each thread is negligible. I think this has no side effects.
After calling InputParallel, these two routines are called to prepare the threadprivate data.
*** end codes
There are two kinds of threadprivate variables. (1)those read from the input file as above and to be modified as the calculation proceeds, and (2)those generated and used only internally in MyProg_core.
CopySharedVarsToPrivate copies the shared to the threadprivate of former type as their initial values,
The latter ones are reset in ResetInternalVars to 0 or "", etc.
in order to avoid unexpected errors.
Thus, all the threadprivate variables are allocated and initialized at the beginning of each thread, I believe...
As I mentioned in the first post, the program seems to output correct results, which looks like the ones which had been given by the serial codes.
Please show the code to the extent of when/where you enter the parallel region .AND. how you protect against race conditions. You may elide code not require to illustrate correctness.
you>> AllocSharedArrays allocates shared array only once when the first thread arrives at this point. It checks whether these are already allocated or not as follows, IF( ALLOCATED(G_array1)) return
Potentially has an error, due to code not seen, it is unknown if race conditions are protected. The proper coding to protect race condition would look like:
IF( ALLOCATED(G_array1)) return !$OMP CRITICAL IF(.NOT. ALLOCATED(G_array1)) then ALLOCATE(G_array1(something)) ENDIF !$OMP END CRITICAL return
You may have additional issues if the allocation of the shared arrays additionally require initialization. If this is the case then consider adding a LOGICAL variable. Something like:
LOGICAL, SAVE :: G_arrays_ready = .FALSE. ... IF( G_arrays_ready) return !$OMP CRITICAL IF(.NOT. G_arrays_ready) then ALLOCATE(G_array1(something)) call InitGarrays G_arrays_ready = .TRUE. ENDIF !$OMP END CRITICAL return
And don't forget similar code to reset G_arrays_ready. Also, assure that you do not enter the deallocation section prior to all threads finishing use of the shared data.
I have prepared a small OpenMP console code files that reproduce the symptom, by shrinking the original project. My system is Win7 x64 with 16 cores. Fortran is Parallel Studio XE 2018 Update 3 Composer Edition for Win.
10 files are included as follows.
1) test_main.FOR ( program test_main): This sets and reports the number of threads, and then construct the parallel region. The number of jobs is 16. Basic constants used in array allocation is set by calling GetDataSizes.
2) test_core.for (subroutine test_core): This is the whole body of parallel processing. It displays the loop count and its own thread ID on the screen when it is called. However, the debugger cannot display thread ID because of 'undefined address.' This routine also does the followings, retrieving the data used in the calculation, array allocation and deallocation, data copy from shared to threadprivate variables, two types of data assignments.
3) mod_GlobalVars.for (module GlobalVars): Contains the global variables declared as SHARED in the test_main.
4) Mod_Vars.for (module VarDecs): Contains thread variables with !$OMP threadprivate list.
5) Input_parallel.for (subroutine InputParallel): Mimics File I/O.
6) GetDataSizes.for (subroutine GetDataSizes); Mimics File I/O of array size constants.
7) AllocSharedArrays.for (subroutine AllocSharedArrays): For 1-time allocation of shared arrays.
8) DeallocSharedArrays.for (subroutine DeallocSharedArrays): Deallocation of shared arrays.
9) DeallocPrivateArrays.for (subroutine DeallocPrivateArrays): For deallocation of threadprivate arrays. The corresponding allocation is inlined in sub test_core.
10) CopySharedToPrivate.for (subroutine CopySharedVarsToPrivate): Copy the shared values to corresponding threadprivate ones.
The reproduced symptom is:
All the threadprivate variables are not visible because of 'undefined address' on the debugger. However, except for this, the calculation seems to go well. The invisible variables can be printed on the screen and can be used in the loop control without errors. Assignment operation also works successfully. When an invisible thread private variable is copied to a local variable, it shows the correct value. These (and full-size project's result) suggest that the invisible values are correctly processed internally.
I hope you will quickly point out my mistake(s).
Which version of Microsoft Visual Studio are you using?
The debugger is that supplied by Microsoft (with extensions provided by Intel).
I've experienced similar problems in older versions of IVF and MS VS. The "hack" solution I used in Debug Build was to declare a pointer to the item in conditional code at the top of the subroutine or function. Then use the pointer(s) in the Debugger to see the variable(s) of interest. With current versions of IVF you can use the ASSOCIATE construct of the language.
VS is that came with IVF as shown below. Other languages are not installed.
Here are the copies from VS about dialog.
Microsoft Visual Studio 2013 Shell (Integrated)
Version 12.0.21005.1 REL
Microsoft .NET Framework
installed version:IDE Standard
Intel® Advisor 2018 Update 3
Intel® Advisor 2018 Update 3, (build 558307), Copyright © 2009-2018 Intel Corporation. All rights reserved.
Intel® Inspector 2018 Update 3
Intel® Inspector 2018 Update 3, (build 558189), Copyright © 2009-2018 Intel Corporation. All rights reserved.
Intel® VTune™ Amplifier 2018 Update 3
Intel® VTune™ Amplifier 2018 Update 3, (build 559005), Copyright © 2009-2018 Intel Corporation. All rights reserved.
intel (R) Parallel Studio XE 2018 Update 3 Composer Edition for Fortran Windows* package ID: w_comp_lib_2018.3.210
intel (R) Parallel Studio XE 2018 Update 3 Composer Edition for Fortran Windows* Microsoft* Visual Studio* 2013 integrated
Jim, 1 week has passed since my last post, in response to your question about the version of VS. According to your comment #7, you experienced similar problem in the older versions of VS. It also implied that your latest VS works well. Therefore, please let me know the versions of VS you were once troubled with and that does show the threadprivate variables successfully, together with the build options, if you kindly already have had examined my code. I don't think that to verify whether my project runs or fails on your debugger is a very difficult task. Your advice about 'hack' -ing the debugger by embedding special codes is interesting, though I have not understood how to do it. However, this is the last option I would like to employ, since I want to concentrate myself on my research work. It is evident that the most powerful and attractive selling point of IVF integration into VS is its easiness-to-use and comfortable development environment, free from the special techniques that only limited number of experts can exploit. Without such wonderful capabilities, I would be forced to work like the way I had to do many years ago with MS fortran on MS DOS.
I am looking forward to seeing my mistakes spotted.
Thanks in advance,
I use MS VS 2013 (2012 with update to 2013). This version has issues with threadprivate variables in IVF. From my understanding the issue with threadprivate variable within the debugger is the resolution of the address of the variable requires execution of code
getBaseOfThreadPrivateArea() + offsetOfVariableOfInterest
When specifying a variable to inspect, the debugger is given only the offsetOfVariableOfInterest (the symbol known in the debug information within the execuitable). The method getBaseOfThreadPrivateArea() is not known to the MS Debugger, and it may vary depending on the tools used to produce the execuitable.
In the old days, Borland Turbo Debugger, one could specify for inspection and expression "anArbitraryUserFunction(&someVariable)". I do not know if (think) MS VS Debugger permits this. IOW the expression could include function calls to your code as well as values.
In my IVF code with threadprivate variables, I found it is easiest to create a single user defined type to contain all the thread private data. Then at the top of the subroutine, insert a pointer to this object.
! moduleWith_context type threadPrivate_t ... ! your stuff end type threadPrivate_t type(threadPrivate_t), pointer :: context => NULL() !$omp threadprivate(context) ... ! near top of program first parallel region !$omp parallel !once only initialization allocate(context) ... ! initialization code if any !$omp end parallel ... subroutine foo use moduleWith_context ... type(threadPrivate_t), pointer :: my_context .. my_context => context ! for debugging ... var = context%x ! Examine my_context%X in debugger
Jim, I'm trying your advice now.
Additional question. You said :
"The method getBaseOfThreadPrivateArea() is not known to the MS Debugger, and it may vary depending on the tools used to produce the executable."
Do you mean that getBaseOfThreadPrivateArea() is to be written in Intel, but no yet? Do you know someone (in Intel) who knows what tools are necessary or to be created, for the production of the desired executable?
>>Do you mean...
No. What I mean is, when you enter the symbolic name into the debugger, the debugger performs a symbol table lookup to produce an address. In this case the address isn't an address, but rather an offset into the threads thread private area, somewhat like a member variable offset. However, the base of the "object" (thread private area) is derivable via an application specific (generally inline) function. The debugger knows not what this function is (with possible exception to MS compiler and MS debugger). On Windows x64, MS software uses the GS segment register (selector). Other vendors on Windows x64 may adopt this or may not. Also the MS Debugger runs in a different process space, and thus (may)has a different mapping for GS than the selected thread of the application being debugged. As to if the MS debugger cah work around this, you will have to consult with MS. When the selector GS is used, the instruction encoding of the memory reference becomes
mov rax, qword ptr GS:[variableOffset]
Where the debugger is likely using
mov rax, qword ptr [variableOffset] ; no GS:, which implicitly means DS: (Data Selector)
Jim, because your advice #10 was a little bit not understandable for me ( pointer of pointer appeared), I modified the source and found it works well.
Let me summarize my understandings as of today. Please make comments if some of them are not correct.
list of threadprivate vars
in the routine,
!$OMP end critical
As you said, while threadprivate MyVars is not visible (undefined address), MyVarPointers displays the contents of MyVars successfully. It was necessary to reset the pointer every time before use (arrays get undefined occasionally), which looks somewhat strange because the array sizes (shape of type) do not change during the execution.
MyVarPointers also becomes invisible when it is declared as threadprivate. Therefore, it seems for me, as long as for the purpose of monitoring the invisible quantities, that the use of pointers is equivalent to the use of shared variables, except the heavy overhead of data transfer in the latter case. (Of course, such overhead should never be allowed in my actual program.)
One feature of using threadprivate is to eliminate the overhead of copying in private data, another is to have global thread context (so as to not pass a private context pointer/reference on CALL or function.