Optimizations dependent on availability of OpenMP threads

Jonathan_B_ · ‎01-07-2014

I have a working environment where $OMP_NUM_THREADS=1 is enforced (login node), but the system has many more available threads. It seems that when -O2 and -O3 optimizations are included in my compile command, the optimizations hard code instructions assuming OpenMP thread availability based on the host system, or at least the optimizations prevent the graceful handling of $OMP_NUM_THREADS. On execution, I get a segfault on entering __kmp_enter_single().

[bash]home$ ifort diagonalize.f90 -g -debug -assume buffered_io -ipo -fpic -openmp -O2 -I$MKLROOT/include/intel64/lp64 -I$MKLROOT/include $MKLROOT/lib/intel64/libmkl_blas95_lp64.a $MKLROOT/lib/intel64/libmkl_lapack95_lp64.a -Wl,--start-group $MKLROOT/lib/intel64/libmkl_intel_lp64.a $MKLROOT/lib/intel64/libmkl_core.a $MKLROOT/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -o diagonalize
ipo: remark #11000: performing multi-file optimizations
ipo: remark #11006: generating object file /tmp/ipo_ifortuRFRzr.o
home$ diagonalize
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libiomp5.so        00002B52D2AFB71A Unknown               Unknown Unknown
libiomp5.so        00002B52D2ADEE16 Unknown               Unknown Unknown
diagonalize_16     0000000000411514 Unknown               Unknown Unknown
libiomp5.so        00002B52D2B20FE3 Unknown               Unknown Unknown
home$ idbc diagonalize
Intel(R) Debugger for applications running on Intel(R) 64, Version 13.0, Build [79.936.23]
------------------
object file name: diagonalize
Reading symbols from /home/me/diagonalize...done.
(idb) run
Starting program: /home/me/diagonalize
[New Thread 26293 (LWP 26293)]
Program received signal SIGSEGV
__kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
(idb) where
#0 0x00002b92e998671a in __kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
#1 0x00002b92e9969e16 in __kmpc_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
#2 0x0000000000411514 in diagonalize () at /home/me/diagonalize.f90:403
#3 0x000000000040f6ed in diagonalize () at /home/me/diagonalize.f90:333
#4 0x000000000040dfec in main () in /home/me/diagonalize
#5 0x00000038cc81ecdd in __libc_start_main () in /lib64/libc-2.12.so
(idb) set $cmdset='dbx'
(idb) file diagonalize.f90
(idb) list 403,406
    403       !$omp workshare
    404         ! zero result array
    405         resArray(ptr(2):ptr(2)+dim_matrix-1) = 0_dbl_real
    406       !$omp end workshare
[/bash]

If I only use -O1 level optimizations, the program runs fine. However, this is an HPC environment, and for real data sets I will need at least -O2 functioning.

Also, the segfault is usually present and rarely absent from this executable. I'm fairly certain it's due to varying load on the host system and thus, thread availability.

As a side note, the source works fine with gfortran and the GOMP library with -O2 and -O3 (main point is that there's nothing odd with this code).

Any suggestions?

Thanks,
Jonathan

Steven_L_Intel1 · ‎01-08-2014

I doubt that the optimizations assume multiple threads. There are many possible explanations for this behavior. Can you provide us with a test program that demonstrates the problem?

jimdempseyatthecove · ‎01-08-2014

From your error traceback it seems as if the call to __kmp_enter_single caused a stack overflow.

Possible causes are:

a) the stack is too small
b) "resArray" is a pointer/reference with a stride other than 1, and the assignment resArray(...) = ... is attempting to create a stack temporary so that __intel_fast_memset can be called.

Jim Dempsey

Jonathan_B_ · ‎01-09-2014

Thank you both for weighing in - your comments prompted me to step through the instructions to track down where the program fails.

[bash]stopped at [__kmp_enter_single0x00002ad8dc13b709]   movq 0x284cf8(%rip), %rax
(idb) stepi
stopped at [__kmp_enter_single0x00002ad8dc13b710]   movsxd %r12d, %rbx
(idb) stepi
stopped at [__kmp_enter_single0x00002ad8dc13b713]   movq (%rax), %rdx
(idb) stepi
stopped at [__kmp_enter_single0x00002ad8dc13b716]   movq (%rdx,%rbx,8), %rdx
(idb) stepi
stopped at [__kmp_enter_single0x00002ad8dc13b71a]   movq 0x40(%rdx), %rax
(idb) stepi
Thread received signal SEGV
stopped at [__kmp_enter_single0x00002ad8dc13b71a]   movq 0x40(%rdx), %rax[/bash]

So it's failing to move some data from memory to a register.

Unfortunately, I haven't had success in creating a simple test program that exhibits the behavior - I suspect I would need the heuristics to match the main program. I can provide the complete source and single dependency (with makefile), and a small input data file, but should I upload to the forum or submit by PM? All told, this amounts to 1 MB (compressed).

Thanks,
Jonathan

jimdempseyatthecove · ‎01-09-2014

From the code:

movq 0x284cf8(%rip), %rax // get address of global pointer to pointer
movsxd %r12d, %rbx // sign-extend 32-bit (doubleword) index to qword
movq (%rax), %rdx // get address of pointer (indirect first reference)
movq (%rdx,%rbx,8), %rdx // get array[index] (containing pointer to object)
movq 0x40(%rdx), %rax // reference member variable at 0x40 offset from object (** error due to invalid address in rdx)

From the looks of what the code is doing I would say it is referencing internal tables of OpenMP in order to perform the enter of a Single section. My best guess is that some code earlier than the !$omp workshare corrupted the internal tables of OpenMP.

Look for an earlier piece of code that:

a) is indexing an array out of bounds on left side of =
b) using an uninitialized pointer on left side of =
c) using value as reference on left side of = (perhaps from return from call to C/C++ function)
d) Assuming returned pointer is valid when call takes error return (same as c)

Jim Dempsey

jimdempseyatthecove · ‎01-09-2014

On a hunch, try:

if (OMP_IN_PARALLEL()) then
!$omp workshare
resArray(ptr(2):ptr(2)+dim_matrix-1) = 0_dbl_real
!$omp end workshare
else
resArray(ptr(2):ptr(2)+dim_matrix-1) = 0_dbl_real
endif

Jim Dempsey

Steven_L_Intel1 · ‎01-09-2014

I would suggest using Intel Premier Support to report the problem and attach sources, if Jim's suggestions don't help. But I suspect he is on the right track.

Jonathan_B_ · ‎01-13-2014

I tried the hunch you suggested Jim, but it seems that the parallel environment is established prior to the workshare section. Since the initialization is being converted to _intel_fast_memset, I tried changing the workshare directive to a single directive. No luck - segmentation fault in __kmp_enter_single(). I may try substituting _intel_fast_memset, but since it will have to be in an omp single environment I'm not convinced I would have different results.

I added -check bounds, and when that reported no error, I confirmed that all accesses to arrays were within bounds using idb.

I started to export all instructions from the initialization of the parallel environment onward, but stopped when I realized it exceeded 3000 lines. I can pick up assembly quickly enough, but that's too much material to work with right off the bat.

I'm going to do some more analysis and when I have more data I'll start a new thread. Thank you both for your input.

Jonathan

Martyn_C_Intel · ‎01-13-2014

When an application is built with -openmp, the OpenMP run-time library may create an additional monitor thread. So even if OMP_NUM_THREADS is set to 1, there may be more than one thread associated with the process. Could that be an issue on your login node?

Earlier, Jim mentioned stack size limits. Whilst that doesn't sound like the issue, -openmp does cause local arrays to be stored on the stack instead of on the heap. You could try -auto, which does the same thing without OpenMP, to see if this triggers an error, or you could increase the stack limit, e.g. with ulimit -s unlimited (bash & similar shells). There are some suggestions for debugging Fortran OpenMP applications at http://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/ .

Finally, note that in the current Intel compiler, WORKSHARE is implemented with a single thread, i.e., there is no real sharing of work between threads, even when OMP_NUM_THREADS > 1. That's why you see a call to __kmp_enter_single. So for the current Intel compiler, there's little point in coding an OpenMP PARALLEL WORKSHARE construct, unless you want that for other platforms or compilers. Removing it might allow you to work around the immediate problem, for no loss of performance, especially if the problem really is related just to the WORKSHARE implementation.

A multi-threaded version of WORKSHARE may be implemented in a future compiler.
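
If the zeroing does need to be spread across threads, one portable alternative to WORKSHARE is an explicit worksharing loop. The following is only a minimal sketch: the program, the array name res, its size, and the slice bounds are made up for illustration and are not taken from Jonathan's code.

[fortran]program zero_slice_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! hypothetical kind, stands in for dbl_real
  integer, parameter :: n = 1000           ! hypothetical slice length
  real(dp), allocatable :: res(:)
  integer :: i, first

  allocate(res(2*n))
  res = 1.0_dp
  first = n + 1                            ! first element of the slice to zero

  !$omp parallel shared(res, first) private(i)
  ! The iterations of this loop are divided among the threads of the team.
  !$omp do
  do i = first, first + n - 1
     res(i) = 0.0_dp
  end do
  !$omp end do
  !$omp end parallel

  print *, 'sum of zeroed slice =', sum(res(first:first+n-1))
end program zero_slice_demo[/fortran]

Inside an already open parallel region, only the !$omp do ... !$omp end do part would be needed. Whether this is any faster than a single-threaded memset for a given array size would have to be measured.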

Jonathan_B_ · ‎01-14-2014

Hi Martyn, thanks for your input. I do not know the extent of measures used to restrict threads available to users on the login nodes of this system. Since I've successfully run other OpenMP parallel test programs previously, though, I don't think the monitor thread is to blame. I just noticed issues in a previous forum thread where
<code>do i=1,omp_get_num_threads()
...
end do</code>
did not check the loop bounds in a parallel construct, but
<code>do i=omp_get_thread_num()+1, omp_get_num_threads(), omp_get_num_threads()
...
end do</code>
functioned fine (also in this program, solution suggested by Jim Dempsey). This behavior led me to believe that instructions for the number of available threads on the system were written into the assembly, but perhaps conditionally executed (and with optimizations enabled, the conditionals were not functioning properly). Regardless, this was the original impetus for my concerns regarding thread number.

In my program, all arrays are dynamically allocated (all large), so it's not possible for them to be on the stack to the best of my knowledge. Using -auto rather than -openmp produces a fully functioning binary, so the stack size limit is not the issue. I'm using the source code in two environments - the compute server is Xeon based and I'm using the Intel Fortran compiler there; my workstation has an AMD chip, so I'm using gfortran there. I tried changing the workshare directive to a single directive and had the same results (segmentation fault). Since the array is shared, the initialization instruction(s) have to be in either a workshare or single construct. It's not feasible to exit and reenter the parallel environment because this is running in a loop until an exit flag is triggered (and there are some large thread private allocated arrays that would have to be reallocated each time). I will look at the link you provided in depth.

Thanks,
Jonathan

Martyn_C_Intel · ‎01-14-2014

My only other thought is that you are linking to the threaded version of MKL. Perhaps, in this context of login nodes, you should either link to the serial version, or else set an MKL environment variable to limit MKL to a single thread, though I don't see why that would impact this particular workshare or single construct. Otherwise, if you don't spot any other issues, we'd probably need to see an example, as Steve suggested.

Of the suggestions in that link, trying Intel Inspector XE might be the most hopeful.

jimdempseyatthecove · ‎01-14-2014

Jonathan,

For diagnostic purposes create a subroutine along this line

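recursive subroutine BoinkTest
use omp_lib
! aFooArray stands for any shared array visible in this scope
if(.NOT. omp_in_parallel()) then
! not inside an active parallel region: open one and re-enter
!$omp parallel
call BoinkTest
!$omp end parallel
else
!$omp workshare
aFooArray = 0.0
!$omp end workshare
endif
endsubroutine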

Then insert calls to this subroutine throughout your program (use a reverse binary search if the run to the crash is fast).

The object of the procedure is to narrow the search of the section of code causing the problem (presumed corruption of OpenMP data).

Jim Dempsey

Jonathan_B_ · ‎01-15-2014

I cannot be sure what caused the change in behavior, but I made a few modifications and the resulting binary seems to be free of the segmentation fault. Following Martyn's recommendation, I added '-diag-enable sc-parallel3' to the compile line, and saw a slew of errors and warnings - some valid and some not. Since I have a large parallel construct with multiple single constructs within and a main parallel do construct, it seems that the error checking does not follow the state of variables through the entire scope of a parallel construct, otherwise the warnings about deallocating a private array before leaving the parallel construct would not be printed. Still, the following interested me:

<plain>warning #12278: there is a case where this worksharing construct is not enclosed dynamically within a parallel region in order to execute in parallel</plain>

Considering that, I thought I would try changing all error catching code that terminates immediately to use an exit flag and jump to the end of the parallel construct, halting after the dynamic parallel region. That seems to have solved the problem. Am I mistaken that issuing 'stop' is acceptable in a parallel region?
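
For reference, a minimal sketch of the exit-flag pattern described above. The loop, the error condition, and all names are hypothetical and are not taken from the actual program; it assumes the errors are detected inside a parallel do loop.

[fortran]program exit_flag_demo
  implicit none
  integer :: i
  logical :: failed

  failed = .false.

  ! Each thread records errors in its own copy of "failed";
  ! the copies are combined with .or. when the loop ends.
  !$omp parallel do reduction(.or.:failed)
  do i = 1, 100
     if (failed) cycle             ! this thread already hit an error: skip its remaining work
     if (i == 42) failed = .true.  ! hypothetical error condition, flagged instead of calling STOP
  end do
  !$omp end parallel do

  ! Halt only after the parallel region has been left.
  if (failed) then
     print *, 'error detected inside the parallel region, stopping'
     stop 1
  end if
end program exit_flag_demo[/fortran]

With the reduction, a thread only sees its own flag, so other threads finish their share of the loop before the program halts. That costs some extra work but avoids any shared-variable race on the flag.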

I'm going to try Jim's suggestion on the older version to check if the instructions associated with the error checking code cause the problem. If this is an API violation I was unaware of, then I'll refrain from doing it in the future. Otherwise, this may help me build a minimal code to demonstrate the problem.

Thanks,
Jonathan

jimdempseyatthecove · ‎01-15-2014

Jonathan,

If warning #12278 was correct in that the workshare statement could be entered outside a parallel region, then my suggestion may work. However, it might not suppress the warning message.

Have you considered changing workshare to sections?

(this will not fix the execution from outside a parallel region, but it will permit some degree of parallelization)

Jim Dempsey

Jonathan_B_ · ‎01-15-2014

Jim, the warning message was only thrown with '-diag-enable sc-parallel3' present, and since that is greedy about throwing errors and warnings, I removed it from the compile line. It was very informative at pointing out this issue, but I've double checked the OpenMP specs and I'm pretty sure that my usage was not an API violation (section 1.2.2, line 20 "STOP statements are allowed in a structured block"). Still, this pointed out at least one issue - the implementation of the workshare directive.

In this round of testing, I included the following subroutine:
[fortran]recursive Subroutine WorkshareTest(testvector,extent,begin,end)
  use Hamiltonian           ! dbl_real kind number defined here
  use omp_lib
  integer, intent(in) :: extent,begin,end
  real(kind=dbl_real), dimension(extent), intent(inout) :: testvector
  if (.not. omp_in_parallel()) then
    !$omp parallel shared(testvector,extent,begin,end)
    call WorkshareTest(testvector,extent,begin,end)
    !$omp end parallel
  else
    !$omp workshare
    testvector(begin:end) = 0.0
    !$omp end workshare
  end if
end Subroutine WorkshareTest[/fortran]

The call to this subroutine is:
[fortran]call WorkshareTest(workd,size(workd),ipntr(2),ipntr(2)+dim_hamiltonian-1)[/fortran]
where:
o workd is the array in original workshare (space for vectors v and y used in matrix-vector product y = A*v)
o ipntr is an array with various runtime values; element 2 holds the index of the first element in workd that is vector y
o dim_hamiltonian is the dimension of my matrix, also the dimension of arrays v and y

I inserted calls in between all lines after workd was allocated all the way through the parallel section. There were no errors reported by ifort. I inserted a break in idb at the first call to WorkshareTest. The following steps through the code from that point on, and includes a backtrace when the segmentation fault happens.

[plain][1] stopped at [diagonalize:311, 0x000000000040ed6e] call WorkshareTest(workd,size(workd),ipntr(2),ipntr(2)+dim_hamiltonian-1)
(idb) step
stopped at [worksharetest:833, 0x000000000040ed92] recursive Subroutine WorkshareTest(testvector,extent,begin,end)
(idb) step
[1] stopped at [diagonalize:311, 0x000000000040ed9b] call WorkshareTest(workd,size(workd),ipntr(2),ipntr(2)+dim_hamiltonian-1)
(idb) step
stopped at [worksharetest:833, 0x000000000040eda6] recursive Subroutine WorkshareTest(testvector,extent,begin,end)
(idb) step
stopped at [diagonalize:311, 0x000000000040edad] call WorkshareTest(workd,size(workd),ipntr(2),ipntr(2)+dim_hamiltonian-1)
(idb) step
stopped at [worksharetest:833, 0x000000000040edb4] recursive Subroutine WorkshareTest(testvector,extent,begin,end)
(idb) step
stopped at [diagonalize:309, 0x000000000040edc5] ipntr(2)=dim_hamiltonian+1
(idb) step
[1] stopped at [diagonalize:311, 0x000000000040edcf] call WorkshareTest(workd,size(workd),ipntr(2),ipntr(2)+dim_hamiltonian-1)
(idb) step
stopped at [worksharetest:833, 0x000000000040edd9] recursive Subroutine WorkshareTest(testvector,extent,begin,end)
(idb) step
stopped at [worksharetest:838, 0x000000000040ee26] if (.not. omp_in_parallel()) then
(idb) step
stopped at [worksharetest:838, 0x000000000040ee46] if (.not. omp_in_parallel()) then
(idb) step
stopped at [worksharetest:839, 0x000000000040ee5e] !$omp parallel shared(testvector,extent,begin,end)
(idb) step
Thread received signal SEGV
The "step" command was not completed.
stopped at [__intel_new_memset0x00002b74a881ffb4]
(idb) where
>0 __intel_new_memset(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a881ffb4]
1 _intel_fast_memset.J(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a8816cb6]
2 ___kmp_allocate(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87a8fe6]
3 __kmpc_serialized_parallel(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbfc0]
4 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dba2e]
5 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
6 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
7 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
8 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
9 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
10 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
11 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
12 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
13 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
14 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
15 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
16 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
17 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
18 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
19 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
20 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
21 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
22 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
23 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
24 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
25 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
26 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
27 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
28 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
29 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
30 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
31 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
32 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
33 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
34 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
35 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
36 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
37 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
38 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
39 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
40 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
41 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
42 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
43 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
44 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3]
45 __kmp_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87dbadb]
46 __kmpc_fork_call(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87bbbf8]
47 worksharetest_(...) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":839, 0x00000000004155e1]
48 worksharetest($01=, $02=, $03=, $04=) ["/home1/02146/jblair42/src/diagonalize_parallel_csr_sym.f90":840, 0x00000000004156bb]
49 __kmp_invoke_microtask(...) ["/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so": 0x00002b74a87fefe3][/plain]

Why my modification to use an exit flag as opposed to issuing 'stop' in the error checking code worked is not yet determined. However, I want to double check that the subroutine is standards-compliant before using it in a minimal test case.

Thank you all for helping with this. Whether this turns out to be one or two bugs (or one bug and something else), I'm finally getting some resolution on a four month old issue.

Best,
Jonathan

Martyn_C_Intel · ‎01-15-2014

Yes, STOP is allowed within a structured block in a parallel region. I also tried a little test case with this, and it seemed to work fine.

I hadn't realized you were talking about a recursive subroutine, for which the initial instance is (presumably) not in a parallel region, but subsequent ones are. That seems like a whole extra layer of complication. Perhaps you should be building with the -recursive switch, though I suspect that compiling and linking with -openmp may be sufficient. Do you see seg faults if the subroutine is not recursive? Your traceback seems to show more recursions than I would expect from your code. I'm not sure I've thought this through fully, but isn't the subroutine WorkshareTest going to get called recursively by each thread, each of which then tries to zero the array in a workshare construct? With a "single" implementation, does that mean that only one of the recursive calls will do any zeroing? I'm not clear how this will work.

I would start by printing out the result of omp_in_parallel(), and if true, also the result of omp_get_thread_num and omp_get_num_threads, for each call, both outside and inside the workshare construct.
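
A minimal sketch of such a diagnostic is shown below; the program and the routine name report_omp_state are hypothetical. It also prints omp_get_level(), which is an addition to the suggestion above: per the OpenMP specification, omp_in_parallel() is true only inside an active parallel region (a team of more than one thread), so with OMP_NUM_THREADS=1 it stays false even inside a parallel construct, while omp_get_level() counts enclosing parallel regions whether active or not.

[fortran]program omp_state_demo
  use omp_lib
  implicit none

  call report_omp_state('serial part')

  !$omp parallel
  call report_omp_state('inside parallel')
  !$omp end parallel

contains

  subroutine report_omp_state(label)
    character(len=*), intent(in) :: label
    ! Serialize the output so lines from different threads do not interleave.
    !$omp critical (report_io)
    print '(a,a,l2,a,i0,a,i0,a,i0)', trim(label), ': in_parallel=', omp_in_parallel(), &
          ' level=', omp_get_level(), ' thread=', omp_get_thread_num(), &
          ' of ', omp_get_num_threads()
    !$omp end critical (report_io)
  end subroutine report_omp_state

end program omp_state_demo[/fortran]

If in_parallel prints F while level is greater than zero, a routine that keeps opening parallel regions until omp_in_parallel() becomes true would never stop recursing. That may be worth ruling out, given the depth of the traceback.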


jimdempseyatthecove · ‎01-16-2014

Martyn,

-openmp implicitly sets -recursive

Also, presumably, if the outermost level of WorkshareTest is called from within a parallel region, the programmer (Jonathan) would provide different, non-overlapping begin and end values for each thread.

Jonathan,

The SEGV occurred at the start of a parallel region (on the !$omp parallel...)

For this to occur there, either a) memory was corrupted as described in #5, b) there is insufficient stack, or c) your application is out of virtual memory.

Insufficient stack for a thread can surprisingly be caused by specifying "unlimited" (only one thread can rightfully claim "unlimited").
Insufficient stack can also be caused by not setting a sufficiently large stack for each of the threads (while not consuming all of virtual memory in combination with code, heap and static data).
Insufficient stack - or rather - insufficient virtual memory can be caused by too small a page file (swap file).
Insufficient stack can be caused by using stack when heap should be used instead.

An interesting diagnostic may be to declare and define a local variable in WorkshareTest and print out the LOC(localVariable). This will track the stack pointer (of the context of the thread making the call).
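
A minimal sketch of that diagnostic follows. The program and the routine name probe are hypothetical, and LOC is a non-standard extension (accepted by ifort and gfortran) that returns the address of its argument as an integer.

[fortran]program stack_probe_demo
  use omp_lib
  implicit none

  call probe('serial  ')

  !$omp parallel
  call probe('parallel')
  !$omp end parallel

contains

  recursive subroutine probe(label)
    character(len=*), intent(in) :: label
    integer :: marker   ! assigned below rather than initialized here, so it is not implicitly SAVEd

    marker = 0
    ! The address of a local variable approximates the calling thread's stack pointer.
    !$omp critical (probe_io)
    print '(a,a,i0,a,z16.16)', label, ' thread ', omp_get_thread_num(), &
          ' local variable at 0x', loc(marker)
    !$omp end critical (probe_io)
  end subroutine probe

end program stack_probe_demo[/fortran]

Calling such a routine at each recursion level should show the address moving steadily in one direction if the stack keeps growing.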

Jim Dempsey