Intel® oneAPI Math Kernel Library

threadsafe BLACS/SCALAPACK

andrews_

Hello, I am writing a hybrid OpenMP/MPI program which will end up calling ScaLAPACK routines from threaded regions (to solve independent problems simultaneously), but I am unable to use the BLACS send/receives from a parallel region without a high (>90%) crash rate. I have successfully used the thread-safe Intel MPI library to do multiple threaded MPI send/receives. Three types of error come up for the same code:

  1. A stall in the 'parallel BLACS' region
  2. A segfault in libmkl_blacs ... + libpthread (see snippet 1)
  3. An MPI error (see snippet 2)

[bash] xxx@yyy:~/scalapack> ...

PARALLEL BLACS : 1

forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image PC Routine Line Source

libpthread.so.0 00002ADF44483C10 Unknown Unknown Unknown

libmkl_blacs_inte 00002ADF4425C6BB Unknown Unknown Unknown [/bash]

 [bash] ... 

PARALLEL BLACS : 1

Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x6107b0, flag=0x7fff4ce1cd80, status_array=0x60fe80) failed
MPI_Testall(123): The supplied request in array element 1 was invalid (kind=15)
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1) [/bash]

The source code is attached, with the compiler line and the relevant parts of the PBS submission as a footer.

I am using Intel Cluster Studio 2012, ifort 12.1.0, and Intel MPI 4.0 Update 3; I am unsure which MKL version I have (it came with Cluster Studio). I suspect this is a threading issue, since the program works with OMP_NUM_THREADS=1.
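For reference, a minimal, hypothetical sketch of the kind of threaded MPI exchange described above (this is not the attached source; the program name and buffers are made up), assuming full thread support is requested and the thread-safe library is linked with -mt_mpi:

[fortran]
! Hypothetical sketch (not the attached program): each OpenMP thread performs
! its own MPI exchange, using the thread number as the message tag so that
! messages belonging to different threads cannot be matched to each other.
program threaded_mpi_ring
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, provided, rank, nprocs, tid, dest, src
  integer :: status(MPI_STATUS_SIZE)
  double precision :: sendbuf(10), recvbuf(10)

  ! Request full thread support (available in Intel MPI when linking with -mt_mpi).
  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
  if (provided < MPI_THREAD_MULTIPLE) stop 'MPI_THREAD_MULTIPLE not provided'

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  dest = mod(rank + 1, nprocs)            ! next rank in the ring
  src  = mod(rank - 1 + nprocs, nprocs)   ! previous rank in the ring

!$omp parallel private(tid, sendbuf, recvbuf, status, ierr)
  tid = omp_get_thread_num()
  sendbuf = dble(rank)

  ! The thread number is the tag, so the receive posted by thread "tid" on the
  ! neighbouring rank matches only the send posted by thread "tid" here.
  call MPI_Sendrecv(sendbuf, 10, MPI_DOUBLE_PRECISION, dest, tid, &
                    recvbuf, 10, MPI_DOUBLE_PRECISION, src,  tid, &
                    MPI_COMM_WORLD, status, ierr)
!$omp end parallel

  call MPI_Finalize(ierr)
end program threaded_mpi_ring
[/fortran]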

Any help would be greatly appreciated! 

Thanks,

Andrew

barragan_villanueva_
Hi, I was able to reproduce your problem.

Link line:

[bash]mpiifort -mt_mpi -openmp -I$MKLROOT/include -check bounds -traceback -g blacs.f90 -o blacs -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm[/bash]

Run:

[bash]OMP_NUM_THREADS=1 LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH mpirun -np 8 ./blacs
...
BLACS PARALLEL COMPLETED ALL 1000 TRIALS[/bash]

But with:

[bash]% OMP_NUM_THREADS=2 LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH mpirun -np 2 ./blacs
MPI INITIALISATION : 3 3 0 0
MPI INITIALISATION : 3 3 0 0
MPI RANK : 1 / 2
MPI RANK : 0 / 2
MPI_SEND : 0 0
MPI_SEND : 0 0
MPI_RECV : 0 0
MPI_RECV : 0 0
MPI PARALLEL TEST COMPLETED
MPI PARALLEL TEST COMPLETED
BLACS SETUP : 0 0 0 0
BLACS SETUP : 0 1 0 1
BLACS SETUP : 1 0 0 0
BLACS SETUP : 1 1 0 1
BLACS SINGLE COMPLETE : 0 1
BLACS SINGLE COMPLETE : 0 0
BLACS SINGLE COMPLETE : 1 0
BLACS SINGLE COMPLETE : 1 1
Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x23241b0, flag=0x7fffde64c190, status_array=0x2323900) failed
MPI_Testall(124): The supplied request in array element 1 was invalid (kind=15)
Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x151701b0, flag=0x41a9bd10, status_array=0x1516f880) failed
MPI_Testall(124): The supplied request in array element 1 was invalid (kind=15)[/bash]

However with:

[bash]OMP_NUM_THREADS=8 LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH mpirun -np 8 ./blacs
...
PARALLEL BLACS : 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 5 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 2 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 4 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 5 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 4 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 2 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 6 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 6 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 2 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 4 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 3 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 6 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 3 which is greater than the upper bound of 1
Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x1bf483b0, flag=0x41427d10, status_array=0x1bf47880) failed
MPI_Testall(124): The supplied request in array element 1 was invalid (kind=15)[/bash]
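As an aside, the subscript errors in the 8-thread run suggest the test's CONTXT array has only two entries (Andrew confirms further down that the program hard-codes two threads per node). A hypothetical fragment, with assumed names rather than code from the attached test, showing how a per-thread context array could instead be sized from the run-time thread count:

[fortran]
! Hypothetical sketch, not the attached test: allocate one BLACS context per
! OpenMP thread, sized from the run-time thread count, so a run with
! OMP_NUM_THREADS=8 cannot index past the upper bound of the array.
program per_thread_contexts
  use omp_lib
  implicit none
  integer, allocatable :: contxt(:)
  integer :: nthreads, i, iam, nprocs

  call blacs_pinfo(iam, nprocs)          ! my BLACS process id and the process count

  nthreads = omp_get_max_threads()
  allocate(contxt(0:nthreads-1))         ! one context per OpenMP thread

  do i = 0, nthreads - 1
    call blacs_get(-1, 0, contxt(i))               ! start from the default system context
    call blacs_gridinit(contxt(i), 'R', 1, nprocs) ! a 1 x nprocs process row per context
  end do

  ! ... per-thread BLACS communication would go here ...

  do i = 0, nthreads - 1
    call blacs_gridexit(contxt(i))
  end do
  call blacs_exit(0)
end program per_thread_contexts
[/fortran]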
andrews_
Hey Victor,

Thanks for the reply! Do you know if there is any way around this problem without compiling a separate BLACS (and then ScaLAPACK) linked against the thread-safe MPI?

I figure another possible way around this would be to spawn the same number of MPI processes as there are cores and distribute the data for the ScaLAPACK routines across these just before the solve, making separate contexts for each set of OMP threads. The primary question concerning this is: what is the relative cost of communication between MPI processes on the same physical compute node versus those that require the network?

Thanks,
Andrew
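A rough, hypothetical sketch of that second idea, with assumed names (ngroups, the even split of ranks): one MPI rank per core, the ranks divided into independent groups, and each group given its own context through BLACS_GRIDMAP so the groups can run their ScaLAPACK solves concurrently without threading BLACS at all:

[fortran]
! Hypothetical sketch of one-rank-per-core with per-group BLACS contexts;
! ngroups and the even split of ranks are assumptions for illustration.
program grouped_contexts
  implicit none
  integer, parameter :: ngroups = 2      ! assumed number of independent problems
  integer :: iam, nprocs, per_group, g, j, myctxt
  integer, allocatable :: usermap(:,:)

  call blacs_pinfo(iam, nprocs)
  per_group = nprocs / ngroups           ! assumes nprocs divides evenly
  g = iam / per_group                    ! which group this rank belongs to

  ! List the ranks that make up this group's 1 x per_group process grid.
  allocate(usermap(1, per_group))
  do j = 1, per_group
    usermap(1, j) = g*per_group + j - 1
  end do

  call blacs_get(-1, 0, myctxt)                        ! default system context
  call blacs_gridmap(myctxt, usermap, 1, 1, per_group) ! context over this group only

  ! ... each group can now run its own ScaLAPACK solve on myctxt ...

  call blacs_gridexit(myctxt)
  call blacs_exit(0)
end program grouped_contexts
[/fortran]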
barragan_villanueva_
Andrew,

After looking at your code I see a correct fragment for the serial BLACS testing:

[fortran]
! TEST THE SERIAL BLACS
do i = 0, 1
  CALL DGESD2D(CONTXT(i), 10, 1, SEND, 10, 0, MOD(COL(i)+1, MPI_PROCS))
  CALL DGERV2D(CONTXT(i), 10, 1, RECV, 10, 0, MOD(MPI_PROCS + COL(i) - 1, MPI_PROCS))
  WRITE(*,*) 'BLACS SINGLE COMPLETE : ', i, col(i)
  CALL BLACS_BARRIER(CONTXT(i), 'A')
end do
[/fortran]

But the next, parallel code fragment for BLACS is unclear to me: what actions are supposed to be done in parallel?

[fortran]
! TEST THE PARALLEL BLACS
do i = 1, 100
!$OMP PARALLEL
  CALL DGESD2D(CONTXT(THREADNUM), 10, 1, SEND, 10, 0, MOD(COL(THREADNUM)+1, MPI_PROCS))
  CALL DGERV2D(CONTXT(THREADNUM), 10, 1, RECV, 10, 0, MOD(MPI_PROCS + COL(THREADNUM) - 1, MPI_PROCS))
  write(*,*) 'PARALLEL BLACS : ', i
!$OMP END PARALLEL
end do
[/fortran]

Questions: Why should it work? Why are the SEND, RECV arrays used? Who sends/receives data in parallel? Also, the defined PSEND, PRECV arrays are not used here.
andrews_
Hey Victor,

For this code chunk

[fortran]
! TEST THE PARALLEL BLACS
do i = 1, 100
!$OMP PARALLEL
  CALL DGESD2D(CONTXT(THREADNUM), 10, 1, SEND, 10, 0, MOD(COL(THREADNUM)+1, MPI_PROCS))
  CALL DGERV2D(CONTXT(THREADNUM), 10, 1, RECV, 10, 0, MOD(MPI_PROCS + COL(THREADNUM) - 1, MPI_PROCS))
  write(*,*) 'PARALLEL BLACS : ', i
!$OMP END PARALLEL
end do
[/fortran]

the hope was that each threadnum = OMP_GET_THREAD_NUM() would have an associated BLACS context, so that the same threadnum on each MPI process would form a communication group (like having used the threadnum as the communication ID for the MPI send/receives), and that multiple send/receives could be done in parallel over a different context for each thread, in this case sending to the next (cyclic) column of the context.

Replacing the shared SEND/RECV arrays with the threadprivate PSEND/PRECV in the parallel BLACS region results in the same set of errors. Sorry not to have that in the version I posted; I was messing around with the program trying to pin down the error!

To answer your questions:

Why should it work? From what I can gather, a BLACS communication invokes a set of MPI commands for communication between MPI processes. So my post is really asking: if a thread-safe MPI library is available, is a thread-safe BLACS (and thus ScaLAPACK) library also possible/available (with the hope that I haven't made some ridiculous coding error)? This brings up the question of how the BLACS (and ScaLAPACK) libraries in MKL were built by the installer.

Why are the SEND, RECV arrays used? Sorry again for the edited version! Replacing these with PSEND/PRECV doesn't seem to help.

Who sends/receives data in parallel? The threads with the same threadnum = OMP_GET_THREAD_NUM() on each MPI process, sending to the next cyclic column in the BLACS context, with one context for each threadnum.

One last comment: this program hard-codes a maximum of 2 OpenMP threads per node (this was to do with the cluster I was running it on).

Thanks again!
Andrew
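For comparison, a hypothetical pure-MPI sketch of that intent, with one duplicated communicator per thread number standing in for the per-thread BLACS context (the program and variable names are made up, not taken from the attached code):

[fortran]
! Hypothetical sketch: threads with the same thread number on every rank
! exchange messages over their own communicator, mirroring the idea of one
! BLACS context per thread number.
program per_thread_comms
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, provided, rank, nprocs, nthreads, i, tid, dest, src
  integer :: status(MPI_STATUS_SIZE)
  integer, allocatable :: comms(:)
  double precision :: psend(10), precv(10)

  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
  if (provided < MPI_THREAD_MULTIPLE) stop 'MPI_THREAD_MULTIPLE not provided'
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Communicator creation is collective, so it is done serially here,
  ! before the parallel region, one communicator per thread number.
  nthreads = omp_get_max_threads()
  allocate(comms(0:nthreads-1))
  do i = 0, nthreads - 1
    call MPI_Comm_dup(MPI_COMM_WORLD, comms(i), ierr)
  end do

  dest = mod(rank + 1, nprocs)            ! next rank (cyclic)
  src  = mod(rank - 1 + nprocs, nprocs)   ! previous rank (cyclic)

!$omp parallel private(tid, psend, precv, status, ierr)
  tid = omp_get_thread_num()
  psend = dble(rank)
  ! Each thread communicates only over its own communicator.
  call MPI_Sendrecv(psend, 10, MPI_DOUBLE_PRECISION, dest, 0, &
                    precv, 10, MPI_DOUBLE_PRECISION, src,  0, &
                    comms(tid), status, ierr)
!$omp end parallel

  call MPI_Finalize(ierr)
end program per_thread_comms
[/fortran]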