Hello, I am writing a hybrid OpenMP/MPI program that will end up calling ScaLAPACK routines from threaded regions (to solve independent problems simultaneously), but I am unable to use the BLACS sends/receives from a parallel region without a high (>90%) crash rate. I have successfully used the thread-safe Intel MPI library to do multiple threaded MPI sends/receives. Three types of error come up for the same code:
- Stall in the 'parallel blacs' region
- Segfault in libmkl_blacs ... + libpthread (see snippet 1)
- MPI error (see snippet 2)
[bash] xxx@yyy:~/scalapack> ...
PARALLEL BLACS : 1
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 00002ADF44483C10 Unknown Unknown Unknown
libmkl_blacs_inte 00002ADF4425C6BB Unknown Unknown Unknown [/bash]
[bash] PARALLEL BLACS : 1
Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x6107b0, flag=0x7fff4ce1cd80, status_array=0x60fe80) failed
MPI_Testall(123): The supplied request in array element 1 was invalid (kind=15)
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1) [/bash]
The source code is attached, with the compiler line and the relevant parts of the PBS submission as a footer.
I am using Intel Cluster Studio 2012, ifort 12.1.0, and Intel MPI version 4.0 update 3. I am unsure which MKL version I have (it came with Cluster Studio). I suspect this is a threading issue, since the program works with OMP_NUM_THREADS=1.
Any help would be greatly appreciated!
[fortran]
! TEST THE PARALLEL BLACS
do i = 1, 100
!$OMP PARALLEL
  CALL DGESD2D(CONTXT(THREADNUM), 10, 1, SEND, 10, 0, MOD(COL(THREADNUM)+1, MPI_PROCS))
  CALL DGERV2D(CONTXT(THREADNUM), 10, 1, RECV, 10, 0, MOD(MPI_PROCS + COL(THREADNUM) - 1, MPI_PROCS))
  write(*,*) 'PARALLEL BLACS : ', i
!$OMP END PARALLEL
end do
[/fortran]

The hope was that each threadnum = OMP_GET_THREAD_NUM() would have an associated BLACS context, so that the same threadnum on each MPI process would form a communication group (like using the threadnum as the communication ID for the MPI sends/receives), and that multiple sends/receives could be done in parallel over a different context for each thread. In this case each thread sends to the next (cyclic) column of its context.

Replacing the shared send/recv arrays with the threadprivate psend/precv in the parallel BLACS region results in the same set of errors. Sorry for not having that in the version I posted; I was stuffing around with the program trying to pin down the error!

To answer your questions:

Why should it work? From what I can gather, a BLACS communication invokes a set of MPI commands for communication between MPI processes. So my post is really asking: if a thread-safe MPI library is available, is a thread-safe BLACS (and thus ScaLAPACK) library also possible/available (with the hope that I haven't made some ridiculous coding error! x.X)? This brings up the question of how the BLACS (and ScaLAPACK) libraries in the MKL were built by the installer.

Why are the SEND, RECV arrays used? Sorry again for the edited version! Replacing these with PSEND/PRECV doesn't seem to help.

Who sends/receives data in parallel? The threads with the same threadnum = OMP_GET_THREAD_NUM() on each MPI process, each sending to the next cyclic column in the BLACS context, with one context for each threadnum.

One last comment: this program has a hard-coded maximum of 2 OpenMP threads per node! (This was to do with the cluster I was running it on.)

Thanks again!
Andrew
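P.S. To make the intended neighbour pattern concrete, here is a minimal standalone sketch of the cyclic destination/source computation used in the test loop. It makes no BLACS or MPI calls and is not taken from the attached source: the module name `ring_neighbours`, the functions `next_col` and `prev_col`, and the process count of 4 are all made up for illustration, with `col` and `nprocs` standing in for COL(THREADNUM) and MPI_PROCS.

```fortran
module ring_neighbours
  implicit none
contains
  ! Column that receives this column's data: the next one, cyclically.
  ! (Hypothetical helper, mirrors MOD(COL(THREADNUM)+1, MPI_PROCS).)
  pure integer function next_col(col, nprocs)
    integer, intent(in) :: col, nprocs
    next_col = mod(col + 1, nprocs)
  end function next_col

  ! Column this column receives from: the previous one, cyclically.
  ! nprocs is added before subtracting so MOD never sees a negative value.
  ! (Hypothetical helper, mirrors MOD(MPI_PROCS + COL(THREADNUM) - 1, MPI_PROCS).)
  pure integer function prev_col(col, nprocs)
    integer, intent(in) :: col, nprocs
    prev_col = mod(nprocs + col - 1, nprocs)
  end function prev_col
end module ring_neighbours

program show_ring
  use ring_neighbours
  implicit none
  integer, parameter :: nprocs = 4   ! assumed process count, illustration only
  integer :: col

  ! Print the send/receive partner of every column in the ring.
  do col = 0, nprocs - 1
    write(*,'(a,i0,a,i0,a,i0)') 'col ', col, ' sends to ', next_col(col, nprocs), &
         ', receives from ', prev_col(col, nprocs)
  end do
end program show_ring
```

With 4 columns, column 3 sends to column 0 and column 0 receives from column 3, so every send in a given context has exactly one matching receive. The pattern itself should therefore be consistent, which is part of why I suspect the library rather than the indexing.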