Hello,
I am developing new software using the ScaLAPACK/BLACS implementation in MKL. On my small testing cluster everything works fine. However, after moving to the 'big' cluster to start the computations, the program sometimes fails with the following error (repeated for different ranks):
Rank 38 [Thu Apr 17 16:08:18 2014] [c7-1c1s8n3] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(192): MPI_Recv(buf=0x3847640, count=64, MPI_INT, src=37, tag=5000000, comm=0x84000004, status=0x7fffffff7418) failed
MPI_Recv(113): Invalid tag, value is 5000000
I am using the compiler and libraries from Intel Composer XE 2013 SP1. The cluster is a Cray XC30 and uses Cray's own MPI implementation. I have read that this MPI implementation has its own limit for the 'tag' parameter (I tried different versions), but from what I have read that limit should be higher than 5000000.
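For reference, the tag limit can be queried at run time through the predefined MPI_TAG_UB attribute (the MPI standard only guarantees a minimum of 32767). A minimal check would look something like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* MPI_TAG_UB is a predefined attribute holding the largest tag value
       the implementation accepts. */
    int *tag_ub = NULL, flag = 0, rank;
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && flag)
        printf("MPI_TAG_UB = %d\n", *tag_ub);

    MPI_Finalize();
    return 0;
}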
Has anybody received the same error? Is it related to the MKL implementation or to the MPI implementation?
Many thanks!
Hello, Oriol C.
The cluster components of MKL call MPI functions through BLACS. For binary compatibility with the various MPI implementations, MKL ships several BLACS libraries: libmkl_blacs_intelmpi_lp64 for Intel MPI, libmkl_blacs_openmpi_lp64 for Open MPI, etc.
As far as I know, Cray clusters use MVAPICH MPI (with some modifications). MVAPICH2 should be binary compatible with MPICH2, and so should Intel MPI. So I suppose that on the XC30 you should link your application with libmkl_blacs_intelmpi_lp64 or libmkl_blacs_intelmpi_ilp64.
What MPI do you use on your testing cluster? How do you link BLACS? Do you recompile your application on the XC30, or just copy the executable files?
Hello Evarist,
Thanks for your reply.
I recompile my application on the XC30. On my testing cluster I am using Open MPI with the GNU compiler, and I link the Open MPI BLACS library (mkl_blacs_openmpi_lp64) dynamically:
mpic++ -g -I /opt/intel/mkl/include -fopenmp -m64 -DUNIX -o executable file1.o file2.o ... -L/opt/intel/mkl/lib/intel64/ -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -lmkl_blacs_openmpi_lp64 -ldl -lpthread -lm -Wl,-rpath=/opt/intel/mkl/lib/intel64/
On the XC30, I am using the Intel compiler and I link libmkl_blacs_intelmpi_lp64 statically:
CC -I /opt/intel/composerxe/mkl/include -openmp -O3 -DUNIX -o executable file1.o file2.o ... /opt/intel/composerxe/mkl/lib/intel64/libmkl_scalapack_lp64.a -Wl,--start-group /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/composerxe/mkl/lib/intel64/libmkl_core.a /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group /opt/intel/composerxe/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm
If I understand the documentation correctly, the XC30 uses Cray's MPT, which has its own MPI implementation.
Thank you, Oriol, for sharing this information.
And I am sorry, I didn't quite catch it: does recompiling solve the problem?
I apologize, sometimes I am not very good at explaining things in English.
I always recompile on both clusters (i.e., recompilation does not solve the problem).
Many thanks
Hmm, unfortunately I have no idea why this happens. MKL ScaLAPACK should not use such a big tag number.
Could you please try linking with the other BLACS libraries? The sgimpt one is probably what I would try first...
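For example, starting from your static XC30 link line it should only be a matter of swapping the BLACS archive; I would also keep the BLACS archive inside the --start-group/--end-group block. Something like this (paths as in your post, object files elided):

CC -I /opt/intel/composerxe/mkl/include -openmp -O3 -DUNIX -o executable file1.o file2.o ... \
  /opt/intel/composerxe/mkl/lib/intel64/libmkl_scalapack_lp64.a \
  -Wl,--start-group \
    /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_lp64.a \
    /opt/intel/composerxe/mkl/lib/intel64/libmkl_core.a \
    /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_thread.a \
    /opt/intel/composerxe/mkl/lib/intel64/libmkl_blacs_sgimpt_lp64.a \
  -Wl,--end-group \
  -lpthread -lm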
Ok, many thanks for your help!
I will try libmkl_blacs_sgimpt_lp64 first (and mkl_blacs_openmpi_lp64). I will post back if this solves the problem.
I am seeing the same invalid tag value with ScaLAPACK's PDGESV on a Cray XC30:
MPI_Recv(192): MPI_Recv(buf=0x30a0d50, count=2, MPI_INT, src=8, tag=5000000, comm=0x84000004, status=0x7fffffff0d38) failed
MPI_Recv(113): Invalid tag, value is 5000000
The situation is very similar. On our small cluster I compile with mpiifort and -mkl=cluster, and on the XC30 I use the Cray ftn wrapper for ifort, also with -mkl=cluster. It never crashes like this on my home cluster, and it seems to affect only certain calculations, maybe the ones where pdgesv is used instead of pzgesv...
I don't think that compiling with a different BLACS library will solve my problem.
Hi,
I have tried the other BLACS libraries without success (I am unable to use many of them).
After contacting the XC30 support team, I provided them with a "simple" test version. In my case, the problem seems to appear when the program inverts a matrix using the pdgetrf and pdgetri functions. They told me that it looks like a bug in MKL and recommended switching to their own library (libsci). However, I am using some MKL-specific functions and I would like to continue using MKL.
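The call pattern that triggers it for me is pdgetrf followed by pdgetri on a block-cyclic distributed matrix. A minimal sketch of that pattern in C is below; the matrix size, block size, grid layout and fill values are placeholders for illustration, not my actual test case (a diagonally dominant matrix is used only so the factorization cannot break down numerically):

/* sketch: pdgetrf + pdgetri on a block-cyclic double-precision matrix */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* C BLACS interface and Fortran ScaLAPACK symbols (no standard C header). */
void Cblacs_pinfo(int *mypnum, int *nprocs);
void Cblacs_get(int ctxt, int what, int *val);
void Cblacs_gridinit(int *ctxt, const char *order, int nprow, int npcol);
void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
void Cblacs_gridexit(int ctxt);
int  numroc_(const int *n, const int *nb, const int *iproc, const int *isrc, const int *nprocs);
void descinit_(int *desc, const int *m, const int *n, const int *mb, const int *nb,
               const int *irsrc, const int *icsrc, const int *ctxt, const int *lld, int *info);
void pdgetrf_(const int *m, const int *n, double *a, const int *ia, const int *ja,
              const int *desca, int *ipiv, int *info);
void pdgetri_(const int *n, double *a, const int *ia, const int *ja, const int *desca,
              const int *ipiv, double *work, const int *lwork,
              int *iwork, const int *liwork, int *info);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int n = 2000, nb = 64;          /* placeholder global size and block size */
    const int izero = 0, ione = 1;
    int info, myrank, nprocs;

    Cblacs_pinfo(&myrank, &nprocs);

    /* Build a roughly square process grid. */
    int nprow = 1;
    while ((nprow + 1) * (nprow + 1) <= nprocs) nprow++;
    while (nprocs % nprow) nprow--;
    int npcol = nprocs / nprow;

    int ctxt;
    Cblacs_get(-1, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);

    int myrow, mycol;
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    /* Local dimensions of the block-cyclic distribution. */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld = mloc > 1 ? mloc : 1;

    int desca[9];
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

    double *a = malloc((size_t)lld * nloc * sizeof *a);
    int *ipiv = malloc((size_t)(mloc + nb) * sizeof *ipiv);

    /* Fill a diagonally dominant matrix. */
    for (int j = 0; j < nloc; ++j)
        for (int i = 0; i < mloc; ++i) {
            int gi = ((i / nb) * nprow + myrow) * nb + i % nb;  /* global row    */
            int gj = ((j / nb) * npcol + mycol) * nb + j % nb;  /* global column */
            a[i + (size_t)j * lld] = (gi == gj) ? (double)n : 1.0 / (1.0 + gi + gj);
        }

    /* LU factorization followed by inversion -- the pattern that fails for me. */
    pdgetrf_(&n, &n, a, &ione, &ione, desca, ipiv, &info);

    double wkopt;
    int iwkopt, lwork = -1, liwork = -1;
    pdgetri_(&n, a, &ione, &ione, desca, ipiv, &wkopt, &lwork, &iwkopt, &liwork, &info);
    lwork = (int)wkopt;
    liwork = iwkopt;
    double *work = malloc((size_t)lwork * sizeof *work);
    int *iwork = malloc((size_t)liwork * sizeof *iwork);
    pdgetri_(&n, a, &ione, &ione, desca, ipiv, work, &lwork, iwork, &liwork, &info);

    if (myrow == 0 && mycol == 0)
        printf("pdgetri done, info = %d\n", info);

    free(a); free(ipiv); free(work); free(iwork);
    Cblacs_gridexit(ctxt);
    MPI_Finalize();
    return 0;
}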
Thank you, Oriol and Ariel, for pointing out the functions!
I will try to reproduce the problem using these functions and will come back as soon as I get any updates!
Hi,
We have also been experiencing this issue. Has any progress been made?
Many thanks,
Joly
Info:
Cray XC30
MPICH2 (cray-mpich/6.3.1)
Intel Compiler (intel/14.0.1.106) and MKL
Image Rank 39 [Fri Jun 27 10:29:28 2014] [c0-1c2s12n0] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(192): MPI_Recv(buf=0x51f8de0, count=1, MPI_INT, src=38, tag=5000000, comm=0xc4000001, status=0x7fffffff12b8) failed
MPI_Recv(113): Invalid tag, value is 5000000
Backtrace:
...
...
Dense invert : line 1138
pdgetrf
mpl_lu
mpl_lu_nb
mpl_pivot_comm
MKL_RECV
PMPI_Recv
As far as I know, no progress has been made.
On the Cray XC30, I have to use the libsci replacements when I need these functions.