
Fatal error in MPI_Recv: Invalid tag


Hello,

I am developing new software using the MKL implementation of ScaLAPACK/BLACS. On my small testing cluster everything worked fine. However, since moving to the 'big' cluster to start computations, the program sometimes fails with the following error (repeated for different ranks):

Rank 38 [Thu Apr 17 16:08:18 2014] [c7-1c1s8n3] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(192): MPI_Recv(buf=0x3847640, count=64, MPI_INT, src=37, tag=5000000, comm=0x84000004, status=0x7fffffff7418) failed
MPI_Recv(113): Invalid tag, value is 5000000

I am using the compiler and libraries from Intel Composer XE 2013 SP1. The cluster is based on the Cray XC30 series and uses Cray's own MPI implementation. I have read that this MPI implementation has its own limits on the 'tag' parameter (I tried different versions). However, I have also read that the 'tag' limit is higher than 5000000.

Has anybody received the same error? Is it related to the MKL implementation or to the MPI implementation?

Many thanks!


Accepted Solutions

Hello, Oriol C.

The cluster components of MKL call MPI functions through BLACS. For binary compatibility with different MPI implementations, MKL ships several BLACS libraries: libmkl_blacs_intelmpi_lp64 for Intel MPI, libmkl_blacs_openmpi_lp64 for Open MPI, etc.

As far as I know, Cray clusters use MVAPICH MPI (with some modifications). MVAPICH2 should be binary-compatible with MPICH2, as is Intel MPI. So I suppose that on the XC30 you should link your application with libmkl_blacs_intelmpi_lp64 or libmkl_blacs_intelmpi_ilp64.

What MPI do you use on your testing cluster? How do you link BLACS? Do you recompile your application on the XC30 or just copy the executable files?

11 Replies

Hello Evarist,

Thanks for your reply.

I recompile my application on the XC30. On my testing cluster I am using Open MPI with the GNU compiler, and I link the Open MPI BLACS library (mkl_blacs_openmpi_lp64) dynamically:

mpic++ -g -I /opt/intel/mkl/include -fopenmp -m64  -DUNIX -o executable file1.o file2.o ... -L/opt/intel/mkl/lib/intel64/ -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -lmkl_blacs_openmpi_lp64 -ldl -lpthread -lm -Wl,-rpath=/opt/intel/mkl/lib/intel64/

On the XC30, I am using the Intel compiler and I link libmkl_blacs_intelmpi_lp64 statically:

CC -I /opt/intel/composerxe/mkl/include -openmp -O3  -DUNIX -o executable file1.o file2.o ... /opt/intel/composerxe/mkl/lib/intel64/libmkl_scalapack_lp64.a -Wl,--start-group /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/composerxe/mkl/lib/intel64/libmkl_core.a /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group /opt/intel/composerxe/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm

If I understand the documentation correctly, the XC30 uses Cray MPT, which has its own MPI implementation.


Thank you, Oriol, for sharing information.

And I am sorry, I didn't quite catch it: does recompilation solve the problem?


I apologize; sometimes I am not very good at explaining in English.

I always recompiled on both clusters (i.e. recompilation does not solve the problem).

Many thanks


Hmm, unfortunately I have no idea why this happens. MKL ScaLAPACK should not use such a big tag number.

Could you please try linking with the other BLACS libraries? The sgimpt one is probably what I would try first...


Ok, many thanks for your help!

I will try libmkl_blacs_sgimpt_lp64 first (and mkl_blacs_openmpi_lp64). I will post back whether this solves the problem.
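For reference, swapping the BLACS layer only changes one archive in the static link line quoted earlier; a sketch of the SGI MPT variant (paths and all other flags are assumed unchanged from the previous post):

```shell
# Hypothetical variant of the earlier XC30 link line: only the BLACS
# archive changes, from libmkl_blacs_intelmpi_lp64.a to the SGI MPT one.
CC -I /opt/intel/composerxe/mkl/include -openmp -O3 -DUNIX -o executable \
   file1.o file2.o ... \
   /opt/intel/composerxe/mkl/lib/intel64/libmkl_scalapack_lp64.a \
   -Wl,--start-group \
     /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_lp64.a \
     /opt/intel/composerxe/mkl/lib/intel64/libmkl_core.a \
     /opt/intel/composerxe/mkl/lib/intel64/libmkl_intel_thread.a \
   -Wl,--end-group \
   /opt/intel/composerxe/mkl/lib/intel64/libmkl_blacs_sgimpt_lp64.a \
   -lpthread -lm
```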


I am seeing the same invalid tag value with ScaLAPACK's PDGESV on a Cray XC30:

 MPI_Recv(192): MPI_Recv(buf=0x30a0d50, count=2, MPI_INT, src=8, tag=5000000, comm=0x84000004, status=0x7fffffff0d38) failed
MPI_Recv(113): Invalid tag, value is 5000000


The situation is very similar. On our small cluster I compile with mpiifort and -mkl=cluster, and on the XC30 I use the Cray ftn wrapper for ifort, also with -mkl=cluster. It never crashed like this on my home cluster, and it seems to affect only certain calculations, maybe the ones where pdgesv is used instead of pzgesv...

I don't think that compiling with a different BLACS will solve my problem.


Hi,

I have tried other BLACS libraries without success (I am unable to use many of them).

After contacting the IT support for the XC30, I provided them with a "simple" test version. In my case, the problem seems to appear when the program inverts a matrix using the pdgetrf and pdgetri functions. They told me it looks like a bug in MKL and recommended that I switch to their own libraries (libsci). However, I am using some MKL-specific functions and would like to continue using MKL.


Thank you, Oriol and Ariel, for pointing out the functions!

I will try to reproduce the problem using these functions and will come back as soon as I get any updates!


Hi,

We have also been experiencing this issue. Has any progress been made?

Many thanks,

Joly

Info:

Cray XC30

MPICH2 (cray-mpich/6.3.1)

Intel Compiler (intel/14.0.1.106) and MKL

Rank 39 [Fri Jun 27 10:29:28 2014] [c0-1c2s12n0] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(192): MPI_Recv(buf=0x51f8de0, count=1, MPI_INT, src=38, tag=5000000, comm=0xc4000001, status=0x7fffffff12b8) failed
MPI_Recv(113): Invalid tag, value is 5000000

Backtrace:

...

...

Dense invert : line 1138

pdgetrf

mpl_lu

mpl_lu_nb

mpl_pivot_comm

MKL_RECV

PMPI_Recv


As far as I know, no progress has been made.

On the Cray XC30, I have to use the libsci replacement when I need these functions.
