Intel® oneAPI Math Kernel Library

MKL FFT Cluster: real DFTI_BAD_DESCRIPTOR problem or internal MPI fault?

Filippo_Spiga
Beginner
Hi all, I have a problem with the MKL Cluster FFT routines. I followed the code example "1D In-place Cluster FFT Computations" and tried to run a simple parallel job on two different clusters equipped with different processors (Intel Xeon CPU X5570 @ 2.93GHz and Dual-Core AMD Opteron Processor 2218), different resource managers (SLURM and LSF) and different versions of the MKL library (10.0.010 and 10.0.2). The underlying MPI runtime environment is Open MPI 1.3.2, compiled with the Intel Compiler 10.1. On both clusters, the test program crashes at the same point with the same error.
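
For reference, here is a minimal sketch of that kind of test program (modeled on the cited MKL example; the transform length of 1024 and the STEP print are illustrative assumptions of mine, not the attached code itself):

/* Minimal 1D in-place cluster FFT test, modeled on the MKL example
   "1D In-place Cluster FFT Computations" (sketch, error checks omitted). */
#include <stdio.h>
#include <stdlib.h>          /* proper prototype for malloc */
#include <mpi.h>
#include <mkl_cdft.h>

int main(int argc, char *argv[])
{
    DFTI_DESCRIPTOR_DM_HANDLE desc;
    MKL_LONG len = 1024, local_size;
    double _Complex *data;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("I'm %d and I have passed STEP 1\n", rank);

    /* The MPI_ERR_COMM abort shown below happens inside this call
       (see the accepted answer for the reason). */
    DftiCreateDescriptorDM(MPI_COMM_WORLD, &desc,
                           DFTI_DOUBLE, DFTI_COMPLEX, 1, len);

    /* Each rank allocates only its local part of the distributed array. */
    DftiGetValueDM(desc, CDFT_LOCAL_SIZE, &local_size);
    data = malloc(local_size * sizeof(*data));

    DftiCommitDescriptorDM(desc);
    DftiComputeForwardDM(desc, data);   /* in-place forward transform */

    DftiFreeDescriptorDM(&desc);
    free(data);
    MPI_Finalize();
    return 0;
}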

This is the output:
$ cat out
I'm 3 and I have passed STEP 1
I'm 0 and I have passed STEP 1
I'm 1 and I have passed STEP 1
I'm 2 and I have passed STEP 1

TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00000 node0027 ./x.fft_mpi_mkl Exit (5) 09/08/2009 21:53:25
00001 node0028 ./x.fft_mpi_mkl Exit (5) 09/08/2009 21:53:25
00002 node0023 ./x.fft_mpi_mkl Exit (5) 09/08/2009 21:53:25
00003 node0021 ./x.fft_mpi_mkl Exit (5) 09/08/2009 21:53:25

and this is the error reported by Open MPI:
--------------------------------------------------------------------------
[node0021:05836] *** An error occurred in MPI_comm_size
[node0021:05836] *** on communicator MPI_COMM_WORLD
[node0021:05836] *** MPI_ERR_COMM: invalid communicator
[node0021:05836] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node0023:25571] *** An error occurred in MPI_comm_size
[node0027:30416] *** An error occurred in MPI_comm_size
[node0023:25571] *** on communicator MPI_COMM_WORLD
[node0023:25571] *** MPI_ERR_COMM: invalid communicator
[node0023:25571] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node0027:30416] *** on communicator MPI_COMM_WORLD
[node0027:30416] *** MPI_ERR_COMM: invalid communicator
[node0027:30416] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node0028:15920] *** An error occurred in MPI_comm_size
[node0028:15920] *** on communicator MPI_COMM_WORLD
[node0028:15920] *** MPI_ERR_COMM: invalid communicator
[node0028:15920] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------

I compiled the program on both clusters with this command ($MKL_LIB pointing to "lib/em64t" in both cases):
$ mpicc -openmp -w test.c -Wl,--start-group $MKL_INCLUDE $MKL_LIB/libmkl_cdft_core.a $MKL_LIB/libmkl_blacs_openmpi_lp64.a $MKL_LIB/libmkl_intel_lp64.a $MKL_LIB/libmkl_intel_thread.a $MKL_LIB/libmkl_core.a -Wl,--end-group -L$MKL_LIB -liomp5 -lpthread -lm -o x.fft_mpi_mkl

The program, with some extra additions to print debug information, is attached to this post.
What's wrong? Could the problem be related to Open MPI?

Thank you very much in advance,
Cheers!
Dmitry_B_Intel
Employee

Hi Filippo,
Have you checked that the compiler doesn't complain about an undeclared malloc? If it is undeclared, it is assumed to return int, possibly truncating pointers to 32 bits. This might be the cause of the failure you see.
Thanks
Dima
Filippo_Spiga
Beginner

Hi Filippo,
Have you checked that the compiler doesn't complain about an undeclared malloc? If it is undeclared, it is assumed to return int, possibly truncating pointers to 32 bits. This might be the cause of the failure you see.
Thanks
Dima

Hi Dmitry,
are you suggesting that I compile with the "-m64" flag explicitly?

I have just tried adding the "-m64" flag, but nothing changes. I think the problem may be related to Open MPI. I switched from Open MPI 1.3.2 to Open MPI 1.2.6, but the problem persists!


>$ mpicc -openmp -m64 test.c -Wl,--start-group $MKL_INCLUDE $MKL_LIB/libmkl_cdft_core.a $MKL_LIB/libmkl_blacs_openmpi_lp64.a $MKL_LIB/libmkl_intel_lp64.a $MKL_LIB/libmkl_intel_thread.a $MKL_LIB/libmkl_core.a -Wl,--end-group -L$MKL_LIB -liomp5 -lpthread -lm -o x.fft_mpi_mkl

>$ ldd x.fft_mpi_mkl
libiomp5.so => /opt/MKL/10.0.2/intel--10.1/lib/em64t/libiomp5.so (0x00002aaaaaac7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaaacc6000)
libimf.so => /opt/intel/fce/10.1.011/lib/libimf.so (0x00002aaaaaee0000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaab243000)
libmpi.so.0 => /opt/openmpi/1.2.6/intel--10.1/lib/libmpi.so.0 (0x00002aaaab4c6000)
libopen-rte.so.0 => /opt/openmpi/1.2.6/intel--10.1/lib/libopen-rte.so.0 (0x00002aaaab854000)
libopen-pal.so.0 => /opt/openmpi/1.2.6/intel--10.1/lib/libopen-pal.so.0 (0x00002aaaabb61000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaabdd5000)
librt.so.1 => /lib64/librt.so.1 (0x00002aaaabfe0000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaac1ea000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaac3ee000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaac606000)
libguide.so => /opt/MKL/10.0.2/intel--10.1/lib/em64t/libguide.so (0x00002aaaac80a000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003081c00000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaac971000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
libsvml.so => /opt/intel/fce/10.1.011/lib/libsvml.so (0x00002aaaaccc2000)
libintlc.so.5 => /opt/intel/fce/10.1.011/lib/libintlc.so.5 (0x00002aaaace49000)


[node0020:22164] *** An error occurred in MPI_comm_size
[node0020:22164] *** on communicator MPI_COMM_WORLD
[node0020:22164] *** MPI_ERR_COMM: invalid communicator
[node0020:22164] *** MPI_ERRORS_ARE_FATAL (goodbye)
[node0027:04801] *** An error occurred in MPI_comm_size
[node0027:04801] *** on communicator MPI_COMM_WORLD
[node0027:04801] *** MPI_ERR_COMM: invalid communicator
[node0027:04801] *** MPI_ERRORS_ARE_FATAL (goodbye)
[node0019:09289] *** An error occurred in MPI_comm_size
[node0019:09289] *** on communicator MPI_COMM_WORLD
[node0019:09289] *** MPI_ERR_COMM: invalid communicator
[node0019:09289] *** MPI_ERRORS_ARE_FATAL (goodbye)
[node0022:32707] *** An error occurred in MPI_comm_size
[node0022:32707] *** on communicator MPI_COMM_WORLD
[node0022:32707] *** MPI_ERR_COMM: invalid communicator
[node0022:32707] *** MPI_ERRORS_ARE_FATAL (goodbye)




Dmitry_B_Intel
Employee

Filippo,

I am sorry I didn't put it clearly. I noticed that #include <stdlib.h> is missing in your example, and so I suspect that the compiler uses the implicit declaration for malloc, which is 'int malloc()' instead of 'void *malloc(size_t)'.
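
To make the pitfall concrete, here is a hypothetical fragment (not the actual test code): on an LP64 system, pointers are 64 bits but int is only 32, so an implicitly declared malloc can silently lose the upper half of the address.

/* Without the include below, a C89 compiler assumes 'int malloc()',
   and the 64-bit address returned by malloc may be truncated to
   32 bits before the assignment - exactly the failure mode described. */
#include <stdlib.h>   /* declares void *malloc(size_t) */

int main(void)
{
    double *buf = malloc(1024 * sizeof(*buf));  /* well-defined with the prototype in scope */
    free(buf);
    return 0;
}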

Thanks
Dima
Vladimir_Petrov__Int
New Contributor III
Hi Filippo,

According to the release notes, MKL currently supports only the 1.2.x versions of Open MPI.
Do you have the possibility to try one of these older versions?

Best regards,
-Vladimir
Filippo_Spiga
Beginner
Dear all,
I have tried adding "#include <stdlib.h>" and using different versions of Open MPI (1.2.5, 1.2.6 and 1.2.7) compiled with icc 10.1. The problem remains the same as above. Only on one of the two clusters do I have the possibility to modify or change the Open MPI version.

In case it is useful, Open MPI 1.2.7 was configured with these flags...

export CC=icc
export CXX=icc
export F77=ifort
export F90=ifort
export FC=$F90
export CFLAGS="-O2"
export CXXFLAGS="$CFLAGS -lstdc++"
export FFLAGS="-O2"
export FCFLAGS="-O2"
export LDFLAGS="-O2"
export F77FLAGS="-02"

./configure --prefix=... --disable-ipv6 --enable-static --with-openib=/usr/local/ofed --with-openib-libdir=/usr/local/ofed/lib64 --with-mpi-f90-size=medium --with-io-romio-flags="--with-filesystems=ufs" --enable-mpi-threads --enable-cxx-exceptions
Vladimir_Petrov__Int
New Contributor III
Hi Filippo,

While I am trying to reproduce your issue, could you please check the following two potential problems with how your test is compiled:
1. The MKL_INCLUDE directory does not seem to be placed correctly on the compile-link line - please prepend it with "-I" and place it outside the group clauses (see the sketch below);
2. ldd reports dependencies on two Intel threading libraries - both libguide.so and libiomp5.so. I would strongly recommend that you use exactly one: the one that was used to build Open MPI.
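
For instance, with the include option moved out of the group, the link line would look like this (a sketch; $MKL_INCLUDE is assumed to already expand to the -I flag, as Filippo confirms below):

$ mpicc -openmp -w $MKL_INCLUDE test.c -Wl,--start-group $MKL_LIB/libmkl_cdft_core.a $MKL_LIB/libmkl_blacs_openmpi_lp64.a $MKL_LIB/libmkl_intel_lp64.a $MKL_LIB/libmkl_intel_thread.a $MKL_LIB/libmkl_core.a -Wl,--end-group -L$MKL_LIB -liomp5 -lpthread -lm -o x.fft_mpi_mkl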

Best regards,
-Vladimir
Filippo_Spiga
Beginner
1. The MKL_INCLUDE directory does not seem to be placed correctly on the compile-link line - please prepend it with "-I" and place it outside the group clauses;
2. ldd reports dependencies on two Intel threading libraries - both libguide.so and libiomp5.so. I would strongly recommend that you use exactly one: the one that was used to build Open MPI.

1. The MKL_INCLUDE environment variable already includes "-I":
$ echo $MKL_INCLUDE
-I/opt/MKL/10.0.2/intel--10.1/include
$ ls -1 /opt/MKL/10.0.2/intel--10.1/include
fftw
i_malloc.h
mkl_blas.f90
mkl_blas.fi
mkl_blas.h
mkl_cblas.h
[...and so on...]

2. I have tried "-lguide" instead of "-liomp5" ...

$ ldd x.fft_mpi_mkl
libguide.so => /opt/MKL/10.0.2/intel--10.1/lib/em64t/libguide.so (0x00002aaaaaac7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaaacc6000)
libmpi.so.0 => /opt/openmpi/1.2.7/intel--10.1/lib/libmpi.so.0 (0x00002aaaaaee0000)
libopen-rte.so.0 => /opt/openmpi/1.2.7/intel--10.1/lib/libopen-rte.so.0 (0x00002aaaab285000)
libopen-pal.so.0 => /opt/openmpi/1.2.7/intel--10.1/lib/libopen-pal.so.0 (0x00002aaaab59a000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaab80f000)
librt.so.1 => /lib64/librt.so.1 (0x00002aaaaba1b000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaabc24000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaabe28000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaac041000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaac244000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003081c00000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaac4c8000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
libimf.so => /opt/intel/fce/10.1.011/lib/libimf.so (0x00002aaaac818000)
libsvml.so => /opt/intel/fce/10.1.011/lib/libsvml.so (0x00002aaaacb7a000)
libintlc.so.5 => /opt/intel/fce/10.1.011/lib/libintlc.so.5 (0x00002aaaacd02000)

and also with Open MPI 1.2.5

[cin8310a@node1310 test_mkl_fft]$ ldd x.fft_mpi_mkl
libguide.so => /opt/MKL/10.0.2/intel--10.1/lib/em64t/libguide.so (0x00002aaaaaac7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaaacc6000)
libimf.so => /opt/intel/fce/10.1.011/lib/libimf.so (0x00002aaaaaee0000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaab243000)
libmpi.so.0 => /opt/openmpi/1.2.5/intel/10.1/lib/libmpi.so.0 (0x00002aaaab4c6000)
libopen-rte.so.0 => /opt/openmpi/1.2.5/intel/10.1/lib/libopen-rte.so.0 (0x00002aaaab854000)
libopen-pal.so.0 => /opt/openmpi/1.2.5/intel/10.1/lib/libopen-pal.so.0 (0x00002aaaabb61000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaabdd5000)
librt.so.1 => /lib64/librt.so.1 (0x00002aaaabfe0000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaac1ea000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaac3ee000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaac606000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003081c00000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaac80a000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
libsvml.so => /opt/intel/fce/10.1.011/lib/libsvml.so (0x00002aaaacb5a000)
libintlc.so.5 => /opt/intel/fce/10.1.011/lib/libintlc.so.5 (0x00002aaaacce2000)

No changes :-(

Thank you for the support
Vladimir_Petrov__Int
New Contributor III
Hi Filippo,

I managed to reproduce your issue on my local machine. It looks like it is caused by an incompatibility between MKL BLACS and the options you used to build Open MPI.

BTW, is F77FLAGS really set to "-02" (where "0" is zero)?

Best regards,
-Vladimir
Filippo_Spiga
Beginner
BTW, is F77FLAGS really set to "-02" (where "0" is zero)?

Yes. If you can suggest the flags needed for full compatibility with MKL, I will recompile Open MPI and run further tests.

Thanks a lot!
Vladimir_Petrov__Int
New Contributor III
Hi Filippo,

I finally figured out what the problem is.
Open MPI defines MPI_Comm as a pointer, so the communicator is 64 bits wide, whereas Cluster FFT was designed at a time when sizeof(MPI_Comm) was 32 bits. To work correctly with Open MPI, you just need to wrap the communicator with MPI_Comm_c2f, which converts it to its 32-bit Fortran handle, as follows:

DftiCreateDescriptorDM(MPI_Comm_c2f(MPI_COMM_WORLD),&desc,DFTI_DOUBLE,DFTI_COMPLEX,1,len);
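
Equivalently, with the conversion made explicit (a sketch; desc and len as in the original example):

/* MPI_Comm_c2f returns the communicator's 32-bit Fortran handle
   (MPI_Fint), which matches the width Cluster FFT was built for. */
MPI_Fint fcomm = MPI_Comm_c2f(MPI_COMM_WORLD);
DftiCreateDescriptorDM(fcomm, &desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, len);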

Best regards,
-Vladimir

P.S. I hope you will agree that late is better than never.
Filippo_Spiga
Beginner
Open MPI defines MPI_Comm as a pointer, so the communicator is 64 bits wide, whereas Cluster FFT was designed at a time when sizeof(MPI_Comm) was 32 bits. To work correctly with Open MPI, you just need to wrap the communicator with MPI_Comm_c2f, which converts it to its 32-bit Fortran handle, as follows:

DftiCreateDescriptorDM(MPI_Comm_c2f(MPI_COMM_WORLD),&desc,DFTI_DOUBLE,DFTI_COMPLEX,1,len);

Great, it works!


P.S. I hope you will agree that late is better than never.

I agree with you (-: Thank you very much again for your support!

Regards