Crash while calling pdgemr2d for a large matrix

Yonghyun_Chung · ‎06-30-2017

Hi,

I recently parallelized my Fortran 90 code using Intel MKL ScaLAPACK.

With relatively smaller matrices it worked fine,

however when I ran my code for a double-precision real matrix of size 40656 by 40656, it crashed.

The error message is as follows:

MKL_SCALAPACK_ALLOCATE in mr2d_malloc.c is unsuccessful, size = 13223282688

I have tried to resolve this problem or to find a workaround for a week, but I couldn't.

Any help or comments would be greatly appreciated.

Gennady_F_Intel · ‎06-30-2017

Hi, have you linked with the ILP64 or the LP64 libraries?

mecej4 · ‎06-30-2017

To store a dense matrix of size 40656*40656, you need 40656*40656*8 bytes of memory just for the matrix. That is, ~~132~~ 13.2 GB. How much memory (RAM) does your system have?

P.S. Corrected, thanks to Gennady F.
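
The arithmetic above can be checked quickly. The sketch below uses Python purely as a calculator (it is not part of the poster's Fortran code). It shows that a full 40656 x 40656 double-precision matrix takes exactly the number of bytes reported in the MKL_SCALAPACK_ALLOCATE message, which suggests the failing allocation was for a buffer the size of the whole matrix.

```python
# Memory footprint of a dense double-precision matrix of order n.
n = 40656                 # matrix order from the original post
bytes_per_element = 8     # double-precision real

total_bytes = n * n * bytes_per_element

print(total_bytes)                       # compare with "size = ..." in the error message
print(f"{total_bytes / 1e9:.1f} GB")     # decimal gigabytes
print(f"{total_bytes / 2**30:.1f} GiB")  # binary gibibytes
```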

Gennady_F_Intel · ‎06-30-2017

No, no, this is ~13.2 GB, which is an ordinary amount of RAM.

Yonghyun_Chung · ‎07-02-2017

Hi Gennady,

I linked with the LP64 libraries. I would expect only the number of rows and columns of the array (plus its type) to be needed, but if the entire size of the array is passed as an argument, a 4-byte integer is probably not enough. I'm afraid I'm new to the ILP64 interface.
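
To make the 4-byte concern concrete, here is a small Python check (a sketch for illustration only; the thread does not show where inside MKL a 4-byte integer would be used). For this matrix the element count still fits in a signed 32-bit integer, but the size in bytes does not.

```python
INT32_MAX = 2**31 - 1      # largest value a default 4-byte INTEGER can hold

n = 40656                  # matrix order from the original post
elements = n * n           # number of matrix entries
size_in_bytes = elements * 8   # double precision: 8 bytes per entry

elements_fit = elements <= INT32_MAX
bytes_fit = size_in_bytes <= INT32_MAX

print(elements, elements_fit)
print(size_in_bytes, bytes_fit)
```

So any quantity computed in bytes for the whole matrix would need an 8-byte integer, even though the row and column counts themselves are small.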

Hi mecej4,

The size of the array is ~13.2 GB, as shown in the error message I quoted in my first post. I am using a 25-node Linux cluster; each node has 10-core processors sharing 128 GB of memory.

Yonghyun_Chung · ‎07-02-2017

The link line in my makefile is as follows:

$(MKLROOT)/lib/intel64/libmkl_scalapack_lp64.a -Wl,--start-group \
$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a \
$(MKLROOT)/lib/intel64/libmkl_intel_thread.a \
$(MKLROOT)/lib/intel64/libmkl_core.a \
$(MKLROOT)/lib/intel64/libmkl_blacs_intelmpi_lp64.a \
-Wl,--end-group  -openmp -lpthread

Yonghyun_Chung · ‎07-04-2017

Hello all,

All of the integers in my source code are declared as INTEGER without a kind designation, so it seems possible to compile with the -i8 option and link with the ILP64 libraries.

However, in that case every integer becomes 8 bytes. Most of my code has no need for 8-byte integers, so this raises a question of efficiency. Is there a way to fix the problem with minimal changes to code that uses the LP64 libraries?

I also still wonder: for an array of only this size, do I really need to consider ILP64?

Best wishes,

Yonghyun

Yonghyun_Chung · ‎07-05-2017

Hello Gennady,

I have attached a test code. With the matrix size set to 100 x 100, there is no problem with the LP64 libraries. Setting the matrix size to 40656 x 40656 causes an error. In this case, the MKL link line is the same as in the post above, and the code was compiled with the -i8 option and the ILP64 libraries.

I would be grateful if you could review the test code.

Cordially, Yonghyun.

Gennady_F_Intel · ‎07-09-2017

I checked how your example works with the latest MKL (2017 u3). My only changes were to print the problem sizes after the example passed and to print the version of MKL used.

mpiifort -i8 scalapack.f90 -I/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/include \
/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64/libmkl_scalapack_ilp64.a \
-Wl,--start-group \
/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64/libmkl_intel_ilp64.a \
/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64/libmkl_intel_thread.a \
/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64/libmkl_core.a \
/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_ilp64.a \
-Wl,--end-group \
-liomp5 -lpthread -lm -ldl

mpirun -n 1 ./a.out

1.00000000000000
... test passed...
ndof1 == 40656
ndof2 == 40656
Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications

Yonghyun_Chung · ‎07-10-2017

Hi. I appreciate your feedback.

You seem to have succeeded without errors with ndof = 40656. But I didn't.

While compiling the code, I got the following warnings:

/engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/bin/mpif90  -c -i8 scalapack.f90
/engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/bin/mpif90 -o scalapack  scalapack.o -I~/intel/compilers_and_libraries_2017.4.196/linux/mkl//include ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_intel_ilp64.a ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_intel_thread.a ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_core.a ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl   
ld: Warning: size of symbol `mpifcmb1_' changed from 40 in scalapack.o to 20 in /engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/lib/libmpich.a(setbot.o)
ld: Warning: size of symbol `mpifcmb2_' changed from 40 in scalapack.o to 20 in /engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/lib/libmpich.a(setbot.o)
ld: Warning: size of symbol `mpifcmb3_' changed from 8 in scalapack.o to 4 in /engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/lib/libmpich.a(setbot.o)
ld: Warning: size of symbol `mpifcmb4_' changed from 8 in scalapack.o to 4 in /engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/lib/libmpich.a(setbot.o)
ld: Warning: size of symbol `mpifcmb5_' changed from 8 in scalapack.o to 4 in /engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/lib/libmpich.a(setbot.o)
ld: Warning: size of symbol `mpifcmb6_' changed from 8 in scalapack.o to 4 in /engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/lib/libmpich.a(setbot.o)
make: warning:  Clock skew detected.  Your build may be incomplete.

The following errors then occurred during the test run:

 Error INIT!
 Error SIZE!
 Error RANK!
   1.00000000000000     
 Error!

Konstantin_A_Intel · ‎07-12-2017

Hi Yonghyun,

You're trying to link the ILP64 MKL libraries with MPICH, which was most likely not compiled with the ILP64 flag. That's why you see the warnings (the symbol sizes differ by exactly a factor of 2). Another issue I see is that you pick the BLACS libraries built for Intel MPI but, again, link with MPICH. So there are a few incompatibility issues.
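
The factor of 2 can be read directly off the linker output. The following Python sketch (an illustration, not a tool from the thread) parses three of the warning lines quoted earlier and computes the ratio of the symbol size in scalapack.o (compiled with -i8) to the size in libmpich.a. The regular expression is an assumption about the message layout, based only on the lines shown in this thread.

```python
import re

LIB = "/engrid/enhpc/mpich2-1.3.2-hydra-p1-no-opt/intel/v11.1.046/lib/libmpich.a(setbot.o)"

# Three of the ld warnings from the post above, reproduced as strings.
warnings = [
    f"ld: Warning: size of symbol `mpifcmb1_' changed from 40 in scalapack.o to 20 in {LIB}",
    f"ld: Warning: size of symbol `mpifcmb3_' changed from 8 in scalapack.o to 4 in {LIB}",
    f"ld: Warning: size of symbol `mpifcmb6_' changed from 8 in scalapack.o to 4 in {LIB}",
]

pattern = re.compile(r"size of symbol `(\w+)' changed from (\d+) in \S+ to (\d+) in")

ratios = {}
for line in warnings:
    match = pattern.search(line)
    if match:
        name = match.group(1)
        size_in_object = int(match.group(2))   # size in scalapack.o (built with -i8)
        size_in_library = int(match.group(3))  # size in libmpich.a
        ratios[name] = size_in_object / size_in_library

for name, ratio in ratios.items():
    print(name, ratio)
```

A uniform ratio of 2 is what one would expect if the same common blocks hold 8-byte integers on one side and 4-byte integers on the other.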

So, my recommendation is to use Intel MPI, because if you add -i8 to mpiifort, it will link with the ILP64 version of Intel MPI (this flag probably also works for MPICH; in that case you need to add it to the link line, not only to the compile line):

$ ldd a.out

libmpi_ilp64.so.4 -> ...

One more note. IMO, when you call pdgemr2d, there's no need to set the context to -1 on some processes. It's enough to create the descriptors uniformly on every process; on some processes the actual amount of data will simply be 0, given the grid parameters (nb etc.). That's enough for any kind of redistribution:

Global matrix:

  lm=numroc(sizeN,nb,myrow,0,nprow)
  call descinit(desc_Jmatg,sizeN,sizeN,nb,nb,0,0,ictxt,lm,info)

Local matrix (please note that nb=sizeN):

  lm=numroc(sizeN,sizeN,myrow,0,nprow)
  call descinit(desc_Jmat,sizeN,sizeN,sizeN,sizeN,0,0,ictxt,lm,info)

Redistribute from local to global:

call pdgemr2d(sizeN,sizeN,Jmat,1,1,desc_Jmatg,Jmat_loc,1,1,desc_Jmat,ictxt)
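
To see why the nb=sizeN descriptor puts the whole matrix on one process, it helps to evaluate numroc by hand. The sketch below is a Python transcription of the NUMROC logic as I understand it from the reference ScaLAPACK sources, so treat it as an illustration and not as MKL's implementation. The process-row count (4) and the distributed block size (64) are assumed values chosen for the example; they do not come from the thread.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows (or columns) of an n-long dimension owned by process
    iproc in a block-cyclic distribution with block size nb that starts
    at process isrcproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs   # distance from the source process
    nblocks = n // nb                               # number of whole blocks
    count = (nblocks // nprocs) * nb                # blocks every process receives
    extrablks = nblocks % nprocs                    # leftover whole blocks
    if mydist < extrablks:
        count += nb                                 # this process gets one extra block
    elif mydist == extrablks:
        count += n % nb                             # this process gets the partial block
    return count

size_n = 40656   # matrix order from the thread
nprow = 4        # assumed number of process rows
nb = 64          # assumed block size for the distributed matrix

distributed = [numroc(size_n, nb, row, 0, nprow) for row in range(nprow)]
single_block = [numroc(size_n, size_n, row, 0, nprow) for row in range(nprow)]

print(distributed)    # rows per process row with block size nb
print(single_block)   # rows per process row with block size size_n
```

With block size equal to the matrix order there is only one block, so process row 0 owns every row and the others own none. The descinit documentation requires lld >= max(1, local rows), so on processes where numroc returns 0 it is common to pass max(1,lm) as the leading dimension.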

Yonghyun_Chung · ‎07-13-2017

Hi Konstantin,

I am very grateful that you took such a detailed look at my problem. I'll gladly try to resolve it according to your suggestions. It feels like I've been handed a bunch of weapons on the battlefield. I may come back with more questions, because I don't fully understand everything yet.

Thank you again.

Yonghyun_Chung · ‎07-17-2017

Hi Konstantin,

After installing Intel MPI and trying to compile my code with it, I ran into another error, shown below. I have tried quite hard to find the cause, but after all my failed attempts I think it would be better to ask the experts for help.

mpiifort -i8 scalapack.f90 -I~/intel/compilers_and_libraries_2017.4.196/linux/mkl//include ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_intel_ilp64.a ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_intel_thread.a ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_core.a ~/intel/compilers_and_libraries_2017.4.196/linux/mkl//lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

/home/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib/release_mt/libmpi.so: undefined reference to `__isoc99_fscanf@GLIBC_2.7'
/home/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib/release_mt/libmpi.so: undefined reference to `sched_getcpu@GLIBC_2.6'
/home/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib/release_mt/libmpi.so: undefined reference to `__isoc99_sscanf@GLIBC_2.7'

Yonghyun_Chung · ‎07-20-2017

Hi Konstantin,

I tried to follow your last note about the better use of pdgemr2d, but it was very difficult for me to understand. Would you mind describing it in more detail or, if possible, showing the full code that contains the changes? I think it would help not only me but also other ScaLAPACK users who want to learn more about pdgemr2d, one of the essential ScaLAPACK routines. :)

Sincerely, Yonghyun