Fermin_L_ · ‎10-04-2015

I have installed a package (WIEN2k) on a cluster, and it requires MKL. The sequential version of the package runs properly. However, the MPI version fails with the following run-time error:

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM

I used the following options for compilation:

FC = ifort
MPF = mpiifort
CC = icc
FOPT = -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback -assume buffered_io
FPOPT = -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback -assume buffered_io -DFFTW3 -I/usr/local/include
DParallel = '-DParallel'
FGEN = $(PARALLEL)
LDFLAGS = $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
R_LIBS = -mkl=parallel -openmp -lpthread
C_LIBS = $(R_LIBS)
RP_LIBS = -lfftw3_mpi -lfftw3 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 $(R_LIBS)
CP_LIBS = $(RP_LIBS)
DESTDIR = ./

The versions of the compilers are

ifort version 15.0.3

mpiifort for the Intel(R) MPI Library 5.0 Update 3 for Linux*
Copyright(C) 2003-2015, Intel Corporation. All rights reserved.
ifort version 15.0.3

and $LD_LIBRARY_PATH :

/usr/local/intel//impi/5.0.3.048/intel64/lib:/usr/local/intel/composer_xe_2015.3.187/compiler/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/mkl/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/compiler/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/mpirt/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/ipp/../compiler/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/ipp/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/ipp/tools/intel64/perfsys:/usr/local/intel/composer_xe_2015.3.187/compiler/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/mkl/lib/intel64:/usr/local/intel/composer_xe_2015.3.187/tbb/lib/intel64/gcc4.4:/usr/local/intel/composer_xe_2015.3.187/debugger/libipt/intel64/lib:/usr/local/lib:/opt/gridengine/lib/linux-x64:/usr/mpi/gcc/mvapich2-1.9/lib:/usr/mpi/gcc/openmpi-1.6.5/lib64

What could be wrong in these settings?

mecej4 · ‎10-05-2015

What was MKL_TARGET_ARCH defined as when you ran the build script? Specifically, are you using the ILP64 model or the LP64 model?

Fermin_L_ · ‎10-05-2015

I am not quite sure, as the libraries were installed by the administrator. Is there any way I can check it?

As far as I remember, the package automatically detected MKL_TARGET_ARCH as intel64. So it should be LP64, right?

TimP · ‎10-06-2015

LP64 uses 32-bit integers for MKL integer parameters. ILP64 uses 64-bit (long long) integers.
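To make the difference concrete, here is a minimal sketch of what an integer-width mismatch can do to an argument that is passed by reference, as Fortran does. It only reinterprets bytes with Python's `struct` module; the helper names and the "neighbouring" value are hypothetical, little-endian byte order is assumed, and nothing here calls MKL.

```python
import struct


def read_as_int64(first_int32, neighbour_int32):
    """Simulate an ILP64 library reading 8 bytes where the caller stored a
    4-byte integer followed by whatever happens to sit next to it in memory
    (the neighbour value is a made-up illustration)."""
    raw = struct.pack("<ii", first_int32, neighbour_int32)
    (value,) = struct.unpack("<q", raw)
    return value


def read_as_int32(int64_value):
    """Simulate an LP64 library reading only the low 4 bytes of an 8-byte
    integer stored by an ILP64 caller (little-endian assumed)."""
    raw = struct.pack("<q", int64_value)
    low, _high = struct.unpack("<ii", raw)
    return low


if __name__ == "__main__":
    # A 4-byte 39 followed by zero bytes still reads as 39.
    print(read_as_int64(39, 0))
    # A 4-byte 39 followed by a non-zero neighbour reads as a huge value.
    print(read_as_int64(39, 25))
    # The low half of an 8-byte 39 is still 39.
    print(read_as_int32(39))
```

The point of the sketch is that a mismatch does not always fail loudly: depending on the surrounding memory, the library may see the intended value or a wildly different one.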

Roman_D_Intel1 · ‎10-06-2015

The error you see indicates that an invalid value of LDA (the eighth argument) was passed to DGEMM. This actually looks like a ScaLAPACK or an application issue. Can you please run it with MKL_VERBOSE=1 and paste here the part of the output just before the error?

Warning: be prepared for a large output. It probably should be grepped for DGEMM to reduce the size:

mpirun [usual mpirun and WIEN2k arguments] | grep DGEMM > mkl_verbose.txt
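If the grepped file is still too long to read by eye, the dimension arguments can be pulled out with a short script. This is only a sketch: it assumes the verbose line lists the thirteen DGEMM arguments in call order (TRANSA, TRANSB, M, N, K, alpha, A, LDA, B, LDB, beta, C, LDC), which matches the output posted later in this thread, and the function names are my own.

```python
import re

# Matches the argument list of a DGEMM record anywhere in the line, so a
# rank prefix added by the MPI launcher (if any) does not break parsing.
VERBOSE_DGEMM = re.compile(r"MKL_VERBOSE DGEMM\(([^)]*)\)")


def parse_dgemm_line(line):
    """Return the DGEMM flags and integer dimensions from one MKL_VERBOSE
    line, or None if the line is not a DGEMM record."""
    match = VERBOSE_DGEMM.search(line)
    if match is None:
        return None
    fields = match.group(1).split(",")
    if len(fields) != 13:
        return None
    transa, transb, m, n, k, _alpha, _a, lda, _b, ldb, _beta, _c, ldc = fields
    return {
        "transa": transa, "transb": transb,
        "m": int(m), "n": int(n), "k": int(k),
        "lda": int(lda), "ldb": int(ldb), "ldc": int(ldc),
    }


def dgemm_calls(lines):
    """Yield one parsed record per DGEMM line, skipping everything else."""
    for line in lines:
        record = parse_dgemm_line(line)
        if record is not None:
            yield record


if __name__ == "__main__":
    with open("mkl_verbose.txt") as handle:
        for call in dgemm_calls(handle):
            print(call)
```

The "Intel MKL ERROR" lines also contain the word DGEMM, so they pass through grep; the script simply skips them.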

Another option is to run the MPI job under GDB and set a breakpoint at mkl_serv_xerbla so that you can obtain a stack trace leading to the faulty function.

If this does not work for you, yet another option is to set a custom XERBLA function and call abort() from it, producing core file(s). For this to work, you need to execute 'ulimit -c unlimited' (bash) or 'limit coredumpsize unlimited' (csh) in your shell. The core files can later be inspected using gdb: 'gdb -c core', after which you can examine threads, produce backtraces, etc., as usual.

For the last two options to work best, it would be preferable to compile WIEN2k with debug information so that the stack traces are more informative.

Fermin_L_ · ‎10-07-2015

Thanks for all the replies.

Here is part of the output for the run with MKL_VERBOSE=1:

MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a40,0x131eb10,1,0x131f400,39,0x551a40,0x131e6f0,1) 2.48us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a48,0x131ecd0,1,0x1323310,39,0x551a40,0x131e6f0,1) 201ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a40,0x131eda0,1,0x1325190,39,0x551a40,0x131e6f0,1) 160ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a48,0x131ee70,1,0x1327010,39,0x551a40,0x131e6f0,1) 155ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a40,0x131ef40,1,0x1328e90,39,0x551a40,0x131e6f0,1) 163ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a48,0x131f1a0,1,0x132ea00,39,0x551a40,0x131e6f0,1) 132ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a40,0x131f008,1,0x132ad08,39,0x551a40,0x131e6f0,1) 146ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,0,39,25,0x551a48,0x131f268,1,0x1330878,39,0x551a40,0x131e6f0,1) 144ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a40,0x25921d0,39,0x25a52b0,39,0x551a40,0x254bbb0,39) 15.59ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a48,0x2594050,39,0x25a7130,39,0x551a40,0x254bbb0,39) 60.10us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a40,0x2595ed0,39,0x25a8fb0,39,0x551a40,0x254bbb0,39) 45.61us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a48,0x2597d50,39,0x2551c10,39,0x551a40,0x254bbb0,39) 96.09us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a40,0x2599bd0,39,0x2553a90,39,0x551a40,0x254bbb0,39) 47.34us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a48,0x259f740,39,0x2559600,39,0x551a40,0x254bbb0,39) 71.87us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a40,0x259ba48,39,0x2555908,39,0x551a40,0x254bbb0,39) 63.82us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,39,39,25,0x551a48,0x25a15b8,39,0x255b478,39,0x551a40,0x254bbb0,39) 70.71us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a40,0x250e8d0,1,0x250f160,1,0x551a40,0x24cff30,1) 7.29us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a48,0x250ea30,1,0x250f230,1,0x551a40,0x24cff30,1) 414ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a40,0x250eb00,1,0x250f300,1,0x551a40,0x24cff30,1) 148ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a48,0x250ebd0,1,0x250f3d0,1,0x551a40,0x24cff30,1) 252ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a40,0x250eca0,1,0x250f4a0,1,0x551a40,0x24cff30,1) 143ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a48,0x250ef00,1,0x250f700,1,0x551a40,0x24cff30,1) 142ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a40,0x250ed68,1,0x250f568,1,0x551a40,0x24cff30,1) 144ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,38856832,1,25,0x551a48,0x250efc8,1,0x250f7c8,1,0x551a40,0x24cff30,1) 147ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEMM(N,T,1826158216,1,25,0x551a40,0xd73ab0,39,0xd86b90,1,0x551a40,0xd736f0,39) 829ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,1826158216,1,25,0x551a48,0xd75930,39,0xd86c60,1,0x551a40,0xd736f0,39) 274ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,1826158216,1,25,0x551a40,0xd777b0,39,0xd86d30,1,0x551a40,0xd736f0,39) 143ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,1826158216,1,25,0x551a48,0xd79630,39,0xd86e00,1,0x551a40,0xd736f0,39) 146ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,1826158216,1,25,0x551a40,0xd7b4b0,39,0xd86ed0,1,0x551a40,0xd736f0,39) 143ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
MKL_VERBOSE DGEMM(N,T,1826158216,1,25,0x551a48,0xd81020,39,0xd87130,1,0x551a40,0xd736f0,39) 145ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .

Roman_D_Intel1 · ‎10-07-2015

Good. While this confirms that DGEMM parameters are indeed invalid (for example: in the last line M=1826158216 but LDA=39), it does not explain why that happened. I doubt it is an LP/ILP64 mismatch since I'd expect a segfault in this case, and the M value fits into a 4-byte integer.
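For reference, the argument checks that lead to "Parameter 8" can be sketched as follows. The rules are those of the reference BLAS DGEMM (LDA must be at least max(1, M) when TRANSA is 'N'); that MKL's own checker applies the same rules is an assumption, although it is consistent with the output above. The function name is hypothetical.

```python
def dgemm_first_bad_param(transa, transb, m, n, k, lda, ldb, ldc):
    """Return the 1-based position of the first invalid DGEMM argument
    according to the reference BLAS rules, or 0 if all checks pass."""
    ta = transa.upper()
    tb = transb.upper()
    # Row counts of A and B as stored, which bound the leading dimensions.
    nrowa = m if ta == "N" else k
    nrowb = k if tb == "N" else n
    if ta not in ("N", "T", "C"):
        return 1
    if tb not in ("N", "T", "C"):
        return 2
    if m < 0:
        return 3
    if n < 0:
        return 4
    if k < 0:
        return 5
    if lda < max(1, nrowa):
        return 8
    if ldb < max(1, nrowb):
        return 10
    if ldc < max(1, m):
        return 13
    return 0


if __name__ == "__main__":
    # Dimensions taken from the verbose output in this thread.
    print(dgemm_first_bad_param("N", "T", 39, 39, 25, 39, 39, 39))
    print(dgemm_first_bad_param("N", "T", 1826158216, 1, 25, 39, 1, 39))
```

With TRANSA = 'N', the first failing test for M=1826158216 and LDA=39 is the LDA check, which is why the message names parameter 8 even though the value that looks wrong is M.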

Since you're using Intel MPI, it should be straightforward to run WIEN2k under gdb from mpirun: link it to MKL statically [1], and pass the '-gdb' parameter when running the application in interactive mode. This would bring you to a parallel gdb prompt similar to this:

mpigdb: np = 2
mpigdb: attaching to 35491 ./a.out <hostname>
mpigdb: attaching to 35492 ./a.out <hostname>
[0,1] (mpigdb)

Then you need to set a breakpoint in mkl_serv_default_xerbla, the function that is called to report invalid parameters, and run the application:

[0,1] (mpigdb) br mkl_serv_default_xerbla
[1]     Breakpoint 1 at 0x409420
[0]     Breakpoint 1 at 0x409420
[0,1] (mpigdb) r
[0,1]   Continuing.

When it stops, ask gdb to display a backtrace and post the result here:

[0,1]   Breakpoint 1, 0x0000000000409420 in mkl_serv_default_xerbla ()
[0,1] (mpigdb) bt
[0,1]   #0  0x0000000000409420 in mkl_serv_default_xerbla ()
[0,1]   #1  0x0000000000408eec in mkl_blas_errchk_dgemm ()
[0,1]   #2  0x0000000000408888 in dgemm_ ()
[0]     #3  0x000000000040872c in main (argc=1, argv=0x7fff236e24b8) at g.c:12
[1]     #3  0x000000000040872c in main (argc=1, argv=0x7fffc5176918) at g.c:12

The output above comes from a very simple test I wrote that just calls DGEMM with invalid parameters. The backtrace from WIEN2k will hopefully be more informative.

[1] For some reason I had trouble setting breakpoints in dynamic libraries from gdb running under MPI.