Intel® oneAPI Math Kernel Library

Calling dgetrf_ before fork() causes HANG with MKL_CBWR=COMPATIBLE

EricChamberland

Hi,

I have a minimal working example (MWE) that reproduces the bug. I reproduced it with the following setup:

export MKL_CBWR=COMPATIBLE

export MKL_VERBOSE=1

./foo

Output:

Foo is called
MKL_VERBOSE Intel(R) MKL 2018.0 Update 1 Product build 20171007 for Intel(R) 64 architecture Intel(R) Architecture processors, Lnx 3.20GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE DGETRF(27,27,0x1f34280,27,0x7ffd36f28c20,27) 22.90ms CNR:COMPATIBLE Dyn:1 FastMM:1 TID:0  NThr:6 WDiv:HOST:+0.000
Calling MPI_Init:


foo compiled with:

mpicxx -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lmkl_blacs_intelmpi_lp64 -liomp5 -ldl -lpthread -o foo foo.cc
 

mpicxx tested: OpenMPI {2.1.2,3.0.1}

It works on another computer with Intel MKL 2015 installed:

./foo  

Foo is called
MKL_VERBOSE Intel(R) MKL 11.2 Update 2 Product build 20150120 for Intel(R) 64 architecture Intel(R) Architecture processors, Lnx 2.40GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE DGETRF(27,27,0xdc7c20,27,0x7ffc8b6f1870,27) 8.38ms CNR:COMPATIBLE Dyn:1 FastMM:1 TID:0  NThr:12 WDiv:HOST:+0.000
Calling MPI_Init:
MPI_Init done...
Foo is called
MKL_VERBOSE DGETRF(27,27,0xf15610,27,0x7ffc8b6f18e0,27) 127.27us CNR:COMPATIBLE Dyn:1 FastMM:1 TID:0  NThr:12 WDiv:HOST:+0.000

 

It also works if I set MKL_NUM_THREADS=1

Thanks,

Eric

 

Gennady_F_Intel

Eric, could you please check this case with MKL 2018 Update 2, which we released two weeks ago?

EricChamberland

Hi M. Gennady,

we installed the latest update this morning and I tested it:

Foo is called 

MKL_VERBOSE Intel(R) MKL 2018.0 Update 2 Product build 20180127 for Intel(R) 64 architecture Intel(R) Architecture processors, Lnx 3.20GHz lp64 intel_thread
MKL_VERBOSE DGETRF(27,27,0x1bf4280,27,0x7ffff1fce520,27) 17.55ms CNR:COMPATIBLE Dyn:1 FastMM:1 TID:0  NThr:6
Calling MPI_Init:

It hangs at the same place. It was suggested that I change mkl_blacs_intelmpi_lp64 to mkl_blacs_openmpi_lp64, but that changed nothing. I also changed -lmkl_intel_thread to -lmkl_gnu_thread, but I still have the same problem.

Here is the backtrace when it hangs:

(gdb) bt
#0  0x00007fffef681e47 in sched_yield () from /lib64/libc.so.6
#1  0x00007ffff0a5fe74 in _INTERNAL_26_______src_z_Linux_util_cpp_d7ee2e5e::__kmp_atfork_prepare () at ../../src/z_Linux_util.cpp:1534
#2  0x00007fffef66852d in fork () from /lib64/libc.so.6
#3  0x00007fffea07c842 in rte_init.part () from /opt/openmpi-3.0.1/lib/openmpi/mca_ess_singleton.so
#4  0x00007fffef30f8c6 in orte_init () from /opt/openmpi-3.0.1/lib/libopen-rte.so.40
#5  0x00007ffff028f4ec in ompi_mpi_init () from /opt/openmpi-3.0.1/lib/libmpi.so.40
#6  0x00007ffff02b7bdb in PMPI_Init () from /opt/openmpi-3.0.1/lib/libmpi.so.40
#7  0x0000000000400851 in main ()

I also followed a suggestion from a reply to the OpenMPI issue I opened:

https://github.com/open-mpi/ompi/issues/5070#issuecomment-381572059

But as reported there, it did not fix anything...

Thanks,

Eric

 

EricChamberland

Hi,

Gilles Gouaillardet from OpenMPI narrowed the problem down to a simple call to "fork" that triggers the bug, independently of OpenMPI itself. Please try his reproducer:

cat a.cpp:

#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <mkl.h>

// LU-factorize a small 27x27 matrix with MKL's dgetrf_.
int foo() {
   printf("Foo is called\n");
   const int lN = 27;
   double* lA = (double*)malloc(lN*lN*sizeof(double));
   for (int i = 0; i < lN*lN; ++i) {
     lA[i] = i;
   }
   int lPiv[lN];
   int lRes;
   dgetrf_(&lN, &lN, lA, &lN, lPiv, &lRes);
   free(lA);
   return lRes;
}

int main(int pArgc, char* pArgv[])
{
   foo();
   printf("Forking:\n");
   // With MKL_CBWR=COMPATIBLE, the program hangs inside this fork() call.
   pid_t pid = fork();
   if (0 == pid) {
       exit(0);
   }
   printf("Forked...\n");
   foo();
   return 0;
}

compiled and launched with:

g++ a.cpp -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -liomp5
$ MKL_CBWR=COMPATIBLE ./a.out 

(extracted from Gilles's reply: https://github.com/open-mpi/ompi/issues/5070#issuecomment-382224431 )
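
Since the failure is a hang and not a crash, it can help to run the reproducer under a time limit so that a stuck run is reported instead of blocking the terminal. The following is only a sketch, assuming GNU coreutils `timeout` is installed; the helper name `run_with_hang_check` and the time limit are illustrative and do not come from the thread.

```shell
# Hypothetical helper: run a command under a time limit and report whether it
# finished on its own or had to be killed.
# GNU timeout exits with status 124 when the limit is reached.
run_with_hang_check() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG (killed after ${limit}s)"
    else
        echo "finished with exit status $status"
    fi
}

# Example with the reproducer above (commented out because ./a.out only
# exists after a.cpp has been compiled):
# run_with_hang_check 10 env MKL_CBWR=COMPATIBLE ./a.out
```

Any output of the command itself is printed before the final status line, so the last line is the one to look at.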

Thanks,

Eric

 

Gennady_F_Intel

Thanks, Eric. I see a similar hang with Intel threading (-lmkl_intel_thread) as well. The case has been escalated. We will keep you updated.

Alexander_K_Intel2
Gennady_F_Intel

Eric, the suggested workaround for the problem is to use export KMP_INIT_AT_FORK=FALSE till compiler team will not fix the problem.
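
One way to apply this workaround to a single run, without changing the environment of the whole shell session, is sketched below. Only the variable names and values (KMP_INIT_AT_FORK=FALSE, MKL_CBWR=COMPATIBLE) come from this thread; the helper name `run_with_fork_workaround` is hypothetical.

```shell
# Hypothetical helper: run one command with the workaround variable set,
# leaving the calling shell's own environment untouched.
run_with_fork_workaround() {
    env KMP_INIT_AT_FORK=FALSE MKL_CBWR=COMPATIBLE "$@"
}

# Example with the reproducer from earlier in the thread (commented out
# because ./a.out only exists after a.cpp has been compiled):
# run_with_fork_workaround ./a.out
```

Using `env` scopes the setting to the launched program, which makes it easy to compare a run with the workaround against a run without it.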

EricChamberland

Ok, thanks, it works for me!

You wrote:

"...FALSE till compiler team will not fix the problem."

but I hope you meant "till compiler team will fix the problem."... ???

:)

Thanks,

Eric

 
