MKL 2018 Update 3 (Windows) zgetri crash

AndrewC · ‎06-19-2018

MKL 2018 Update 3 has broken zgetri.

Our QA process now crashes in a call to zgetri. Reverting to the MKL 2018 Update 2 DLLs resolves the issue. The crash does not happen on every call to zgetri.

I will point out that Intel Inspector reports data races in zgetri (and has for a long time).

Thread 14996 (Main Thread), location libiomp5md.dll!__kmp_task_team_wait
[External Code]
libiomp5md.dll!__kmp_task_team_wait(kmp_info * this_thr, kmp_team * team, void * itt_sync_obj, int wait) Line 401
libiomp5md.dll!__kmp_join_barrier(int gtid) Line 2037
libiomp5md.dll!__kmp_join_call(ident * loc, int gtid, fork_context_e fork_context, int exit_teams) Line 7493
libiomp5md.dll!__kmpc_fork_call(ident * loc, int argc, void(*)(int *, int *) microtask) Line 372
mkl_intel_thread.dll!000007feb0d84b70()
mkl_core.dll!000007feacbc61fe()

AndrewC · ‎06-19-2018

I can reproduce the same issue on Linux.

(gdb) bt
#0  0x00007fffe457b288 in __kmp_execute_tasks_64 () from libiomp5.so
#1  0x00007fffe450d38d in _INTERNAL_25_______src_kmp_barrier_cpp_71f3cf03::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /opt/ESI/VAOne2018/libiomp5.so
#2  0x00007fffe450e7a2 in __kmp_fork_barrier(int, int) () from libiomp5.so
#3  0x00007fffe454fb23 in __kmp_launch_thread () from libiomp5.so
#4  0x00007fffe4589c30 in _INTERNAL_26_______src_z_Linux_util_cpp_ea62c7c0::__kmp_launch_worker(void*) ()
   from /opt/ESI/VAOne2018/libiomp5.so
#5  0x00007fffe4291064 in start_thread (arg=0x7fffcb23ea00) at pthread_create.c:309
#6  0x00007fffe0b6262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Gennady_F_Intel · ‎06-19-2018

I see no issues with my internal tests. Could you share a reproducer or the input parameters with us? You could also set MKL_VERBOSE and share the output.

AndrewC · ‎06-20-2018

I will try to get more info using MKL_VERBOSE.

Just to reiterate the steps:

We run thousands of QA tests.
Updating from 2018 Update 2 to Update 3 resulted in a few failures (crashes). Reverting only the MKL DLLs solved the issue.
The failures are in a call to zgetri.
The failures happen on both Windows and Linux (this is usually a helpful hint for debugging, as it eliminates platform issues).
The failures are sensitive to the number of threads. I had to set OMP_NUM_THREADS=8 on Linux (same as Windows).
If OMP_NUM_THREADS=1, there are no failures.

A code review of any changes to zgetri from Update 2 to Update 3 might be worthwhile.

AndrewC · ‎06-20-2018

From Linux

MKL_VERBOSE DGEMM(N,N,696,85,696,0x7ff9fa764008,0x7ff9e0206b80,696,0x7ffa200805c0,696,0x7ff9fa764000,0x7ffa1867c400,696) 1.01ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ZGEMM3M(N,N,85,85,696,0x7ff9fa764150,0x7ff9d6aea0c0,85,0x7ff9e3872300,696,0x7ff9fa764160,0x7ffa28a643c0,85) 671.57us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ZLANGE(1,85,85,0x7ffa013a3780,85,0x22c54c0) 44.34us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8
MKL_VERBOSE ZGETRF(85,85,0x7ffa013a3780,85,0x7ffa2023be80,0) 268.39us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8
MKL_VERBOSE ZGECON(1,85,0x7ffa013a3780,85,0x7fff7c91f950,0x7fff7c91fb80,0x22c6930,0x22c54c0,0) 137.77us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8
MKL_VERBOSE ZGETRI(85,0x7ff9e3ac67c0,85,0x7ffa2023be80,0x7fff7c91e1c8,-1,0) 2.08us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8
line 94: 184309 Segmentation fault

Gennady_F_Intel · ‎06-21-2018

Hi Andrew,

I don't see the problem with this routine on my side. The output is below.

MKL v 2018 u3, 8 threads, zgetri, AVX-based systems

u780583_zgetri_ESI]$ ./a.out

MKL_VERBOSE Intel(R) MKL 2018.0 Update 3 Product build 20180406 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors, Lnx 2.80GHz lp64 intel_thread

MKL_VERBOSE ZGETRF(85,85,0x2261680,85,0x225c010,0) 5.64ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8

... getrf passed with info...0

MKL_VERBOSE ZGETRI(85,0x2261680,85,0x225c010,0x225c170,85,0) 6.64ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8

Could you share the input matrix with us?

AndrewC · ‎06-21-2018

Hi Gennady,

I left out some important information. ZGETRI is being called from multiple boost::threads. If I use only one boost::thread (but leave MKL_NUM_THREADS=8), the problem goes away. So there is some kind of data race, either inside zgetri or, clearly, in my code. I will run Intel Inspector XE to see if I can locate any issues.

Andrew
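One way to narrow this down is to temporarily serialize the calls made from the application threads while leaving MKL's internal threading unchanged. The following is only a minimal sketch: `lapack_mutex` and `call_serialized` are hypothetical names, not part of MKL or of the code discussed in this thread, and the callable passed in would wrap the real zgetrf/zgetri call.

```cpp
#include <mutex>
#include <utility>

// Hypothetical diagnostic helper: one process-wide mutex guards every
// call made through call_serialized().
inline std::mutex& lapack_mutex() {
    static std::mutex m;
    return m;
}

// Runs fn(args...) while holding the mutex and returns its result.
template <typename Fn, typename... Args>
auto call_serialized(Fn&& fn, Args&&... args)
    -> decltype(std::forward<Fn>(fn)(std::forward<Args>(args)...)) {
    std::lock_guard<std::mutex> lock(lapack_mutex());
    return std::forward<Fn>(fn)(std::forward<Args>(args)...);
}
```

If the failures disappear with the calls serialized, that suggests the problem depends on concurrent callers rather than on the number of threads MKL uses internally. It does not by itself show which side the race is on.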

AndrewC · ‎07-11-2018

To close out this issue: I was compiling my code with -axAVX. On both Linux and Windows this seems not to be 100% robust and caused problems. When I removed the -ax option, all my issues were resolved.

AndrewC · ‎07-27-2018

More on this issue: it came back to bite me. While trying to generate a test case, I found that the pivots array generated by zgetrf contained occasional garbage entries (large negative numbers). This causes ZGETRI to crash, of course. It's quite odd, as it only happens occasionally, even with the same input data.
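A cheap guard between the factorization and the inversion can catch such entries before they reach zgetri. The sketch below is an illustration only: `first_bad_pivot` is a hypothetical helper, and `IntT` stands in for whatever integer type the pivot array uses (for example MKL_INT). It relies on the LAPACK convention that pivots are 1-based and that row i is only ever swapped with a row at or below it.

```cpp
#include <cstddef>
#include <vector>

// Returns the 0-based index of the first implausible pivot, or -1 if every
// entry satisfies i <= ipiv(i) <= rows in 1-based LAPACK terms.
template <typename IntT>
long first_bad_pivot(const std::vector<IntT>& ipiv, IntT rows) {
    for (std::size_t i = 0; i < ipiv.size(); ++i) {
        const IntT lowest = static_cast<IntT>(i) + 1;  // 1-based index of row i
        if (ipiv[i] < lowest || ipiv[i] > rows) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Calling this right after zgetrf returns, and logging the index and value when it is not -1, would show whether the garbage is already present at that point or appears later.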

AndrewC · ‎08-10-2018

The problem with zgetri continues. The latest issue is a "bogus" value returned when lwork=-1 (workspace size query). I am getting obviously garbage numbers in work[0].

I have attached a sample program that uses a 7x7 matrix. The sample itself does not fail, but I get failures on this same problem in my code.

zgetri fails Intel Inspector with two data races when called from a parallel region. See attached JPEG.

AndrewC · ‎08-10-2018

Just to add some detail: when zgetri is called in my running threaded code, with the exact same matrix as in the example, it will "sometimes" return garbage numbers in "work", as shown below. If it were working properly, the first entry, cast to int, would be 7.

I am wondering how this can possibly happen. When passed lwork=-1, zgetri should only compute a work array size. I would assume that for a small "n" such as 7, this would be a very simple calculation and should be thread safe.

work={-nan,3.458459520889e-323#DEN}
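A defensive check on the query result avoids allocating or indexing with a garbage size. This is a minimal sketch under stated assumptions: `checked_lwork` is a hypothetical helper, the work array is treated as `std::complex<double>`, and the lower bound used is the LAPACK-documented minimum workspace for zgetri, max(1, n).

```cpp
#include <cmath>
#include <complex>
#include <limits>

// Converts the value reported in work[0] by an lwork = -1 query into a
// usable workspace length. Returns -1 if the value is not plausible:
// non-finite, below max(1, n), or too large to fit in an int.
inline int checked_lwork(std::complex<double> work0, int n) {
    const double size = work0.real();
    const double minimum = (n > 1) ? static_cast<double>(n) : 1.0;
    if (!std::isfinite(size) || size < minimum ||
        size > static_cast<double>(std::numeric_limits<int>::max())) {
        return -1;
    }
    return static_cast<int>(size);
}
```

With the values shown above, the real part of work[0] is NaN, so the helper would return -1 and the caller could fall back to the minimum workspace or report the failure.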

Eugene_C_Intel1 · ‎08-10-2018

Hi Andrew,

Thanks for your report. I tried the reproducer you provided with MKL 2018 Update 3, but it did not fail.

Could you please provide more information regarding the last issue you reported (with incorrect work[0] value):

MKL_VERBOSE output
Platform details: OS, hardware
Environment: number of threads
Link line
Compiler version

Any additional diagnostics on your side would also be very helpful: on Linux, this might be valgrind output, or additional printing (e.g. is garbage actually written to work[0], or was it simply there before the call?). If you can provide a built executable that demonstrates the problem, that would be helpful too.
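One way to answer the "written or already there" question is to store a recognizable sentinel in work[0] before the query and classify what is found afterwards. The sketch below is an assumption-laden illustration: `QueryOutcome`, `query_sentinel`, and `classify_query` are hypothetical names, and the sentinel value is arbitrary.

```cpp
#include <cmath>
#include <complex>

enum class QueryOutcome { NotWritten, WrittenValid, WrittenGarbage };

// Hypothetical sentinel: an unlikely value to store in work[0] before the query.
inline std::complex<double> query_sentinel() {
    return std::complex<double>(-12345.0, -54321.0);
}

// Classifies work[0] after an lwork = -1 query on an n x n matrix,
// assuming work[0] held query_sentinel() before the call.
inline QueryOutcome classify_query(std::complex<double> work0_after, int n) {
    if (work0_after == query_sentinel()) {
        return QueryOutcome::NotWritten;      // nothing stored a result
    }
    const double size = work0_after.real();
    const double minimum = (n > 1) ? static_cast<double>(n) : 1.0;
    if (std::isfinite(size) && size >= minimum) {
        return QueryOutcome::WrittenValid;
    }
    return QueryOutcome::WrittenGarbage;      // overwritten with an implausible value
}
```

A `WrittenGarbage` result only shows that something overwrote work[0] during the call. It could be the routine itself or another thread writing to the same memory, so it is a hint rather than proof.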

I was also able to get the same output from Intel Inspector, with the two data races you mentioned, but I believe both are unrelated to this problem. They involve writing the same pointer/integer value to the same location from different threads. That is definitely a data race, but on most systems such writes are atomic and shouldn't cause any issues. In any case, we will fix it in upcoming releases.
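For readers unfamiliar with this pattern, the sketch below shows the general shape of a "same value written by several threads" cache and one way to make it well-defined in C++ with `std::atomic`. It is purely illustrative and is not MKL's code: `cached_block_size` and the value 64 are hypothetical placeholders.

```cpp
#include <atomic>

// Hypothetical cached setting that several threads may initialize at once.
// Every writer stores the same value, so it does not matter which store
// lands last; std::atomic makes the concurrent accesses well-defined.
inline int cached_block_size() {
    static std::atomic<int> cache{0};      // 0 means "not computed yet"
    int value = cache.load(std::memory_order_relaxed);
    if (value == 0) {
        value = 64;                        // placeholder for a deterministic computation
        cache.store(value, std::memory_order_relaxed);
    }
    return value;
}
```

With a plain `int` in place of the atomic, a race detector would flag both a R/W and a W/W race on the cache even though every thread ends up with the same value.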

AndrewC · ‎08-13-2018

Hi Eugene,

Thanks for your reply. I spent some more time looking at this. The example I sent is probably not going to be a reproducer; I went off on the wrong track with that one, except for one thing. Using Intel Inspector on Windows, I get a R/W data race and a W/W data race inside zgetri in that reproducer. This is very frustrating: the data races reported inside MKL, and in particular in zgetri/zgetrf, make me want to throw my hands in the air when trying to diagnose multithreading problems in my own code.

From my point of view, here is my summary:

After updating from 2018 Update 2 to 2018 Update 3, our QA process fails and/or crashes inside zgetrf/zgetri on both Windows and Linux.
Reverting ONLY the MKL DLLs to Update 2 makes the problems go away.
The crashes/failures are somewhat random and do not happen when we use only one thread to call MKL.
Post-mortem examination of the data shows that the "pivots" array has a garbage number in it (large negative), which of course results in the crash.

I suppose I am asking that someone do a code review of the Update 2 to Update 3 changes for zgetrf/zgetri. I am using MKL_DIRECT_CALL, but the matrices I am seeing the crash in are 384x384.

Andrew