MKL 2018 Update 3 has broken zgetri.
Our QA process now crashes in a call to zgetri. Reverting to the MKL Update 2 DLLs resolves the issue. The crash does not happen on every call to zgetri.
I will point out that Intel Inspector complains about data races in zgetri (and has for a long time).
Not Flagged > 14996 0 Main Thread Main Thread libiomp5md.dll!__kmp_task_team_wait
    [External Code]
    libiomp5md.dll!__kmp_task_team_wait(kmp_info * this_thr, kmp_team * team, void * itt_sync_obj, int wait) Line 401
    libiomp5md.dll!__kmp_join_barrier(int gtid) Line 2037
    libiomp5md.dll!__kmp_join_call(ident * loc, int gtid, fork_context_e fork_context, int exit_teams) Line 7493
    libiomp5md.dll!__kmpc_fork_call(ident * loc, int argc, void(*)(int *, int *) microtask) Line 372
    mkl_intel_thread.dll!000007feb0d84b70()
    mkl_core.dll!000007feacbc61fe()
I can reproduce the same issue on Linux.
(gdb) bt
#0  0x00007fffe457b288 in __kmp_execute_tasks_64 () from libiomp5.so
#1  0x00007fffe450d38d in _INTERNAL_25_______src_kmp_barrier_cpp_71f3cf03::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /opt/ESI/VAOne2018/libiomp5.so
#2  0x00007fffe450e7a2 in __kmp_fork_barrier(int, int) () from libiomp5.so
#3  0x00007fffe454fb23 in __kmp_launch_thread () from libiomp5.so
#4  0x00007fffe4589c30 in _INTERNAL_26_______src_z_Linux_util_cpp_ea62c7c0::__kmp_launch_worker(void*) () from /opt/ESI/VAOne2018/libiomp5.so
#5  0x00007fffe4291064 in start_thread (arg=0x7fffcb23ea00) at pthread_create.c:309
#6  0x00007fffe0b6262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
I will try to get more information using MKL_VERBOSE.
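For reference, MKL reads MKL_VERBOSE from the environment when the library initializes, so it must be set before anything linked against MKL is loaded. A minimal sketch from Python (assuming an MKL-backed build of numpy/scipy; with a standalone binary you would instead set the variable in the shell before launching it):

```python
import os

# Must be set before the first MKL-linked module is imported; every MKL call
# is then traced to stdout with its arguments, timing, and thread counts.
os.environ["MKL_VERBOSE"] = "1"
```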
Just to reiterate the steps:
A code review of any changes to zgetri from Update 2 to Update 3 might be worthwhile.
MKL_VERBOSE DGEMM(N,N,696,85,696,0x7ff9fa764008,0x7ff9e0206b80,696,0x7ffa200805c0,696,0x7ff9fa764000,0x7ffa1867c400,696) 1.01ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE ZGEMM3M(N,N,85,85,696,0x7ff9fa764150,0x7ff9d6aea0c0,85,0x7ff9e3872300,696,0x7ff9fa764160,0x7ffa28a643c0,85) 671.57us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE ZLANGE(1,85,85,0x7ffa013a3780,85,0x22c54c0) 44.34us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGETRF(85,85,0x7ffa013a3780,85,0x7ffa2023be80,0) 268.39us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGECON(1,85,0x7ffa013a3780,85,0x7fff7c91f950,0x7fff7c91fb80,0x22c6930,0x22c54c0,0) 137.77us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGETRI(85,0x7ff9e3ac67c0,85,0x7ffa2023be80,0x7fff7c91e1c8,-1,0) 2.08us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
line 94: 184309 Segmentation fault
I left out some important information: ZGETRI is being called from multiple boost::threads. If I use only one boost::thread (but leave MKL_NUM_THREADS=8), the problem goes away. So either there is some kind of data race inside zgetri, or in my code, clearly. I will run Intel Inspector XE to see if I can locate any issues.
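The multi-threaded calling pattern described above can be stress-tested with a small stand-in: scipy's low-level LAPACK wrappers (zgetrf/zgetri are the real LAPACK routine names) invoked concurrently from several threads, mirroring the boost::thread setup. This is a sketch, not the original reproducer:

```python
import threading
import numpy as np
from scipy.linalg.lapack import zgetrf, zgetri

def invert(a, out, idx):
    """Factor and invert a copy of the matrix; store the inverse in out[idx]."""
    lu, piv, info = zgetrf(a)       # LU factorization (scipy returns 0-based pivots)
    assert info == 0
    inv_a, info = zgetri(lu, piv)   # inverse computed from the LU factors
    assert info == 0
    out[idx] = inv_a

n = 85  # same order as the matrices in the MKL_VERBOSE log
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

results = [None] * 8
threads = [threading.Thread(target=invert, args=(a.copy(), results, i))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread should have produced the same, correct inverse.
for inv_a in results:
    assert np.allclose(a @ inv_a, np.eye(n))
```

If the library under test had a data race in these routines, a loop around this harness would be one way to provoke it.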
To close out this issue: I was compiling my code with -axAVX. On both Linux and Windows this option has not been 100% robust and has caused problems. When I removed the -ax option, all my issues were resolved.
More on this issue. It came back to bite me. While trying to generate a test case, I found that the pivot array generated by zgetrf contained occasional garbage entries (large negative numbers). This causes zgetri to crash, of course. It's quite odd, as it only happens occasionally, even with the same input data.
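A pivot array like the one described can be sanity-checked cheaply right after the factorization. A sketch using scipy's zgetrf as a stand-in (note the index-base difference: scipy shifts LAPACK's 1-based pivots to 0-based, whereas a direct MKL/LAPACK call returns values in 1..n):

```python
import numpy as np
from scipy.linalg.lapack import zgetrf

n = 7
rng = np.random.default_rng(1)
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

lu, piv, info = zgetrf(a)
assert info == 0

# Valid scipy pivots lie in [0, n); large negative entries would indicate
# exactly the kind of corruption reported above.
bad = [int(p) for p in piv if not 0 <= p < n]
assert not bad, f"garbage pivot entries: {bad}"
```

Running such a check between zgetrf and zgetri localizes the failure to the factorization step rather than the inversion.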
The problem with zgetri continues. The latest issue is a bogus value returned when lwork = -1 (the workspace-size query): I am getting obviously garbage numbers in work.
I have attached a sample program using a 7x7 matrix. It does not fail when running this same problem standalone, but I get failures in my code.
zgetri fails Intel Inspector with two data races when called from a parallel region. See the attached JPEG.
Just to add some detail: when zgetri is called in my threaded code, with the exact same matrix as in the example, it will sometimes return garbage numbers in work, as shown below. If it were working properly, the first entry of work, cast to int, should be 7.
I am wondering how this can possibly happen. When passed lwork = -1, zgetri should compute the required work array size. I would assume that for a small n such as 7 this is a very simple calculation and should be thread safe.
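The workspace query being discussed can be exercised via scipy's zgetri_lwork helper, which performs the lwork = -1 call internally and returns work[0]. A sketch of the expected behavior for n = 7:

```python
from scipy.linalg.lapack import zgetri_lwork

n = 7
work, info = zgetri_lwork(n)    # runs zgetri with lwork = -1 under the hood
assert info == 0

# The optimal lwork is reported in work[0]; for zgetri it is at least n
# (typically n times a block size), never a garbage value.
lwork = int(work.real)
assert lwork >= n
```

A correctly functioning zgetri would always satisfy these assertions; garbage in work[0] on only some threads is what points at a race rather than a plain bug.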
Thanks for your report. I tried the reproducer you provided with MKL 2018 Update 3, but it did not fail.
Could you please provide more information regarding the last issue you reported (the incorrect work value):
Any additional diagnostics on your side would also be very helpful (on Linux, for example, valgrind output, or extra printing: e.g., is garbage actually written to work, or was it simply there before the call?). If you can provide a built executable that demonstrates the problem, that would also be helpful.
I was also able to get the same two data races you mentioned from Intel Inspector, but I believe they are both unrelated to this problem. The issues involve writing the same pointer/integer value to the same location from different threads. That is definitely a data race, but on most systems such writes are atomic and shouldn't cause any issues. We will nevertheless fix it in upcoming releases.
Thanks for your reply. I spent some more time looking at this. The example I sent is probably not going to be a reproducer; I went off on a wrong track with that one, except for one thing: using Intel Inspector on Windows, I get a R/W data race and a W/W data race inside zgetri in that reproducer. This is very frustrating when trying to diagnose problems, as the reported data races inside MKL, in particular in zgetri/zgetrf, make me want to throw my hands in the air when diagnosing multithreading problems in my own code.
From my point of view, here is my summary:
I suppose I am asking that someone do a code review of the Update 2 to Update 3 changes to zgetrf/zgetri. I am using MKL_DIRECT_CALL, but the matrices I am seeing the crash with are 384x384.