MKL 2018 Update 3 has broken zgetri.
Our QA process now crashes in a call to zgetri. Reverting to the MKL 2018 Update 2 DLLs resolves the issue. It does not happen on every call to zgetri.
I will point out that Intel Inspector complains about zgetri and data races (and has for a long time...)
Not Flagged > 14996 0 Main Thread Main Thread libiomp5md.dll!__kmp_task_team_wait
[External Code]
libiomp5md.dll!__kmp_task_team_wait(kmp_info * this_thr, kmp_team * team, void * itt_sync_obj, int wait) Line 401
libiomp5md.dll!__kmp_join_barrier(int gtid) Line 2037
libiomp5md.dll!__kmp_join_call(ident * loc, int gtid, fork_context_e fork_context, int exit_teams) Line 7493
libiomp5md.dll!__kmpc_fork_call(ident * loc, int argc, void(*)(int *, int *) microtask) Line 372
mkl_intel_thread.dll!000007feb0d84b70()
mkl_core.dll!000007feacbc61fe()
I can reproduce the same issue on Linux
(gdb) bt
#0 0x00007fffe457b288 in __kmp_execute_tasks_64 () from libiomp5.so
#1 0x00007fffe450d38d in _INTERNAL_25_______src_kmp_barrier_cpp_71f3cf03::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /opt/ESI/VAOne2018/libiomp5.so
#2 0x00007fffe450e7a2 in __kmp_fork_barrier(int, int) () from libiomp5.so
#3 0x00007fffe454fb23 in __kmp_launch_thread () from libiomp5.so
#4 0x00007fffe4589c30 in _INTERNAL_26_______src_z_Linux_util_cpp_ea62c7c0::__kmp_launch_worker(void*) () from /opt/ESI/VAOne2018/libiomp5.so
#5 0x00007fffe4291064 in start_thread (arg=0x7fffcb23ea00) at pthread_create.c:309
#6 0x00007fffe0b6262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
I see no issues with my internal tests. Could you share a reproducer or the input parameters with us? You could also set MKL_VERBOSE and share the output.
I will try and get more info using MKL_VERBOSE
Just to reiterate the steps
- We run thousands of QA tests
- Updating from 2018 Update 2 to Update 3 resulted in a few failures (crashes). Reverting only the MKL DLLs solved the issue
- The failures are in a call to zgetri
- The failures happen on both Windows and Linux (this is usually a helpful hint for debugging, as it eliminates platform issues)
- The failures are sensitive to the number of threads. I had to set OMP_NUM_THREADS=8 on Linux (same as on Windows) to reproduce them
- If OMP_NUM_THREADS=1, there are no failures.
A code review of any changes to zgetri from Update 2->3 might be worthwhile
From Linux
MKL_VERBOSE DGEMM(N,N,696,85,696,0x7ff9fa764008,0x7ff9e0206b80,696,0x7ffa200805c0,696,0x7ff9fa764000,0x7ffa1867c400,696) 1.01ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE ZGEMM3M(N,N,85,85,696,0x7ff9fa764150,0x7ff9d6aea0c0,85,0x7ff9e3872300,696,0x7ff9fa764160,0x7ffa28a643c0,85) 671.57us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4
MKL_VERBOSE ZLANGE(1,85,85,0x7ffa013a3780,85,0x22c54c0) 44.34us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGETRF(85,85,0x7ffa013a3780,85,0x7ffa2023be80,0) 268.39us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGECON(1,85,0x7ffa013a3780,85,0x7fff7c91f950,0x7fff7c91fb80,0x22c6930,0x22c54c0,0) 137.77us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
MKL_VERBOSE ZGETRI(85,0x7ff9e3ac67c0,85,0x7ffa2023be80,0x7fff7c91e1c8,-1,0) 2.08us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
line 94: 184309 Segmentation fault
Hi Gennady,
I left out some important information. ZGETRI is being called from multiple boost::thread(s). If I use only one boost::thread (but leave MKL_NUM_THREADS=8), the problem goes away. So there is some kind of data race, either inside zgetri or, clearly possible, in my own code. I will run Intel Inspector XE to see if I can locate any issues.
Andrew
To close out this issue: I was compiling my code with -axAVX. On both Linux and Windows this option has not seemed 100% robust and has caused problems. When I removed the -ax option, all my issues were resolved.
More on this issue: it came back to bite me. While trying to generate a test case, I found that the pivots array generated by zgetrf contained occasional garbage entries (large negative numbers). This causes ZGETRI to crash, of course. It's quite odd, as it happens only occasionally, even with the same input data.
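A check like the following could be run between zgetrf and zgetri to catch the garbage pivots before they cause a crash. It is only a sketch: `first_bad_pivot` is a hypothetical helper, not an MKL function, and it assumes the LP64 interface (32-bit integer pivots) and a square n x n matrix, for which every pivot entry should be a 1-based row index between 1 and n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of MKL): returns the 0-based position of the
// first pivot entry outside the valid 1-based range [1, n], or -1 if every
// entry is in range. Assumes 32-bit pivot entries (LP64 interface).
long first_bad_pivot(const std::vector<int>& ipiv, int n)
{
    assert(n >= 0);
    for (std::size_t i = 0; i < ipiv.size(); ++i) {
        if (ipiv[i] < 1 || ipiv[i] > n) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

If the helper returns anything other than -1, the factorization output is already corrupt and the zgetri call can be skipped and the inputs logged instead.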
The problem with zgetri continues. The latest issue is a "bogus" value returned when lwork=-1 (the workspace size query). I am getting obviously garbage numbers in work[0].
I have attached a sample program on a 7x7 matrix. The sample itself does not fail, but the same problem fails inside my code.
zgetri fails Intel Inspector with two data races when called from a parallel region. See attached JPEG.
Just to add some detail: when zgetri is called in my running threaded code, with the exact same matrix as in the example, it will "sometimes" return garbage numbers in "work", as shown below. If it were working properly, the first entry, cast to int, should have the value 7.
I am wondering how this can possibly happen. When passed lwork=-1, zgetri should compute a work array size. I would assume that for a small "n" such as 7 this would be a very simple calculation and should be thread safe.
work={-nan,3.458459520889e-323#DEN}
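One way to keep a bad query result from propagating is to validate work[0] before casting it to int. The sketch below is an assumption about how that could look, not code from the attached sample: `lwork_from_query` is a hypothetical helper, and the lower bound comes from the LAPACK requirement that lwork be at least max(1, n) for zgetri.

```cpp
#include <cassert>
#include <climits>
#include <cmath>
#include <complex>

// Hypothetical helper (not part of MKL): converts the work[0] value returned
// by a workspace query (lwork = -1) into an int. Returns false and leaves
// lwork unchanged if the value cannot be a valid workspace size.
bool lwork_from_query(const std::complex<double>& work0, int n, int& lwork)
{
    assert(n >= 0);
    const double size = work0.real();
    const int minimum = (n > 1) ? n : 1;  // LAPACK requires lwork >= max(1, n)
    if (!std::isfinite(size) || size < minimum || size > INT_MAX) {
        return false;  // NaN, infinity, too small, or not representable as int
    }
    lwork = static_cast<int>(size);
    return true;
}
```

With the value shown above (a NaN real part), the helper returns false, so the caller can log the failure rather than allocate a workspace of undefined size.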
Hi Andrew,
Thanks for your report. I ran the reproducer you provided with MKL 2018 Update 3, but it did not fail.
Could you please provide more information regarding the last issue you reported (with incorrect work[0] value):
- MKL_VERBOSE output
- Platform details: OS, hardware
- Environment: number of threads
- Link line
- Compiler version
Any additional diagnostics on your side would also be very helpful: on Linux, for example, valgrind output, or additional printing that shows whether the garbage is actually written to work[0] or was simply there before the call. If you can provide a built executable that demonstrates the problem, that would be helpful too.
I was also able to get the same Intel Inspector output with the two data races you mentioned, but I believe both are unrelated to this problem. Both involve different threads writing the same pointer/integer value to the same location. That is definitely a data race, but on most systems such writes are atomic and shouldn't cause any issues. In any case, we will fix it in an upcoming release.
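The question of whether garbage is written to work[0] or was there before the call could be answered by pre-filling the work array with a known sentinel and checking it afterwards. The following is a minimal sketch of that idea; the names `work_sentinel`, `fill_with_sentinel`, and `is_untouched` and the sentinel value itself are assumptions made for illustration, not part of MKL.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

typedef std::complex<double> zcomplex;

// Hypothetical sentinel: finite and negative, so it cannot be mistaken for a
// workspace size returned by a query.
inline zcomplex work_sentinel() { return zcomplex(-12345.0, -54321.0); }

// Pre-fill the work array before the call under test.
inline void fill_with_sentinel(zcomplex* work, std::size_t count)
{
    assert(work != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        work[i] = work_sentinel();
    }
}

// After the call: true if the entry still holds the sentinel, meaning the
// routine never wrote to it.
inline bool is_untouched(const zcomplex& entry)
{
    return entry == work_sentinel();
}
```

If work[0] still holds the sentinel after the query, the routine returned without writing a size. If it holds some other implausible value, something wrote garbage during the call.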
Hi Eugene,
Thanks for your reply. I spent some more time looking at this. The example I sent is probably not going to be a reproducer; I went off on the wrong track with that one, except for one thing. Using Intel Inspector on Windows, I get a R/W data race and a W/W data race inside zgetri in that reproducer. This is very frustrating: the data races reported inside MKL, and in particular in zgetri/zgetrf, make me want to throw my hands in the air when trying to diagnose multithreading problems in my own code.
From my point of view, here is my summary:
- After updating from 2018 Update 2 to 2018 Update 3, our QA process fails and/or crashes inside zgetrf/zgetri on both Windows and Linux
- Reverting ONLY the MKL DLLs to Update 2 makes the problems go away
- The crashes/failures are somewhat random and do not happen when we use only one thread to call MKL
- Post-mortem examination of the data shows that the "pivots" array has a garbage number in it (large negative), which of course results in the crash
I suppose I am asking that someone do a code review of the Update 2 -> Update 3 changes to zgetrf/zgetri. I am using MKL_DIRECT_CALL, but the matrices I am seeing the crash in are 384x384.
Andrew