Intel® oneAPI Math Kernel Library

Issue introduced in MKL 11.0 Update 4 (64-bit Linux only)

AndrewC
New Contributor II

After installing MKL 11.0 Update 4 over MKL 11.0 Update 2 on Linux, our QA process crashes with SIGSEGV at:

#0  0x00002aaab745874a in mkl_serv_malloc ()
#1  0x00002aaab7f6bbcc in mkl_blas_mc3_dgemm_get_bufs ()
#2  0x00002aaab6ae8a99 in mkl_blas_mc3_xdgemm_par ()
#3  0x00002aaab4c2cf74 in mkl_blas_xdgemm_par ()
#4  0x00002aaab4b81ecb in mkl_blas_dgemm_2d_bsrc ()
#5  0x00002aaab4b7b489 in gemm_host ()
#6  0x00002aaabb92b4f3 in L_kmp_invoke_pass_parms ()
   from /opt/intel/composer_xe_2013.4.183/compiler/lib/intel64/libiomp5.so

100% reproducible in certain cases.

Reverting to MKL 11.0 Update 2 solves the issue.

It seems to happen after many iterations, and after many computation threads have been created and destroyed.

Note we are running multiple (boost) threads that call MKL. We call MKL_Thread_Free_Buffers at the completion of each thread.
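
For readers unfamiliar with this usage pattern, the following is a minimal sketch of it, not the actual application code. It uses `std::thread` in place of Boost threads, and the names `MklThreadBufferGuard`, `worker`, and `run_workers` are hypothetical. When built with `-DUSE_MKL` and linked against MKL it calls `cblas_dgemm` and `mkl_thread_free_buffers` (the lowercase spelling of `MKL_Thread_Free_Buffers`); without that flag it falls back to a naive stand-in so the sketch still compiles and runs.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

#ifdef USE_MKL
#include <mkl.h>  // cblas_dgemm, mkl_thread_free_buffers
#endif

// C = A * B for n x n row-major matrices.
static void multiply(const std::vector<double>& a, const std::vector<double>& b,
                     std::vector<double>& c, int n) {
#ifdef USE_MKL
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
#else
    // Naive stand-in (assumption: used only when MKL is not available).
    const std::size_t m = static_cast<std::size_t>(n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < m; ++k) sum += a[i * m + k] * b[k * m + j];
            c[i * m + j] = sum;
        }
#endif
}

// Hypothetical RAII guard: releases MKL's per-thread buffers when the
// thread body exits, including on an early return or exception.
struct MklThreadBufferGuard {
    ~MklThreadBufferGuard() {
#ifdef USE_MKL
        mkl_thread_free_buffers();
#endif
    }
};

// Hypothetical worker: repeats the same DGEMM and returns C[0][0].
static double worker(int n, int iterations) {
    assert(n > 0);
    MklThreadBufferGuard guard;
    const std::size_t count = static_cast<std::size_t>(n) * static_cast<std::size_t>(n);
    std::vector<double> a(count, 1.0), b(count, 2.0), c(count, 0.0);
    for (int it = 0; it < iterations; ++it) multiply(a, b, c, n);
    return c[0];  // every entry equals 2.0 * n
}

// Creates and destroys num_threads threads, each calling into the BLAS.
double run_workers(std::size_t num_threads, int n, int iterations) {
    std::vector<double> results(num_threads, 0.0);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t)
        threads.emplace_back([&results, t, n, iterations] {
            results[t] = worker(n, iterations);
        });
    for (auto& th : threads) th.join();
    double total = 0.0;
    for (double r : results) total += r;
    return total;
}
```

Calling `run_workers` repeatedly in a loop would approximate the "many iterations, many threads created and destroyed" condition described above, although nothing here is known to reproduce the crash.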



Gennady_F_Intel
Moderator

Andrew, how can we reproduce the issue?

AndrewC
New Contributor II

The only way to reproduce is for Intel to have a copy of our software and an evaluation license from us. I will pursue this through premier support.

Gennady_F_Intel
Moderator

OK. We will take up this issue as soon as you submit it there.

AndrewC
New Contributor II

OK, I created a ticket. I noted that, to reproduce the issue, Intel will have to download a 400 MB installer and a license file, but I have had no response on that point.

No doubt this will be a painful process for everyone to reproduce, but I cannot use MKL 11.0 Update 4 until this is resolved.

AndrewC
New Contributor II

Premier support issue # 697704

TimP
Black Belt

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.
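
As an illustration of the thread-stack remedy, here is a minimal sketch using POSIX threads; it is not taken from the application in question, and `thread_body` and `run_with_stack_size` are hypothetical names. It creates one thread with an explicitly requested stack size. The global (main-thread) stack limit is typically raised from the shell with `ulimit -s`, and the stack of OpenMP worker threads with the `OMP_STACKSIZE` environment variable, rather than in code.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Hypothetical thread body: only records that it ran.
static void* thread_body(void* arg) {
    *static_cast<bool*>(arg) = true;
    return nullptr;
}

// Creates one thread with the requested stack size and returns the size
// recorded in the thread attributes, or 0 if any step fails.
std::size_t run_with_stack_size(std::size_t stack_bytes) {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;

    std::size_t reported = 0;
    bool ran = false;
    pthread_t tid;
    if (pthread_attr_setstacksize(&attr, stack_bytes) == 0 &&
        pthread_attr_getstacksize(&attr, &reported) == 0 &&
        pthread_create(&tid, &attr, thread_body, &ran) == 0) {
        pthread_join(tid, nullptr);  // after join, `ran` is safe to read
    } else {
        reported = 0;
    }
    pthread_attr_destroy(&attr);
    return ran ? reported : 0;
}
```

For example, `run_with_stack_size(16u * 1024u * 1024u)` requests a 16 MiB stack for the new thread. With Boost threads, the comparable knob is, as far as I know, `boost::thread::attributes::set_stack_size`; treat that as an assumption to check against the Boost version in use.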

SergeyKostrov
Valued Contributor II

>>...Seems to happen after many iterations...

Do you have that SIGSEGV error after all threads released memory and completed (destroyed)? Or in the middle, or at the end, of processing?

This is what MSDN says about that very obsolete signal-error processing constant:

...
SIGSEGV Illegal storage access. The default action terminates the calling program.
...

AndrewC
New Contributor II

Not sure what you mean by "obsolete". On Linux, signals such as SIGSEGV are a fundamental part of the OS. A segmentation violation is caused by accessing an illegal address, for example by dereferencing a NULL pointer.

AndrewC
New Contributor II

TimP (Intel) wrote:

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigse...

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.

The details are that MKL 11.0 Update 2 passes 300-400 QA tests without failure, while MKL 11.0 Update 4 fails 6+ of those tests with a segmentation violation inside MKL, reproducibly. I have supplied Premier Support with a reproducible example. I will update this thread with the results.

AndrewC
New Contributor II

Currently I am having to give the Premier support person a tutorial in GDB.

But here's a clue for anyone at Intel who cares about this issue.

Does this look like a race condition in MKL?

Thread 1 is crashing with a segmentation violation in....

#11 0x00002aaab75d40da in mkl_serv_malloc ()
   from /opt/intel/composer_xe_2013.4.183/mkl/lib/intel64/libmkl_core.so
#12 0x00002b93a4980aec in mkl_blas_mc3_dgemm_get_bufs ()

Thread 2 is calling

#0  0x00002aaab75dfe00 in mkl_blas_dgemm_set_blks_size ()
#1  0x00002aaab66135d9 in gemm_host ()

Shane_S_Intel
Employee

Hi Andrew, we definitely care and the local MKL team is now looking into the issue. We will report back once we have more information. -Shane

AndrewC
New Contributor II

I just installed MKL 11.0 Update 5 and the problem has gone away. It looks like someone found and fixed the issue.

AndrewC
New Contributor II

To close the loop on this issue: Intel Premier Support confirmed there was an issue in Update 4 and that it was fixed in Update 5. Thanks guys!

Gennady_F_Intel
Moderator

We are always happy to help you :)
