AndrewC
New Contributor I

Issue introduced in MKL 11.0 Update 4 (64-bit Linux only)

After installing MKL 11.0 Update 4 over MKL 11.0 Update 2 on Linux, our QA process crashes with SIGSEGV at:

#0  0x00002aaab745874a in mkl_serv_malloc ()
#1  0x00002aaab7f6bbcc in mkl_blas_mc3_dgemm_get_bufs ()
#2  0x00002aaab6ae8a99 in mkl_blas_mc3_xdgemm_par ()
#3  0x00002aaab4c2cf74 in mkl_blas_xdgemm_par ()
#4  0x00002aaab4b81ecb in mkl_blas_dgemm_2d_bsrc ()
#5  0x00002aaab4b7b489 in gemm_host ()
#6  0x00002aaabb92b4f3 in L_kmp_invoke_pass_parms ()
   from /opt/intel/composer_xe_2013.4.183/compiler/lib/intel64/libiomp5.so

100% reproducible in certain cases.

Reverting to MKL Update 2 solves the issue.

It seems to happen after many iterations, once many compute threads have been created and destroyed.

Note that we run multiple (Boost) threads that call MKL. Each thread calls MKL_Thread_Free_Buffers when it completes.
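For readers following along, the threading pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: it uses std::thread in place of Boost threads, the names multiply, worker and run_rounds are hypothetical, and the matrix sizes are arbitrary. When mkl.h is not available, the sketch falls back to a naive multiply and a no-op stand-in for mkl_thread_free_buffers so that it still builds.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

#if defined(__has_include)
#  if __has_include(<mkl.h>)
#    include <mkl.h>
#    define HAVE_MKL 1
#  endif
#endif

#ifndef HAVE_MKL
// Stand-in used only when MKL is not installed; the real routine
// releases the MKL memory buffers owned by the calling thread.
static void mkl_thread_free_buffers() {}
#endif

// Computes C = A * B for n x n row-major matrices (hypothetical helper).
static void multiply(const std::vector<double>& a, const std::vector<double>& b,
                     std::vector<double>& c, int n) {
#ifdef HAVE_MKL
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
#else
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
#endif
}

// One short-lived worker: do a DGEMM, then free this thread's MKL buffers.
static void worker(int n, double* first_entry) {
    assert(n > 0);
    const std::size_t count = static_cast<std::size_t>(n) * static_cast<std::size_t>(n);
    std::vector<double> a(count, 1.0), b(count, 2.0), c(count, 0.0);
    multiply(a, b, c, n);
    *first_entry = c[0];          // every entry of C equals 2 * n
    mkl_thread_free_buffers();    // called at the completion of each thread
}

// Repeatedly creates and destroys a batch of threads, as in the report.
// Returns true if every thread produced the expected result.
static bool run_rounds(int rounds, int threads, int n) {
    bool ok = true;
    for (int r = 0; r < rounds; ++r) {
        std::vector<double> results(static_cast<std::size_t>(threads), 0.0);
        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back(worker, n, &results[static_cast<std::size_t>(t)]);
        for (auto& th : pool) th.join();
        for (double value : results) ok = ok && (value == 2.0 * n);
    }
    return ok;
}
```

Calling run_rounds(1000, 8, 64) would mimic "many iterations, many threads created and destroyed". Whether such a loop triggers the Update 4 crash is not something this thread establishes.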



Gennady_F_Intel
Moderator

Andrew, how can we reproduce the issue?

AndrewC
New Contributor I

The only way to reproduce is for Intel to have a copy of our software and an evaluation license from us. I will pursue this through premier support.

Gennady_F_Intel
Moderator

OK. We will take up this issue as soon as you submit it there.

AndrewC
New Contributor I

OK, I created a ticket. I explained that to reproduce the issue Intel will have to download a 400 MB installer and a license file, but I have had no response to that yet.

No doubt this will be a painful process for everyone to reproduce, but I cannot use MKL 11.0 Update 4 until this is resolved.

AndrewC
New Contributor I

Premier support issue # 697704

TimP
Black Belt

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.

SergeyKostrov
Valued Contributor II

>>...Seems to happen after many iterations...

Do you get the SIGSEGV error after all threads have released their memory and completed (been destroyed)? Or in the middle, or at the end, of processing?

This is what MSDN says about that very obsolete signal-error processing constant:

...
SIGSEGV Illegal storage access. The default action terminates the calling program.
...

AndrewC
New Contributor I

Not sure what you mean by "obsolete"? On Linux, signals such as SIGSEGV are a fundamental part of the OS. A segmentation violation is caused by accessing an illegal address, for example by dereferencing a NULL pointer.
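As a small illustration of that point, here is a minimal sketch for Linux (or another POSIX system) in which a child process writes through a null pointer and the parent reports which signal terminated it. The function names are hypothetical, and the write is deliberately invalid, which is why it is isolated in a forked child.

```cpp
#include <cassert>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that writes through a null pointer and returns the number
// of the signal that terminated it (0 if it exited normally, -1 on error).
int terminating_signal_of_null_write() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Both the pointer and the pointee are volatile so the compiler
        // cannot assume the pointer value or drop the store.
        volatile int* volatile p = nullptr;
        *p = 42;      // invalid access: the kernel delivers SIGSEGV
        _exit(0);     // not reached when the access faults
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

// Example check: on Linux the child is expected to die from SIGSEGV.
void demo_null_write_raises_sigsegv() {
    assert(terminating_signal_of_null_write() == SIGSEGV);
}
```

A crash inside a library routine such as mkl_serv_malloc is reported by the same mechanism; the signal itself only says that an illegal address was touched, not which component is at fault.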

AndrewC
New Contributor I

TimP (Intel) wrote:

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigse...

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.

The details are that MKL 11.0 Update 2 passes 300-400 QA tests without failure, while MKL 11.0 Update 4 reproducibly fails 6+ of those tests with a segmentation violation inside MKL. I have supplied Premier Support with a reproducible example. I will update this thread with the results.

AndrewC
New Contributor I

Currently I am having to give the Premier support person a tutorial in GDB.

But here's a clue for anyone at Intel who cares about this issue.

Does this look like a race condition in MKL?

Thread 1 is crashing with a segmentation violation in:

#11 0x00002aaab75d40da in mkl_serv_malloc ()
   from /opt/intel/composer_xe_2013.4.183/mkl/lib/intel64/libmkl_core.so
#12 0x00002b93a4980aec in mkl_blas_mc3_dgemm_get_bufs ()

Thread 2 is calling:

#0  0x00002aaab75dfe00 in mkl_blas_dgemm_set_blks_size ()
#1  0x00002aaab66135d9 in gemm_host ()

Shane_S_Intel
Employee

Hi Andrew, we definitely care and the local MKL team is now looking into the issue. We will report back once we have more information. -Shane

AndrewC
New Contributor I

I just installed MKL 11.0 Update 5 and the problem has gone away. It looks like someone found and fixed the issue.

AndrewC
New Contributor I

To close the loop on this issue: Intel Premier Support confirmed that there was an issue in Update 4 and that it was fixed in Update 5. Thanks, guys!

Gennady_F_Intel
Moderator

You are always welcome; we are glad to help :)
