Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Issue introduced in MKL 11.0 Update 4 (64-bit Linux only)

AndrewC
New Contributor III
2,076 Views

After installing MKL 11.0 Update 4 over MKL 11.0 Update 2 on Linux, our QA process crashes with SIGSEGV at:

#0  0x00002aaab745874a in mkl_serv_malloc ()
#1  0x00002aaab7f6bbcc in mkl_blas_mc3_dgemm_get_bufs ()
#2  0x00002aaab6ae8a99 in mkl_blas_mc3_xdgemm_par ()
#3  0x00002aaab4c2cf74 in mkl_blas_xdgemm_par ()
#4  0x00002aaab4b81ecb in mkl_blas_dgemm_2d_bsrc ()
#5  0x00002aaab4b7b489 in gemm_host ()
#6  0x00002aaabb92b4f3 in L_kmp_invoke_pass_parms ()
   from /opt/intel/composer_xe_2013.4.183/compiler/lib/intel64/libiomp5.so

100% reproducible in certain cases.

Reverting to MKL Update 2 solves the issue.

It seems to happen after many iterations, with many computation threads having been created and destroyed.

Note we are running multiple (boost) threads that call MKL. We call MKL_Thread_Free_Buffers at the completion of each thread.
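
For readers unfamiliar with the pattern described above, here is a minimal sketch of short-lived worker threads that each run a matrix multiply and release per-thread buffers on exit. It is not the poster's code: `std::thread` stands in for the boost threads, and `gemm_stand_in` and `free_thread_buffers_stand_in` are hypothetical placeholders for the MKL DGEMM call and for `MKL_Thread_Free_Buffers`, so the sketch builds without MKL.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for an MKL DGEMM call:
// C = A * B for n x n row-major matrices.
void gemm_stand_in(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<double>& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Hypothetical stand-in for MKL_Thread_Free_Buffers; it only counts calls.
std::atomic<int> g_buffers_freed{0};
void free_thread_buffers_stand_in() { ++g_buffers_freed; }

// RAII guard so the cleanup runs on every exit path of the thread body.
struct ThreadBufferGuard {
    ~ThreadBufferGuard() { free_thread_buffers_stand_in(); }
};

// Body of one worker thread: multiply, then release buffers via the guard.
double worker(std::size_t n) {
    ThreadBufferGuard guard;
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    gemm_stand_in(a, b, c, n);
    return c[0];  // every entry of C equals 2 * n for these inputs
}

// Creates and destroys `threads_per_round` threads, `rounds` times over.
// Returns how many workers produced the expected result.
int run_rounds(int rounds, int threads_per_round, std::size_t n) {
    std::atomic<int> ok{0};
    for (int r = 0; r < rounds; ++r) {
        std::vector<std::thread> pool;
        for (int t = 0; t < threads_per_round; ++t) {
            pool.emplace_back([&ok, n] {
                if (worker(n) == 2.0 * static_cast<double>(n)) ++ok;
            });
        }
        for (std::thread& th : pool) th.join();
    }
    return ok.load();
}
```

Calling `run_rounds(5, 4, 8)` creates and joins 20 threads in total. In a real application the guard's destructor would call the MKL routine instead of incrementing a counter.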



14 Replies
Gennady_F_Intel
Moderator

Andrew, how can we reproduce the issue?

AndrewC
New Contributor III

The only way to reproduce is for Intel to have a copy of our software and an evaluation license from us. I will pursue this through premier support.

Gennady_F_Intel
Moderator

OK. We will take up this issue as soon as you submit it there.

AndrewC
New Contributor III

OK, I created a ticket. I noted that, to reproduce the issue, Intel will have to download a 400 MB installer and a license file, but I have had no response on that point.

No doubt this will be a painful process for everyone to reproduce, but I cannot use MKL 11.0 Update 4 until this is resolved.

AndrewC
New Contributor III

Premier support issue # 697704

TimP
Honored Contributor III

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.
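
As an illustration of the thread-stack remedy mentioned above, here is a minimal sketch, assuming Linux and POSIX threads, that starts a thread with an explicitly enlarged stack and queries the main thread's stack limit (the value `ulimit -s` controls). The sizes and function names are made up for the example. For OpenMP worker threads the usual knob is the `OMP_STACKSIZE` environment variable rather than code like this.

```cpp
#include <pthread.h>
#include <sys/resource.h>
#include <cstddef>

// Illustrative sizes only; pick values that fit the application.
constexpr std::size_t kBigLocalBytes = 16u * 1024u * 1024u;     // 16 MiB local buffer
constexpr std::size_t kThreadStackBytes = 64u * 1024u * 1024u;  // 64 MiB thread stack

// Soft limit for the main thread's stack, in bytes (0 if the query fails).
rlim_t main_stack_soft_limit() {
    struct rlimit rl {};
    if (getrlimit(RLIMIT_STACK, &rl) != 0) return 0;
    return rl.rlim_cur;
}

// Thread body: touches one byte per 4 KiB page of a large stack-allocated
// buffer, which needs far more stack than a small thread stack provides.
static void* touch_big_local(void* out) {
    volatile char buf[kBigLocalBytes];
    std::size_t pages = 0;
    for (std::size_t i = 0; i < kBigLocalBytes; i += 4096) {
        buf[i] = 1;
        pages += static_cast<std::size_t>(buf[i]);
    }
    *static_cast<std::size_t*>(out) = pages;
    return nullptr;
}

// Runs touch_big_local on a thread created with a 64 MiB stack.
// Returns true if the thread was created and joined successfully.
bool run_with_large_stack(std::size_t* pages_touched) {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return false;
    bool ok = pthread_attr_setstacksize(&attr, kThreadStackBytes) == 0;
    pthread_t tid;
    ok = ok && pthread_create(&tid, &attr, touch_big_local, pages_touched) == 0;
    pthread_attr_destroy(&attr);
    return ok && pthread_join(tid, nullptr) == 0;
}
```

If a crash disappears once the stack is enlarged this way, that points at stack exhaustion rather than a library bug, which is the kind of check the post above is asking about.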

SergeyKostrov
Valued Contributor II

>>...Seems to happen after many iterations...

Do you have that SIGSEGV error after all threads released memory and completed (destroyed)? Or in the middle, or at the end, of processing?

This is what MSDN says about that very obsolete signal-error processing constant:

... SIGSEGV Illegal storage access. The default action terminates the calling program. ...

AndrewC
New Contributor III

Not sure what you mean by "obsolete"? On Linux, signals such as SIGSEGV are a fundamental part of the OS. A segmentation violation can be caused by accessing an illegal address, such as dereferencing a NULL pointer.
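
To make the point concrete, here is a minimal sketch, assuming Linux, that forks a child process, has the child write through a null pointer, and reports which signal terminated it. The function name is made up for the example. Writing through a null pointer is undefined behavior in C++; the `volatile` pointer is only there to stop the compiler from optimizing the store away.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the number of the signal that terminated the child,
// 0 if the child exited normally, or -1 if fork/waitpid failed.
int signal_from_null_write() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: make sure the default action (terminate) is in effect.
        std::signal(SIGSEGV, SIG_DFL);
        int* volatile p = nullptr;  // volatile so the load of p is not folded away
        *p = 42;                    // illegal store: the kernel delivers a signal
        _exit(0);                   // reached only if the store did not fault
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a typical Linux system the returned value should equal `SIGSEGV`. The parent process survives because the fault happens in the child.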

AndrewC
New Contributor III

TimP (Intel) wrote:

I hope you put some of the missing details in your issue submission.

I don't see any clues as to which checklists you have followed; there are several good ones, including

http://software.intel.com/en-us/articles/determining-root-cause-of-sigse...

I can't even guess whether you explored simple remedies such as increasing stack (both global and thread stack) or using heap options.

The details are that MKL 11 Update 2 passes 300-400 QA tests without failure, while MKL Update 4 reproducibly fails 6+ of those tests with a segmentation violation inside MKL. I have supplied premier support with a reproducible example. I will update this thread with the results.

AndrewC
New Contributor III

Currently I am having to give the Premier support person a tutorial in GDB.

But here's a clue for anyone at Intel who cares about this issue.

Does this look like a race condition in MKL?

Thread 1 is crashing with a segmentation violation in:

#11 0x00002aaab75d40da in mkl_serv_malloc ()
   from /opt/intel/composer_xe_2013.4.183/mkl/lib/intel64/libmkl_core.so
#12 0x00002b93a4980aec in mkl_blas_mc3_dgemm_get_bufs ()

Thread 2 is calling

#0  0x00002aaab75dfe00 in mkl_blas_dgemm_set_blks_size ()
#1  0x00002aaab66135d9 in gemm_host ()

Shane_S_Intel
Employee

Hi Andrew, we definitely care and the local MKL team is now looking into the issue. We will report back once we have more information. -Shane

AndrewC
New Contributor III

I just installed MKL 11 Update 5 and the problem has gone away. It looks like someone found and fixed the issue.

AndrewC
New Contributor III

To close the loop on this issue: Intel premier support confirmed there was an issue in Update 4 and that it was fixed in Update 5. Thanks guys!

Gennady_F_Intel
Moderator

We are always happy to help you :)
