MKL buffer management and linking issues (incl crashes)

erling_andersen · ‎05-06-2013

Our application MOSEK links with the static version of MKL and we have some issues in that regard.

Note that our application is a .so or DLL that is linked with other applications by our users, and those users may also use MKL. For instance, our application is linked to MATLAB (www.mathworks.com), which also uses MKL.

1. The first issue is that you say mkl_free_buffers can always be called i.e.

http://software.intel.com/en-us/forums/topic/277599

In our experience that is not the case if MKL is called from multiple threads, because then the application may crash. Should we always be able to call mkl_free_buffers unconditionally?

2. On 64-bit Linux using Intel C 13.0.0 our application runs fine if we do not call mkl_thread_free_buffers (we use that function instead of mkl_free_buffers because of the issues mentioned under 1). However, if we do call mkl_thread_free_buffers it crashes. Should it not always work?

3. It seems that if we call mkl_disable_fast_mm the problems go away. However, if we do that, is only our static MKL library affected? Or is the user's application, e.g. MATLAB, also affected?

4. Do you have any information about how to deal with the situation where an application may be linked with two different versions of MKL, for instance one static and one dynamic?

To us the buffer management seems to be a major pain, both to get information about and to figure out how it works. Can you shed any light on the issues we are having?

SergeyKostrov · ‎05-06-2013

>>...In our experience that is not the case if MKL is called from multiple threads because then the application may crash...

More technical details with a multi-threaded reproducer are needed to understand what could be wrong. Does it apply to mkl_free_buffers only or to other MKL functions as well?

>>...Do you have any information about how to deal with the situation where an application may be linked with two different
>>versions of MKL...

I didn't try that. However, runtime binding of MKL functions from MKL DLLs of different versions is a better solution (more flexible, and it allows any number of different MKL DLLs to be used). It is based on the classic approach of calling the Win32 API function LoadLibrary, followed by calls to GetProcAddress for all MKL functions that need to be used.
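A minimal sketch of the runtime-binding approach described above, written so that it also builds on Linux with dlopen/dlsym in place of LoadLibrary/GetProcAddress. The names rt_open, rt_bind and rt_close are hypothetical helpers, not part of MKL, and the library and symbol names are left to the caller; nothing here has been verified against a particular MKL version.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
typedef HMODULE rt_lib;
#else
#include <dlfcn.h>
typedef void *rt_lib;
#endif

/* Generic function-pointer type; callers cast it to the real signature. */
typedef void (*rt_func)(void);

/* Load a shared library by name; returns NULL on failure. */
rt_lib rt_open(const char *name)
{
#ifdef _WIN32
    return LoadLibraryA(name);
#else
    /* RTLD_LOCAL keeps the library's symbols out of the global namespace. */
    return dlopen(name, RTLD_NOW | RTLD_LOCAL);
#endif
}

/* Look up one exported function; returns NULL if the symbol is missing. */
rt_func rt_bind(rt_lib lib, const char *symbol)
{
    assert(symbol != NULL);   /* passing no name is a programming error */
    if (lib == NULL)
        return NULL;
#ifdef _WIN32
    return (rt_func)GetProcAddress(lib, symbol);
#else
    rt_func fn = NULL;
    *(void **)(&fn) = dlsym(lib, symbol);   /* POSIX idiom: void* to function pointer */
    return fn;
#endif
}

/* Unload the library once none of its bound pointers are used any more. */
void rt_close(rt_lib lib)
{
    if (lib == NULL)
        return;
#ifdef _WIN32
    FreeLibrary(lib);
#else
    dlclose(lib);
#endif
}
```

A caller would open the MKL runtime library by its file name, bind each function it needs, cast the returned pointer to that function's documented prototype, and call through the pointer. Because every library handle has its own set of pointers, several libraries can in principle be loaded side by side.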

erling_andersen · ‎05-06-2013

We really prefer linking the sequential static version and not the DLL version. One reason is that we want to make sure we use the sequential version of MKL. If we use the DLL, can we be 100% sure about that, even if our library is linked into another application that uses multithreaded MKL? Also, we must not prevent the linking application from running multithreaded.

Regarding mkl_free_buffers, this function generates a segmentation fault in some cases. It seems to happen if our application runs two threads, each of which uses MKL. One thread may finish before the other and then call mkl_free_buffers, and this seems to cause a crash.

If we do not use mkl_free_buffers everything runs fine.
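One possible workaround for the scenario above, sketched under assumptions the thread does not confirm: the release is deferred until the last active thread has finished, so buffers are never freed while another thread may still be inside the library. The functions blas_region_enter and blas_region_leave are hypothetical, and the release hook is a stand-in for a routine such as mkl_free_buffers; whether this actually avoids the crash has not been established.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer-release routine. */
typedef void (*release_hook)(void);

static pthread_mutex_t region_lock = PTHREAD_MUTEX_INITIALIZER;
static int active_threads = 0;   /* threads currently between enter and leave */

/* Call before a thread starts a batch of BLAS calls. */
void blas_region_enter(void)
{
    pthread_mutex_lock(&region_lock);
    ++active_threads;
    pthread_mutex_unlock(&region_lock);
}

/* Call when a thread is done. Returns 1 if the hook ran, 0 if it was deferred. */
int blas_region_leave(release_hook release)
{
    int released = 0;
    pthread_mutex_lock(&region_lock);
    assert(active_threads > 0);
    if (--active_threads == 0 && release != NULL) {
        release();      /* runs under the lock, so no thread can enter meanwhile */
        released = 1;
    }
    pthread_mutex_unlock(&region_lock);
    return released;
}

/* Demo hook that only counts how often it was invoked. */
static int demo_release_calls = 0;
void demo_release(void) { ++demo_release_calls; }
int demo_release_count(void) { return demo_release_calls; }
```

The lock is held only while the counter changes and while the hook runs, so the BLAS calls themselves still execute in parallel.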

erling_andersen · ‎05-06-2013

Btw, if you link using a DLL, how much will

mkl_disable_fast_mm

affect? Assume you have a program that is linked directly with MKL and a third-party DLL that is also linked with MKL. Now assume the third-party DLL calls

mkl_disable_fast_mm

Will that also affect the MKL DLL linked directly into the program?

SergeyKostrov · ‎05-06-2013

>>...We really prefer linking the sequential static version and not the DLL version...

The question was how to use several MKL libraries of different versions, right? Now, try to link several MKL libs of different versions and you will see what happens. Once again, I didn't try that, but I would expect a linker error along the lines of "symbol / function is already defined". With the runtime binding approach you do not have any such limitations.

>>Regarding mkl_free_buffers, this function generates a segmentation fault in some cases. It seems to happen if our application
>>runs two threads. One thread may finish before the other and then call mkl_free_buffers, and this seems to cause a crash.

It would be nice if you could provide a reproducer. What does '...in some cases...' mean?

erling_andersen · ‎05-06-2013

Let me reduce my questions to the following two.

1) What is the effect of disabling fast MM using

mkl_disable_fast_mm

on average, in your experience? (Our experience is that the performance decrease is negligible.)

2) Assume you have a DLL (or .so) that links MKL statically. If you call the function

mkl_disable_fast_mm()

will this only affect the DLL's usage of MKL? Or does it also affect the calling application's usage of MKL? In other words, what is the scope of the global variable you modify with this function? We think it should only be the DLL, but we are not sure.

Note that our DLL exports only a limited set of symbols; all others are made private.

SergeyKostrov · ‎05-06-2013

>>1) What is the effect of disabling fast MM using
>>
>>mkl_disable_fast_mm
>>
>>on average, in your experience?

I will verify it and post results as soon as my test is completed.

SergeyKostrov · ‎05-07-2013

I don't see significant differences. Here are the outputs of the tests I just completed:

[ Test 1 - MKL MM is On ]

    Sub-Test 1.1 - Runtime binding of MKL functions
    Dynamic library mkl_rt.dll loaded
    Initialization Done
    Sub-Test 1.3
    Intel(R) Math Kernel Library Version 10.3.12 Product Build 20120831 for 32-bit applications
    Major version : 10
    Minor version : 3
    Update version : 12
    MKL Memory Management is Turned On
    Sub-Test 3.2 - SGEMM
    Matrix multiplication C[ 2048x2048 ] = A[ 2048x2048 ] * B[ 2048x2048 ]
    Allocating memory for matrices ( 32-byte alignment )
    Intializing matrix data - Started
    Intializing matrix data - Completed
    Measuring performance of SGEMM function
    Iteration 01 - Completed in 4.094 secs
    Iteration 02 - Completed in 4.000 secs
    Iteration 03 - Completed in 4.000 secs
    Iteration 04 - Completed in 3.984 secs
    Iteration 05 - Completed in 3.984 secs
    Deallocating memory
    Dynamic library mkl_rt.dll unloaded

[ Test 2 - MKL MM is Off ]

    Sub-Test 1.1 - Runtime binding of MKL functions
    Dynamic library mkl_rt.dll loaded
    Initialization Done
    Sub-Test 1.3
    Intel(R) Math Kernel Library Version 10.3.12 Product Build 20120831 for 32-bit applications
    Major version : 10
    Minor version : 3
    Update version : 12
    MKL Memory Management is Turned Off
    Sub-Test 3.2 - SGEMM
    Matrix multiplication C[ 2048x2048 ] = A[ 2048x2048 ] * B[ 2048x2048 ]
    Allocating memory for matrices ( 32-byte alignment )
    Intializing matrix data - Started
    Intializing matrix data - Completed
    Measuring performance of SGEMM function
    Iteration 01 - Completed in 4.000 secs
    Iteration 02 - Completed in 3.984 secs
    Iteration 03 - Completed in 3.985 secs
    Iteration 04 - Completed in 3.984 secs
    Iteration 05 - Completed in 3.985 secs
    Deallocating memory
    Dynamic library mkl_rt.dll unloaded

SergeyKostrov · ‎05-07-2013

It also doesn't matter which C++ compiler is used; the performance results are very consistent:

    ...
    MKL Memory Management is Turned Off
    ...

[ Intel C++ compiler ]

    ...
    Iteration 01 - Completed in 3.953 secs
    Iteration 02 - Completed in 3.954 secs
    Iteration 03 - Completed in 3.953 secs
    Iteration 04 - Completed in 3.953 secs
    Iteration 05 - Completed in 3.947 secs
    ...

[ Microsoft C++ compiler ]

    ...
    Iteration 01 - Completed in 3.969 secs
    Iteration 02 - Completed in 3.953 secs
    Iteration 03 - Completed in 3.953 secs
    Iteration 04 - Completed in 3.953 secs
    Iteration 05 - Completed in 3.958 secs
    ...

[ MinGW C++ compiler ]

    ...
    Iteration 01 - Completed in 3.969 secs
    Iteration 02 - Completed in 3.953 secs
    Iteration 03 - Completed in 3.953 secs
    Iteration 04 - Completed in 3.953 secs
    Iteration 05 - Completed in 3.953 secs
    ...

[ Borland C++ compiler ]

    ...
    Iteration 01 - Completed in 3.968 secs
    Iteration 02 - Completed in 3.969 secs
    Iteration 03 - Completed in 3.953 secs
    Iteration 04 - Completed in 3.953 secs
    Iteration 05 - Completed in 3.959 secs

erling_andersen · ‎05-07-2013

You confirm our observation. We have therefore just disabled the fast memory management, since it makes our application crash occasionally. Without it things seem to run perfectly.

Btw, the error handling, i.e. figuring out when MKL is running out of memory, is also very ugly. This is more likely when fast MM is turned off. I do understand that this is something you have inherited from the original BLAS/Fortran.

SergeyKostrov · ‎05-08-2013

>>... the error handling, i.e. figuring out when MKL is running out of memory, is also very ugly...

Could you explain how you've implemented it with MKL functions? Also, it is possible to use the Win32 API or WMI COM interfaces to find out how much memory is available at some moment between calls to MKL functions.

erling_andersen · ‎05-10-2013

We define a custom xerbla function. In case of an error it modifies a global flag. Occasionally we check the global flag.

However, if you run multiple dgemms in parallel and just one fails, you will have to treat them all as errors. Each function should return its own error code.

I would very much like to know, for each BLAS call, whether it succeeded or failed, and the reason why. I also want this to work in multithreaded cases, i.e. when I call dgemm in multiple threads.
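One way to get closer to a per-call status with the existing error-handler mechanism is to replace the global flag with a thread-local record. The sketch below shows only that idea; blas_record_error and blas_take_status are hypothetical names, the routine name is assumed to be NUL-terminated, and it relies on the unverified assumption that the error handler runs on the same thread that made the failing BLAS call.

```c
#include <assert.h>
#include <string.h>

/* Per-thread record of the most recent BLAS error (hypothetical layout). */
typedef struct {
    int  failed;        /* 0 = no error since the last reset */
    int  info;          /* error code passed to the handler */
    char routine[16];   /* name of the routine that reported it */
} blas_status;

/* One independent copy per thread, zero-initialized. */
static _Thread_local blas_status tls_status;

/* Body of a custom error handler: record the error for the calling thread only. */
void blas_record_error(const char *routine, int info)
{
    assert(routine != NULL);
    tls_status.failed = 1;
    tls_status.info = info;
    strncpy(tls_status.routine, routine, sizeof tls_status.routine - 1);
    tls_status.routine[sizeof tls_status.routine - 1] = '\0';
}

/* Return this thread's status and reset it for the next call. */
blas_status blas_take_status(void)
{
    blas_status s = tls_status;
    memset(&tls_status, 0, sizeof tls_status);
    return s;
}
```

A custom xerbla replacement would forward its routine name and info code to blas_record_error, and each thread would call blas_take_status right after its own dgemm call. A failure in one thread then no longer marks the calls made by other threads as failed.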

SergeyKostrov · ‎05-11-2013

>>...However, if you run multiple dgemms in parallel and just one fails, you will have to treat them all as errors...

Do you mean it happens in an application that has, for example, two threads running in parallel that both need to execute dgemm? Do they use the same input data or different data? Please clarify.

erling_andersen · ‎05-12-2013

Background: Our application (www.mosek.com) solves optimization problems, e.g. linear programs. The engine requires that we do a sparse Cholesky factorization in every iteration. The parallelized sparse Cholesky employs sequential BLAS.

If we solve two LPs in parallel, the sequential BLAS may be called from two different threads on independent data. Although we have been informed that mkl_free_buffers can be called at any time, it sometimes makes the application crash.

We have not tried to create a small example because we first wanted to investigate whether we have misunderstood something. Also, disabling the buffer stuff now seems to solve the issue.

SergeyKostrov · ‎05-13-2013

>>...If we solve two LPs in parallel, the sequential BLAS may be called from two different threads on independent data...

If dgemm doesn't use any internal state variables or structures declared as global or static (that is, outside of the function), then it should work. Of course, you need to create a small test case to prove it before starting larger implementation work in a real application.

erling_andersen · ‎05-14-2013

I will work on a test case when I have time. Since turning the buffer stuff off works very well, it does not have a high priority right now. [I hope somebody else finds the bug :-).] Thanks for your replies.