mkl_free_buffers() and multiple threads

bochkanov__sergey · ‎01-17-2018

Hello!

I have a question about the behaviour of mkl_free_buffers() in a multithreaded setting. Is it OK if one thread performs some activity with Intel MKL while another thread calls mkl_free_buffers() at the same time?

The reason I ask is that in some cases (heavily multithreaded C# TPL code calling MKL functions) the .NET framework spawns A LOT of temporary threads, which (when combined with the fast MM feature of MKL) results in constantly leaking memory. So, I want to patch it by periodically calling mkl_free_buffers() from a "garbage collector" thread. Say, calling it every 10 seconds.
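A minimal sketch of the periodic cleanup idea described above, assuming the cleanup is passed in as a callback so the scheduling can be shown without MKL. The class name PeriodicCleaner is hypothetical and not part of MKL. Whether the concurrent call itself is safe is the question this thread asks; the sketch only covers running something every N seconds and stopping cleanly.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not part of MKL): runs `cleanup` once per `period`
// on a background thread until the object is destroyed.
class PeriodicCleaner {
public:
    PeriodicCleaner(std::function<void()> cleanup, std::chrono::milliseconds period)
        : cleanup_(std::move(cleanup)), period_(period) {
        assert(cleanup_ && "a cleanup callback is required");
        worker_ = std::thread([this] { run(); });
    }

    ~PeriodicCleaner() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_one();
        worker_.join();
    }

    PeriodicCleaner(const PeriodicCleaner&) = delete;
    PeriodicCleaner& operator=(const PeriodicCleaner&) = delete;

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        // wait_for returns true only once stop_ is set; a false result
        // means a full period elapsed, so the cleanup is due.
        while (!wake_.wait_for(lock, period_, [this] { return stop_; })) {
            lock.unlock();
            cleanup_();  // e.g. mkl_free_buffers() in the real application
            lock.lock();
        }
    }

    std::function<void()> cleanup_;
    std::chrono::milliseconds period_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool stop_ = false;
    std::thread worker_;
};
```

With mkl.h included, the intended use would be something like `PeriodicCleaner cleaner([] { mkl_free_buffers(); }, std::chrono::seconds(10));`, kept alive for the lifetime of the library.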

The MKL forum gives contradictory answers (some posters from Intel say that this function is thread-safe, and some users report errors associated with this function). And the MKL manual ( https://software.intel.com/en-us/mkl-developer-reference-c-mkl-free-buffers ) says in its "Usage of mkl_free_buffers with FFT Functions" section that calling mkl_free_buffers() in the middle of working code will result in a failure.

Can you finally clarify this question? :)

Ying_H_Intel · ‎01-18-2018

Hi Sergey,

It seems your environment is more complex than the example in the MKL manual.

As the manual says: in a threaded application, avoid calling mkl_free_buffers from each thread because the function has a global effect. Call mkl_thread_free_buffers instead.

So basically you should call mkl_thread_free_buffers instead of mkl_free_buffers in your multithreaded C# environment.

As for multithreading and memory leaks: if possible, we recommend using sequential MKL in your multithreaded environment and disabling fast memory management, as described in the MKL developer guide:

https://software.intel.com/en-us/mkl-windows-developer-guide-avoiding-memory-leaks-in-intel-mkl

When running, Intel MKL allocates and deallocates internal buffers to facilitate better performance. However, in some cases this behavior may result in memory leaks.

To avoid memory leaks, you can do either of the following:

1. Set the MKL_DISABLE_FAST_MM environment variable to 1 or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL functions, especially for small problem sizes.
2. Call the mkl_free_buffers() function or the mkl_thread_free_buffers() function in the current thread.

Best Regards,

Ying

bochkanov__sergey · ‎01-18-2018

Hi, Ying!

1. My actual question was "will MKL break if I call mkl_free_buffers() while it works?" Would you like to clarify this point?

Your reply basically states "do not call mkl_free_buffers() in multithreaded programs", but you do not tell me why. Is it performance - doing so will just degrade performance, without breaking down the program? Is it stability - doing so will destroy some structures currently in use?

So, I am asking - why? :)

2. The problem with "avoid calling mkl_free_buffers(), use mkl_thread_free_buffers()" is that it is not an option here.

MKL is used as the linear algebra backend for another numerical library, which, in turn, is used by a larger application. Due to performance considerations we cannot call mkl_thread_free_buffers() every time we exit from our library. And calling mkl_thread_free_buffers() when a thread dies is also impossible: because the larger application is written by other people, we do not know which call to our library is the last one for a particular thread.

Ying_H_Intel · ‎01-26-2018

Hi Sergey,

The program shouldn't break if you call mkl_free_buffers() while MKL is working, because the function frees only "unused" buffers for all threads.

Yes, it will degrade performance, because buffers have to be reallocated on subsequent calls to Intel MKL functions.

I would also like to clarify what "leaking memory" means here. Is there a real memory leak?

As you have seen, Intel MKL allocates and deallocates internal buffers at run time to improve performance, and in some cases this behavior may look like a memory leak. But once execution ends, all of that memory should be freed, so there is no real leak. So when you mention leaking memory, do you mean that some memory is still not freed at the end of execution, or growth caused by the mechanism "Intel MKL allocates and deallocates internal buffers"?

If it is the former, let's investigate what the real problem is.

If it is the latter, you can do either of the following:
1. Set the MKL_DISABLE_FAST_MM environment variable to 1 or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL functions, especially for small problem sizes.
2. Call the mkl_free_buffers() function or the mkl_thread_free_buffers() function in the current thread.

I would suggest disabling fast MM, which should also work in your case and is better than constantly disrupting execution to clean buffers. You can choose between the two based on your own tests.
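A minimal sketch of the environment-variable option, assuming the process sets MKL_DISABLE_FAST_MM itself before MKL is first used. The helper name request_mkl_fast_mm_off is hypothetical. Exactly when MKL reads the variable is an assumption here, so setting it in the launching shell or calling the documented mkl_disable_fast_mm() at startup may be the more dependable route.

```cpp
#include <stdlib.h>

// Hypothetical helper: sets MKL_DISABLE_FAST_MM=1 for the current process.
// Assumption: this runs before the first MKL call, so that MKL sees the
// variable when it initialises its memory manager.
inline bool request_mkl_fast_mm_off() {
#if defined(_WIN32)
    return _putenv_s("MKL_DISABLE_FAST_MM", "1") == 0;
#else
    return setenv("MKL_DISABLE_FAST_MM", "1", /*overwrite=*/1) == 0;
#endif
}
```

On Windows the variable set this way is only visible to code sharing the same C runtime, which is one more reason to prefer the function call when MKL lives in a separate DLL.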

Best Regards

Ying

bochkanov__sergey · ‎01-26-2018

1. Thank you for answer!

By the way, it would be good to tweak the manual entry on mkl_free_buffers(), because it states (in its FFT example) that calling this function will break MKL.

2. On the "leak". I can not be 100% sure here, but...

2A. It looks like it is not a leak in the strict sense of the word. Intel MKL correctly keeps track of allocated/deallocated/cached memory (at least, I have no evidence to the contrary).

2B. This fast memory management feature seems to be unsuited for an environment which constantly creates and destroys new threads. Microsoft .NET Task Parallel Library (and entire .NET framework in general) is an example of such environment.

Although the programmer may think that there is a limited number of managed worker threads, internally TPL may create/destroy many native threads and constantly migrate the execution context from one native thread to another. The same managed thread may be represented by different underlying native threads at different moments of its lifetime.

2C. Because buffers allocated for now-gone threads survive until the end of the application, we may have a situation where MKL keeps allocating more and more per-thread buffers while the old ones sit unused (they were used once or twice by a native thread which is now gone).

Using the mkl_thread_free_buffers() function is not an option, because it should be called before the thread terminates... but we do not know when it will terminate! Microsoft .NET hides such implementation details from us.

2D. Finally, the problem is greatly amplified by the fact that MKL may allocate as much as 5MB of memory for a simple 64x64 * 64x64 GEMM product (I traced its malloc() calls). So, memory "leaks" really fast. 50GB of memory may be exhausted in less than an hour.
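The figures above can be turned into a rough thread churn rate. This is a back-of-the-envelope sketch under two assumptions that are not in the post: every short-lived native thread leaves a full 5MB behind, and units are decimal (1GB = 1000MB). The function names are hypothetical.

```cpp
// Number of abandoned per-thread buffers needed to fill `total_mb`.
constexpr double threads_to_exhaust(double total_mb, double per_thread_mb) {
    return total_mb / per_thread_mb;
}

// Average number of new native threads per second that fills `total_mb`
// within `seconds`.
constexpr double threads_per_second(double total_mb, double per_thread_mb,
                                    double seconds) {
    return threads_to_exhaust(total_mb, per_thread_mb) / seconds;
}

// 50GB at 5MB per thread is 10,000 abandoned threads.
static_assert(threads_to_exhaust(50.0 * 1000.0, 5.0) == 10000.0,
              "50GB / 5MB per thread");
```

Spread over one hour (3600 seconds), that is a little under three new native threads per second, which is a plausible rate for a busy thread pool.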

And some applications are intended to be used for a month, or two, or even "eternally"... so even a minor "leak" will become a huge problem.

Ying_H_Intel · ‎02-01-2018

Hi Sergey,

Thank you for the details about the "leak". I will bring them into our developer team.

Regarding the other option, calling the mkl_disable_fast_mm() function: have you tried it? (It may affect performance, but perhaps not by much.)

Best Regards,

Ying

bochkanov__sergey · ‎02-02-2018

Hi, Ying!

Thank you for your reply!

1. Yes, I tried mkl_disable_fast_mm(). My use case involves repeated multiplication of small matrices, roughly 64x64, and it turns out that for such small sizes performance without fast MM is only as good as the malloc() implementation being used. Single-threaded performance and multicore scaling are two separate questions.

Say, I never noticed any significant slowdown on my 4-core Ubuntu desktop (GCC, glibc malloc), whether serial or multithreaded. But under Windows (MSVC, Microsoft CRT malloc) it started to scale badly with just 2 worker threads :(

2. One important note on the "leak": it is really hard to reproduce. I talked with two independent companies which were hit by the "leak", but I was unable to reproduce it on my Windows box no matter what.

One of the companies used TSLab, a .NET trading environment, on an 8-core system under moderate CPU load. The "leak" is present there: memory leaks every second that MKL functions are in use. The other performed a lot of high-load, heavily threaded scientific calculations on a 16-core, 2-way hyperthreaded system; some applications were hit by the "leak" and some were not. It may depend on the specific version of the .NET framework and on how the .NET TPL features are used.

So, I described it as well as I can in my previous post... but it is possible that you will be unable to reproduce it on your servers.

Richard_G_5 · ‎09-26-2019

In case it is of any interest, we have hit this exact issue. We diagnosed it (eventually) and found the solution, and then came across this discussion whilst attempting to verify that mkl_free_buffers() is safe in a multithreaded system (of course it transpires there is a better way, using thread specific calls.)

Our system uses TPL calling into a native dll, which accesses MKL, on Windows. The buffer allocations being per thread, and allocated by MKL, no problems appeared in any of our unit tests or console applications, even ones driven through the API using .Net - until we called in using the TPL.
We proved the issue by catching the underlying native thread ids (which are different from the ones shown in .Net) and seeing the allocations rising with each additional thread id.
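A minimal sketch of this kind of diagnostic, assuming the native library can record the calling thread at each entry point. The class name ThreadCensus is hypothetical, and it uses the portable std::thread::id rather than the Windows thread id used in the post. Thread ids can be reused after a thread exits, so the count is a lower bound on how many threads have passed through.

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>

// Hypothetical diagnostic helper: remembers every distinct thread id that
// has entered the native library.
class ThreadCensus {
public:
    // Call at each library entry point. Returns true the first time the
    // calling thread's id is seen.
    bool record_current_thread() {
        std::lock_guard<std::mutex> lock(mutex_);
        return seen_.insert(std::this_thread::get_id()).second;
    }

    // Number of distinct thread ids recorded so far.
    std::size_t distinct_threads() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return seen_.size();
    }

private:
    mutable std::mutex mutex_;
    std::set<std::thread::id> seen_;
};
```

Logging the process memory (or MKL's own mkl_mem_stat() reading) each time record_current_thread() returns true would show whether allocations grow with every new thread id.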

Calling mkl_free_buffers() does prevent the increase in memory allocation. However, we're implementing a per-thread buffer free, at the point the thread is released.

Sergey - thanks for your descriptions - they match precisely what we had determined we were seeing.

Ying - thanks for your information.

Found using MKL 2018, Win32, Intel Parallel Studio XE Composer Edition 2017, Windows SDK 10.17763, Visual Studio 2017 (v141), .Net 4.7, Windows 10

T__Petter · ‎12-06-2019

Hi Richard G,

After a couple of weeks of debugging, I finally found the cause of my app's memory problems after reading this post. I have exactly the same combination of TPL and MKL. In my case, mkl_free_buffers() seems to solve the problem, but your solution of freeing memory when a native thread is released seems even better. I cannot, however, see how to run code at the point the thread is released. Could you point me in the right direction?

Kind regards,

Petter T

Richard_G_5 · ‎12-13-2019

Hi T, Petter,

The solution we implemented, and which (so far as we can tell) works perfectly, is as follows:

In the DllMain function, implement functions to handle thread and process detachment:

    switch (fdwReason)
    {
    case DLL_PROCESS_ATTACH:
        HandleProcessAttachment();
        break;
    case DLL_PROCESS_DETACH:
        HandleProcessDetachment();   // This function for when the process exits
        break;
    case DLL_THREAD_DETACH:
        HandleThreadDetachment();    // This function for when a thread exits
        break;
    }

In the function implementations, do the following (in addition to whatever else needs doing in them)

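void HandleProcessDetachment() noexcept
{
    // ...

    // Release all MKL buffers, for every thread, when the process detaches.
    mkl_free_buffers();
}

void HandleThreadDetachment()
{
    // Release only the buffers that belong to the exiting thread.
    mkl_thread_free_buffers();
}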

This way, each time a thread is detached, the buffer allocated for it is cleared, and when the process is detached, all buffers are cleared.
We see no negative performance impacts, and the memory leak issues no longer manifest.
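For code that cannot hook DllMain, one portable alternative is a C++ thread_local object whose destructor runs when the thread exits. This is a different technique from the one described above, shown only as a sketch. The names ThreadExitGuard and arm_thread_cleanup are hypothetical, and the cleanup is passed as a function pointer so the example compiles without MKL. Whether the destructor fires for threads created outside the C++ runtime (such as .NET pool threads) should be verified on the target platform.

```cpp
#include <cassert>
#include <thread>

using ThreadCleanupFn = void (*)();

// Hypothetical helper: runs `fn` from its destructor, which for a
// thread_local instance happens when the owning thread exits.
class ThreadExitGuard {
public:
    explicit ThreadExitGuard(ThreadCleanupFn fn) : fn_(fn) {
        assert(fn_ != nullptr);
    }
    ~ThreadExitGuard() { fn_(); }

    ThreadExitGuard(const ThreadExitGuard&) = delete;
    ThreadExitGuard& operator=(const ThreadExitGuard&) = delete;

private:
    ThreadCleanupFn fn_;
};

// Call at the top of every library entry point that may use MKL.
// The first call on a given thread constructs the guard; later calls on
// the same thread do nothing.
inline void arm_thread_cleanup(ThreadCleanupFn fn) {
    thread_local ThreadExitGuard guard(fn);
    (void)guard;
}
```

In the real library the call would be `arm_thread_cleanup(&mkl_thread_free_buffers);`, assuming the MKL function's calling convention matches the plain function pointer type used here.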

HTH

Richard G