Solved: MKL Memory Allocator

Arnaud · 08-09-2017

Hello,

I am looking for information about the memory allocator embedded in MKL. We use MKL_malloc/MKL_free intensively in a project and are planning to add a memory manager on top of them. Our goals are to reuse aligned memory without freeing it, to have per-thread memory pools, and to be able to fine-tune those pools. (Memory consumption is indeed a challenge for us.)

The mkl_disable_fast_mm page refers to a per-thread memory pool but gives no further details. Does anyone have more information? (Is it a lock-free malloc with a per-thread heap? Can the available memory in the pool be monitored? And so on.)

We are also considering deactivating the MKL memory manager and relying on the Intel TBB malloc implementation instead (either by redefining the memory functions with i_malloc, or with a std::vector-with-custom-allocator style implementation). Does anyone have feedback on such an implementation?

Thank you,
Arnaud


Zhen_Z_Intel · 08-09-2017

Hi,

First of all, I would like to know whether you are calling MKL functions. mkl_malloc is effectively the same as an aligned malloc, which means that any code, whether it is an MKL function or not, can access memory allocated by mkl_malloc. mkl_disable_fast_mm, however, only affects MKL functions: it makes them stop using their own internal buffer allocator (i_malloc) and use malloc for their buffers instead. The problem is that you cannot access this buffer memory from other threading code (TBB/pthread), because MKL does not give you pointers to it.

Your purpose is not very clear to me, so please define what you mean by "per thread memory pool". Do you mean calling malloc in each thread? Or using TLS? If you do not use MKL functions, you could rely entirely on TBB for thread control, and TBB provides concurrent container classes which are lock free. I am not

I advise you to provide pseudocode describing the kind of memory control you would like to use. That would help me understand.

Best regards,
Fiona

Arnaud · 08-10-2017

Hi,

To give you a bit of context, we use cblas (mostly level 1 and 2) and VSL functions on vectors.

I would like to know whether MKL_malloc/MKL_free are only proxies for _aligned_malloc/_aligned_free on the Windows platform, or whether they include a buffer system (memory pool?) that avoids unnecessary, repetitive malloc/free calls.

In our context, we use sequential MKL and run one thread per core (our platform has 72 cores). Each thread dequeues jobs, and each job looks like this:

DoJob(args):
    MKL_malloc  // init of temporary vectors
    // several calls to CBLAS, or recursive calls to DoJob
    MKL_free    // release of temporary vectors

When there are two successive calls to DoJob, we would like the MKL_malloc of the second call to reuse the memory that would have been freed at the end of the first call (if the vector is of the same size). We would like to minimize the contention caused by allocating from a single heap (malloc and free are not lockless).
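A minimal sketch of the reuse described above, assuming a size-keyed free list kept in thread-local storage. ThreadLocalPool, local_pool, aligned_alloc_stub and aligned_free_stub are hypothetical names introduced only for this illustration; the two stubs stand in for MKL_malloc/MKL_free so that the sketch builds without MKL, and the 64-byte alignment is an assumption as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for MKL_malloc/MKL_free (assumption: swap in the real
// calls when building against MKL). std::aligned_alloc needs a size that is a
// multiple of the alignment, so the request is rounded up.
inline void* aligned_alloc_stub(std::size_t bytes, std::size_t alignment) {
    std::size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, rounded);
}
inline void aligned_free_stub(void* p) { std::free(p); }

// One instance per thread: released blocks are cached by exact size and handed
// back on the next request of the same size, without touching the global heap.
class ThreadLocalPool {
public:
    static constexpr std::size_t kAlignment = 64;  // assumed alignment

    ~ThreadLocalPool() { release_all(); }

    void* acquire(std::size_t bytes) {
        assert(bytes > 0);
        auto it = free_blocks_.find(bytes);
        if (it != free_blocks_.end() && !it->second.empty()) {
            void* p = it->second.back();  // reuse a cached block
            it->second.pop_back();
            cached_bytes_ -= bytes;
            return p;
        }
        return aligned_alloc_stub(bytes, kAlignment);  // cache miss
    }

    // Keep the block for later reuse by this thread instead of freeing it.
    void release(void* p, std::size_t bytes) {
        if (p == nullptr) return;
        free_blocks_[bytes].push_back(p);
        cached_bytes_ += bytes;
    }

    // Simple monitoring hook: bytes currently held in the cache.
    std::size_t cached_bytes() const { return cached_bytes_; }

    // Return every cached block to the underlying allocator.
    void release_all() {
        for (auto& entry : free_blocks_) {
            for (void* p : entry.second) aligned_free_stub(p);
            entry.second.clear();
        }
        cached_bytes_ = 0;
    }

private:
    std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
    std::size_t cached_bytes_ = 0;
};

// Each thread lazily gets its own pool, so acquire/release need no locking.
inline ThreadLocalPool& local_pool() {
    thread_local ThreadLocalPool pool;
    return pool;
}
```

In DoJob, the temporary vectors would then come from local_pool().acquire(n * sizeof(double)) and go back through local_pool().release(p, n * sizeof(double)). A block released on a different thread than the one that acquired it simply ends up in that other thread's cache. Nothing in this sketch caps the cache size, so a real version would probably call release_all() or enforce a byte limit once cached_bytes() grows too large.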

As far as I understand, this buffer mechanism is in place for the internal memory of functions like DGEMM or FFT, and we can free those buffers with mkl_thread_free_buffers. I would like to know whether it is also present in MKL_malloc/MKL_free and, if so, whether it is a local buffer for each thread or a global buffer shared across all threads.

My point about TBB and the per-thread basis is that TBB offers a memory allocator that works on a per-thread basis and minimizes the contention that comes from repetitive malloc/free calls (almost lock-free malloc?). Depending on the memory management in place in MKL, we may be interested in switching to cache_aligned_allocator (the padding is 128 bytes, therefore compatible with MKL functions). Any advice will be greatly appreciated.
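For illustration, the std::vector route could look roughly like the sketch below. AlignedAllocator and AlignedVector are hypothetical names, and the allocator is a plain C++17 stand-in built on aligned operator new so that the example compiles without TBB; with TBB available, tbb::cache_aligned_allocator<T> would take its place as the second template argument. The 64-byte default alignment is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical stand-in for a TBB allocator: a minimal standard-conforming
// allocator that returns storage aligned to `Alignment` bytes.
template <typename T, std::size_t Alignment = 64>
struct AlignedAllocator {
    using value_type = T;

    static_assert((Alignment & (Alignment - 1)) == 0, "alignment must be a power of two");
    static_assert(Alignment >= alignof(T), "alignment too small for T");

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Explicit rebind is required because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        assert(n <= static_cast<std::size_t>(-1) / sizeof(T));  // no size overflow
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

// The allocator is stateless, so any two instances are interchangeable.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// A vector whose data() pointer is aligned and can be passed to CBLAS/VSL calls.
template <typename T>
using AlignedVector = std::vector<T, AlignedAllocator<T>>;
```

An AlignedVector<double> v(n) then replaces a raw MKL_malloc/MKL_free pair and releases its storage automatically at scope exit. This stand-in only guarantees alignment; avoiding heap contention would come from the TBB allocator that is swapped in, not from this code.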

Thank you for your help!

Arnaud

Zhen_Z_Intel · 08-10-2017

Hi,

Firstly, mkl_malloc/mkl_free have the same functionality as _aligned_malloc/_aligned_free, no matter whether you are on Windows or Linux. They are only meant for allocating memory for input/output data, not for the buffers used during computation. Buffer management during computation is encapsulated inside the MKL functions; it is not exposed to developers, and you cannot obtain pointers into this buffer memory pool.

MKL only exposes a few interfaces for controlling how the buffer memory pool is used, such as mkl_disable_fast_mm, mkl_free_buffers and so on. The allocator that MKL functions use for their own buffers is actually malloc; unlike TBB, it is not concerned with contention between threads, because each thread mallocs its own buffer space. An MKL computational function does not free this buffer memory when it finishes the calculation: for example, if I call dgemm first and then daxpy, the buffer space for both dgemm and daxpy remains allocated. The only way to release the buffer space is to call mkl_free_buffers/mkl_thread_free_buffers. You could refer to this example to see how the internal buffer management of MKL functions works. MKL functions do not free buffer space by default in order to improve performance, since the space may be reused by the next MKL function.

Next, turning to TBB. TBB can be used for threading control in any C++ project, but it does not affect the internal buffer management of MKL functions. The TBB memory allocator is designed to reduce contention when threads allocate from a single global heap (memory pool). With TBB, you could template the DoJob class so that it allocates memory in a scalable way or a cache-aligned way.
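As a rough sketch of what templating DoJob on the allocator might look like: the function below takes the allocator as a template template parameter and defaults to std::allocator so that it compiles without TBB. The name do_job, its signature, and the dot product used in place of a real CBLAS call are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical job: the allocator used for the temporary vectors is chosen at
// compile time. With TBB, do_job<tbb::scalable_allocator>(n) or
// do_job<tbb::cache_aligned_allocator>(n) would select the TBB allocators.
template <template <typename> class Alloc = std::allocator>
double do_job(std::size_t n) {
    assert(n > 0);
    // Temporary vectors, allocated through the chosen allocator.
    std::vector<double, Alloc<double>> x(n, 1.0);
    std::vector<double, Alloc<double>> y(n, 2.0);
    // Stand-in for the CBLAS work (for example a level 1 dot product).
    return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
    // x and y are released through Alloc when they go out of scope.
}
```

Choosing the allocator this way keeps the job body unchanged, so the two TBB allocators can be benchmarked against the default one by changing a single template argument.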

Best regards,
Fiona