Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL Fast Memory and mkl_thread_free(), possible memory leak?

Jason_Jones1
Beginner
882 Views
I have a fairly large application that spins threads using pthreads. MKL functions, particularly the double precision blas and lapack functions, are called inside these threads. Previously, we were calling mkl_free_buffers() when all the threads returned. That seemed to work fine. However, the complexity of the application has increased and it is not possible to know when another thread might be running an MKL function.

I have some questions:

1) Should I be able to call MKL functions inside of pthreads with the Fast Memory Management enabled?
2) Should I need to do anything to clean up after the MKL functions before I kill my pthread?

I think the answer to number 2 is yes, you need to call mkl_thread_free_buffers() before you kill the thread. However, that function seems to not be working correctly (or maybe operator error?).

I have a simplified example program that shows a possible problem with mkl_thread_free_buffers(), but maybe the answers to these 2 questions will take care of that.
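To make question 2 concrete, the pattern being asked about could look roughly like the sketch below: the thread entry point runs the BLAS/LAPACK work and then calls a per-thread cleanup function just before it returns. The names worker_job, worker_main, and demo_cleanup are made up for illustration. In a real build the cleanup pointer would presumably be mkl_thread_free_buffers from mkl_service.h, and whether that call is sufficient is exactly what this thread is about.

```c
#include <stddef.h>

/* Hypothetical helper types for illustration; not part of MKL. */
typedef void (*thread_cleanup_fn)(void);

typedef struct {
    void (*work)(void *ctx);    /* body that calls BLAS/LAPACK routines */
    void *ctx;                  /* argument handed to work()            */
    thread_cleanup_fn cleanup;  /* e.g. mkl_thread_free_buffers         */
    int cleanup_ran;            /* set to 1 once cleanup() has run      */
} worker_job;

/* Has the signature pthread_create() expects for a start routine. */
static void *worker_main(void *arg)
{
    worker_job *job = arg;

    if (job->work != NULL)
        job->work(job->ctx);

    /* Release per-thread buffers on the thread that allocated them,
       right before the thread ends. */
    if (job->cleanup != NULL) {
        job->cleanup();
        job->cleanup_ran = 1;
    }
    return NULL;
}

/* Stand-in cleanup so the sketch builds without MKL. */
static int demo_cleanup_calls = 0;
static void demo_cleanup(void) { demo_cleanup_calls++; }
```

With pthreads this would be started as pthread_create(&tid, NULL, worker_main, &job). A thread that is cancelled rather than allowed to return skips the cleanup call unless the cleanup is registered with pthread_cleanup_push.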

I am running on SUSE using gcc with MKL version 10.2. However, a colleague in the office is seeing the same behavior with Visual Studio 2008 on Windows 7.

Much thanks,

Jason
Jason_Jones1
Beginner
Here is my simple test case in pseudo code. Full text is at the bottom of the post.
  1. call dgemm
  2. call mkl_mem_stat
  3. print output from mkl_mem_stat
  4. call mkl_thread_free_buffers()
  5. call mkl_mem_stat
  6. print output from mkl_mem_stat // expect to see 0 allocated bytes, but I do not.
Here is the output that I get from my printf's:
10683392 bytes allocated in 12 buffers
8539008 bytes allocated in 9 buffers
And if I run the same code in a for loop, the buffers seem to grow. Output looks like this:
10683392 bytes allocated in 12 buffers
8539008 bytes allocated in 9 buffers
12827776 bytes allocated in 15 buffers // more buffers, more bytes
8539008 bytes allocated in 9 buffers
14972160 bytes allocated in 18 buffers // more buffers, more bytes
8539008 bytes allocated in 9 buffers
My compile/link line looks like this:
gcc -Wall mkl_memtest.c -Wl,--start-group -L. -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -lpthread
The code for this stripped down example is below.
[cpp]
#include <stdio.h>
#include <stdlib.h>
#include "mkl_types.h"
#include "mkl_service.h"
#include "mkl_blas.h"
#include "omp.h"

int main()
{

int i,m, n, k,j,g;
double alpha,beta;
char transa, transb;
double *a, *b, *c;
transa = transb = 'n';
m = n = k = 1000;
alpha = 1;
beta = 0;
int nBuffers;
long bAllocated;
a = malloc(m*m*sizeof(double));
b = malloc(m*m*sizeof(double));
for(g = 0;g<2;g++)
{
   for(i=0;i<3;i++)
   {
      for(j=0;j<m*m;j++)
      {
         a[j] = 1.0;
         b[j] = 1.0;
      }
      c = malloc(m*m*sizeof(double));

      dgemm(&transa,&transb,&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);

      bAllocated = mkl_mem_stat(&nBuffers);
      printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Expect to see some buffers used for call to dgemm

      mkl_thread_free_buffers();
      bAllocated = mkl_mem_stat(&nBuffers);
      printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Should not be any allocated bytes left, correct?

      free(c);
   }
   mkl_free_buffers(); // Tried this to see if it would zero the buffers out
   printf("AFTER  mkl_free_buffers() %ld bytes still allocated in %d buffers\n",bAllocated,nBuffers); // Still seeing bytes in buffers
}
free(a);
free(b);
return(0);
}
[/cpp]
barragan_villanueva_
Valued Contributor I
Jason,

Thanks for your question and the test case. I can reproduce the problem on 4 threads:

4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 3149952 bytes still allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 3149952 bytes still allocated in 3 buffers

However, the test contains some errors related to global versus private variable usage under parallelization:
1) Variable c should not be global, because each thread stores its allocation in it, which leads to a memory leak.
2) Variables nBuffers and bAllocated also should not be global.
3) A call to mkl_mem_stat was missing at the end, after mkl_free_buffers, so the last printf reported stale values.

See the fixed test case below (changed lines carry //!! marks):

#include <stdio.h>
#include <stdlib.h>
#include "mkl_types.h"
#include "mkl_service.h"
#include "mkl_blas.h"
#include "omp.h"

int main()
{
    int i, m, n, k, j, g;
    double alpha, beta;
    char transa, transb;
    double *a, *b; //!! corrected
    transa = transb = 'n';
    m = n = k = 1000;
    alpha = 1;
    beta = 0;
    a = malloc(m*m*sizeof(double));
    b = malloc(m*m*sizeof(double));
    for(g = 0; g < 2; g++)
    {
        int nBuffers; //!! moved
        long bAllocated; //!! moved
        for(i = 0; i < 3; i++)
        {
            for(j = 0; j < m*m; j++) {
                a[j] = 1.0;
                b[j] = 1.0;
            }
            double *c = malloc(m*m*sizeof(double)); //!! corrected

            dgemm(&transa,&transb,&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);

            bAllocated = mkl_mem_stat(&nBuffers);
            printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Expect to see some buffers used for call to dgemm

            mkl_thread_free_buffers();
            bAllocated = mkl_mem_stat(&nBuffers);
            printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Should not be any allocated bytes left, correct?

            free(c);
        }
        mkl_free_buffers(); // Tried this to see if it would zero the buffers out
        bAllocated = mkl_mem_stat(&nBuffers); //!! added
        printf("AFTER mkl_free_buffers() %ld bytes still allocated in %d buffers\n",bAllocated,nBuffers); // Still seeing bytes in buffers
    }
    free(a);
    free(b);
    return(0);
}

After the corrections I see the following output on 4 threads:

% ./a.out
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers

barragan_villanueva_
Valued Contributor I
However, more investigation is needed to see whether mkl_free_buffers/mkl_thread_free_buffers work correctly.
I will let you know after the analysis.
barragan_villanueva_
Valued Contributor I
Well, I ran the following experiment: I added several more calls to the mkl_mem_stat function:

bAllocated = mkl_mem_stat(&nBuffers);
printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Should not be any allocated bytes left, correct?

before the free(c) line.

And I can see the following output:

AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers

It confirmed my idea that mkl_mem_stat gives global memory statistics for all threads.
So, 1 or 2 buffers are held by other threads in your test case even if the current thread has no allocated buffers.

Maybe it makes sense to introduce an mkl_mem_thread_stat function to see only the current thread's memory statistics.
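The distinction between a global and a per-thread view can be illustrated with a toy bookkeeping model. This is only a sketch of the idea and says nothing about how MKL tracks buffers internally; toy_pool and the toy_* functions are invented names, and toy_thread_stat merely mimics the mkl_mem_thread_stat function proposed above, which does not exist in MKL 10.2.

```c
#include <stddef.h>

#define TOY_MAX_THREADS 8

/* Toy bookkeeping, NOT MKL internals: one cache entry per thread slot. */
typedef struct {
    int    nbuffers[TOY_MAX_THREADS];
    size_t nbytes[TOY_MAX_THREADS];
} toy_pool;

/* Global view, in the spirit of mkl_mem_stat: sums over every slot. */
static size_t toy_global_stat(const toy_pool *p, int *nbuffers)
{
    size_t total = 0;
    int count = 0;
    for (int t = 0; t < TOY_MAX_THREADS; t++) {
        total += p->nbytes[t];
        count += p->nbuffers[t];
    }
    *nbuffers = count;
    return total;
}

/* Per-thread view, in the spirit of the proposed mkl_mem_thread_stat. */
static size_t toy_thread_stat(const toy_pool *p, int tid, int *nbuffers)
{
    *nbuffers = p->nbuffers[tid];
    return p->nbytes[tid];
}

/* Per-thread free, in the spirit of mkl_thread_free_buffers:
   only the caller's slot is released. */
static void toy_thread_free(toy_pool *p, int tid)
{
    p->nbuffers[tid] = 0;
    p->nbytes[tid] = 0;
}
```

Filling four slots with one 1049984-byte buffer each and then freeing slot 0 gives 4199936 bytes in 4 buffers before and 3149952 bytes in 3 buffers after in the global view, which are the figures reported on 4 threads earlier in this thread, while the per-thread view of slot 0 drops to zero.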
Jason_Jones1
Beginner
"It confirmed my idea that mkl_mem_stat gives global memory statistics for all threads.
So, 1 or 2 buffers are held by other threads in your test case even if the current thread has no allocated buffers."

First, thank you for taking the time to look at my problem and run the code. I am VERY grateful.

What it sounds like you are saying, is that the MKL dgemm function is creating buffers in each of the threads that it creates. And that mkl_thread_free_buffers will clear only the buffer in my main thread, from which it was called. However, this seems to be creating a problem of abandoning the buffers created by dgemm's internally created threads.

I took your fixed version of my program and made two small changes:
  1. expanded the outer g loop to 20000
  2. commented out the call to mkl_free_buffers()
This leaves just one call to mkl_thread_free_buffers, once per inner i loop. Here is a sample of the output I get after running for a while.
AFTER mkl_free_buffers() 8539008 bytes still allocated in 9 buffers
14684703104 bytes allocated in 20541 buffers
8539008 bytes allocated in 9 buffers
14686847488 bytes allocated in 20544 buffers
8539008 bytes allocated in 9 buffers
14688991872 bytes allocated in 20547 buffers
8539008 bytes allocated in 9 buffers

The program eventually chews up all my memory and crashes. I repeated this experiment many times yesterday with all of my linking choices (libmkl_gnu_thread, libmkl_intel_thread, libiomp5, libgomp, -fopenmp, etc). Each time it would run for 30-40 minutes and then crash.
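The numbers printed in this thread fit a simple linear pattern: the figure reported after mkl_thread_free_buffers stays at 8539008 bytes in 9 buffers, and the figure reported right after each dgemm call grows by 2144384 bytes and 3 buffers per call. The sketch below just encodes that arithmetic; the constants are read off the output on the gcc/MKL 10.2 setup described here, the function names are illustrative, and nothing in it is a statement about MKL internals.

```c
/* Constants read off the printed output in this thread (MKL 10.2, gcc). */
#define BASE_BYTES    8539008LL  /* reported after mkl_thread_free_buffers */
#define BASE_BUFFERS  9
#define STEP_BYTES    2144384LL  /* growth per dgemm call */
#define STEP_BUFFERS  3

/* Byte count reported right after the n-th dgemm call (n >= 1). */
static long long reported_bytes(long long n)
{
    return BASE_BYTES + n * STEP_BYTES;
}

/* Buffer count reported right after the n-th dgemm call (n >= 1). */
static long long reported_buffers(long long n)
{
    return BASE_BUFFERS + n * STEP_BUFFERS;
}

/* Inverse: how many dgemm calls a given buffer count corresponds to. */
static long long calls_for_buffers(long long nbuffers)
{
    return (nbuffers - BASE_BUFFERS) / STEP_BUFFERS;
}
```

For example, 20541 buffers corresponds to 6844 dgemm calls, and reported_bytes(6844) gives 14684703104, matching the sample output above.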

It appears that mkl_free_buffers is not safe to use with mkl blas/lapack functions. (At least in my test program).

Thanks,

Jason


barragan_villanueva_
Valued Contributor I
Jason,

Let us consider a simpler case: your test runs sequentially and only dgemm creates internal threads. A call to mkl_thread_free_buffers then does not free the buffers held by those threads, because they are not the current thread.

To free them, add a parallel section as follows (using the Intel compiler):
#pragma omp parallel
{
mkl_thread_free_buffers();
}

and after that check mkl_mem_stat.

On 4 threads, the output after adding this parallel section is:

4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
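The parallel-section trick above can be wrapped in a small helper that also reports how many threads actually ran the cleanup, which may help when comparing machines. This is a hedged sketch: run_on_each_omp_thread and demo_free are invented names, not MKL or OpenMP APIs. It also assumes the OpenMP runtime hands this region to the same worker threads that MKL used inside dgemm, which depends on the threading layer and runtime that are linked in.

```c
/* Hypothetical helper, not an MKL or OpenMP API. */
typedef void (*per_thread_fn)(void);

/* Runs fn once on every thread of an OpenMP parallel region and returns
   how many threads executed it. If OpenMP is not enabled at compile time
   the pragmas are ignored and fn runs once on the calling thread. */
static int run_on_each_omp_thread(per_thread_fn fn)
{
    int ran = 0;

    #pragma omp parallel
    {
        fn();
        #pragma omp atomic
        ran++;
    }
    return ran;
}

/* Stand-in so the sketch builds without MKL; real code would pass
   mkl_thread_free_buffers instead. */
static int demo_free_calls = 0;
static void demo_free(void)
{
    #pragma omp atomic
    demo_free_calls++;
}
```

Calling run_on_each_omp_thread(mkl_thread_free_buffers) after dgemm and printing the return value next to the mkl_mem_stat figures would show whether the cleanup reached as many threads as expected.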

Jason_Jones1
Beginner
I had my colleague on Windows run our example programs. On Windows using visual studio 2008, the examples all work as they seem to for you. In fact, mkl_thread_free_buffers appears to work as expected even if it is not inside an omp parallel section.

However, on Linux this does not seem to be the case. Running on my desktop machine with SUSE and my laptop with Ubuntu 10.4, using gcc and mkl 10.2, mkl_thread_free_buffers never works properly. I have tried putting mkl_thread_free_buffers inside a separate omp parallel section. I have tried putting it inside an omp parallel section that also contains the call to dgemm. I have even tried these various permutations with a call to mkl_set_num_threads(1) before this code. In all cases, mkl_thread_free_buffers works incorrectly.

It seems to actually free the buffer, but the internal memory manager continues counting up, i.e. 2 buffers, 3 buffers, etc. The program behaves fine until it thinks it has more than 8192 buffers. At that point it begins to rapidly consume all the memory on my machine and crashes. For example, it will consistently use 0.3 percent of my available memory (8 gigs in all cases) according to the command line utility "top". However, once it crosses the threshold of thinking it has 8192 buffers it will rapidly consume my remaining memory - according to top it will jump from 0.3 to 3 to 12 percent in seconds and continue growing like this until it crashes.

I downloaded a trial version of the Intel compiler. I was thinking of purchasing it anyway, so good timing. However, using the latest version of the Intel compiler with my 10.2 MKL libraries performs no better. I still have the same problems.

Before I post another example with output:

1) Is the current version of icc compatible with MKL version 10.2?
2) Should my mkl_thread_free_buffers statement be inside an omp parallel section containing the dgemm call, or can I just surround only the mkl_thread_free_buffers call with #pragma omp parallel { }?

Much thanks for all of your help!

Jason
barragan_villanueva_
Valued Contributor I

Jason,

It is very strange behaviour on your SUSE and Ubuntu machines :(

In my experiments I used MKL 10.3.3 on LDE EL5S.U4: Linux 2.6.18-164.el5 #1 SMP x86_64 GNU/Linux
with the Intel compiler, and the test passed.

To see what is going on, please try setting the environment variable KMP_AFFINITY=verbose when using the Intel compiler.

As to your questions:
- the latest icc should be compatible with MKL 10.2
- mkl_thread_free_buffers should be used in the following way:

dgemm(&transa,&transb,&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);

bAllocated = mkl_mem_stat(&nBuffers);
printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers);

#pragma omp parallel
{
mkl_thread_free_buffers();
}
