I have some questions:
1) Should I be able to call MKL functions inside of pthreads with the Fast Memory Management enabled?
2) Should I need to do anything to clean up after the MKL functions before I kill my pthread?
I think the answer to number 2 is yes, you need to call mkl_thread_free_buffers() before you kill the thread. However, that function seems to not be working correctly (or maybe operator error?).
I have a simplified example program that shows a possible problem with mkl_thread_free_buffers(), but maybe the answers to these 2 questions will take care of that.
I am running on SUSE, using gcc with MKL version 10.2. However, a colleague in the office is seeing the same behavior with Visual Studio 2008 on Windows 7.
Much thanks,
Jason
- call dgemm
- call mkl_mem_stat
- print output from mkl_mem_stat
- call mkl_thread_free_buffers()
- call mkl_mem_stat
- print output from mkl_mem_stat // expect to see 0 allocated bytes, but I do not.
[cpp]
#include <stdio.h>
#include <stdlib.h>
#include "mkl_types.h"
#include "mkl_service.h"
#include "mkl_blas.h"
#include "omp.h"

int main()
{
    int i, m, n, k, j, g;
    double alpha, beta;
    char transa, transb;
    double *a, *b, *c;
    transa = transb = 'n';
    m = n = k = 1000;
    alpha = 1;
    beta = 0;
    int nBuffers;
    long bAllocated;
    a = malloc(m*m*sizeof(double));
    b = malloc(m*m*sizeof(double));
    for(g = 0; g < 2; g++)
    {
        for(i = 0; i < 3; i++)
        {
            for(j = 0; j < m*m; j++) // loop bound lost in the original post; m*m assumed from the allocation size
            {
                a[j] = 1.0;
                b[j] = 1.0;
            }
            c = malloc(m*m*sizeof(double));
            dgemm(&transa,&transb,&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);
            bAllocated = mkl_mem_stat(&nBuffers);
            printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Expect to see some buffers used for call to dgemm
            mkl_thread_free_buffers();
            bAllocated = mkl_mem_stat(&nBuffers);
            printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Should not be any allocated bytes left, correct?
            free(c);
        }
        mkl_free_buffers(); // Tried this to see if it would zero the buffers out
        printf("AFTER mkl_free_buffers() %ld bytes still allocated in %d buffers\n",bAllocated,nBuffers); // Still seeing bytes in buffers
    }
    free(a);
    free(b);
    return(0);
}
[/cpp]
Thanks for your question and for the test case. I can reproduce the problem with 4 threads:
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 3149952 bytes still allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 3149952 bytes still allocated in 3 buffers
However, the test case contains some errors in how it uses global versus private variables when parallelized:
1) variable c should not be global, because each thread stores its allocation in it, which leads to a memory leak
2) variables nBuffers and bAllocated should not be global either
3) a call to mkl_mem_stat was missing at the end of the thread
See the fixed test case below (changed lines are marked with //!!):
#include <stdio.h>
#include <stdlib.h>
#include "mkl_types.h"
#include "mkl_service.h"
#include "mkl_blas.h"
#include "omp.h"
int main()
{
    int i, m, n, k, j, g;
    double alpha, beta;
    char transa, transb;
    double *a, *b; //!! corrected
    transa = transb = 'n';
    m = n = k = 1000;
    alpha = 1;
    beta = 0;
    a = malloc(m*m*sizeof(double));
    b = malloc(m*m*sizeof(double));
    for(g = 0; g < 2; g++)
    {
        int nBuffers; //!! moved
        long bAllocated; //!! moved
        for(i = 0; i < 3; i++)
        {
            for(j = 0; j < m*m; j++) // loop bound lost in the original post; m*m assumed from the allocation size
            {
                a[j] = 1.0;
                b[j] = 1.0;
            }
            double *c = malloc(m*m*sizeof(double)); //!! corrected
            dgemm(&transa,&transb,&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);
            bAllocated = mkl_mem_stat(&nBuffers);
            printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Expect to see some buffers used for call to dgemm
            mkl_thread_free_buffers();
            bAllocated = mkl_mem_stat(&nBuffers);
            printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Should not be any allocated bytes left, correct?
            free(c);
        }
        mkl_free_buffers(); // Tried this to see if it would zero the buffers out
        bAllocated = mkl_mem_stat(&nBuffers); //!! added
        printf("AFTER mkl_free_buffers() %ld bytes still allocated in %d buffers\n",bAllocated,nBuffers); // Still seeing bytes in buffers
    }
    free(a);
    free(b);
    return(0);
}
After these corrections I see the following output with 4 threads:
% ./a.out
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
4199936 bytes allocated in 4 buffers
3149952 bytes allocated in 3 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
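One way to read these numbers: the byte counts are exact multiples of a single buffer size, so each mkl_thread_free_buffers() call appears to release exactly one buffer. The sketch below only redoes that arithmetic on the values printed above. The helper names bytes_per_buffer, buffers_released and print_buffer_summary are hypothetical and are not part of MKL.

```c
#include <assert.h>
#include <stdio.h>

/* Average size of one buffer, given the totals reported by a memory-stat call.
   Hypothetical helper for illustration; not an MKL function. */
long bytes_per_buffer(long total_bytes, int n_buffers)
{
    assert(n_buffers > 0);
    return total_bytes / n_buffers;
}

/* Number of buffers released between two consecutive reports. */
int buffers_released(int before, int after)
{
    return before - after;
}

/* Prints the per-buffer size before and after the per-thread free,
   using the values from the 4-thread output above. */
void print_buffer_summary(void)
{
    long before_bytes = 4199936L, after_bytes = 3149952L;
    int before_n = 4, after_n = 3;

    printf("per buffer before: %ld bytes\n", bytes_per_buffer(before_bytes, before_n));
    printf("per buffer after:  %ld bytes\n", bytes_per_buffer(after_bytes, after_n));
    printf("released: %d buffer(s), %ld bytes\n",
           buffers_released(before_n, after_n), before_bytes - after_bytes);
}
```

Both totals work out to the same per-buffer size, which is consistent with one buffer per thread and with only the calling thread's buffer being freed by mkl_thread_free_buffers().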
I will let you know after the analysis.
bAllocated = mkl_mem_stat(&nBuffers);
printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers); // Should not be any allocated bytes left, correct?
before the free(c) line.
And I see the following output:
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
2099968 bytes allocated in 2 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
1049984 bytes allocated in 1 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
This confirmed my idea that mkl_mem_stat reports global memory statistics for all threads.
So, in your test case, 1 or 2 buffers are held by other threads even if the current thread has no allocated buffers.
Maybe it makes sense to introduce an mkl_mem_thread_stat function to see only the current thread's memory statistics.
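To make the distinction concrete, here is a toy model, not MKL's actual implementation, of a pool that keeps one buffer per thread but reports statistics globally. Threads are modeled as integer slots so that the sketch runs without a threading library. All names (toy_pool, toy_alloc, toy_thread_free, toy_free_all, toy_mem_stat) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_THREADS 4

/* One cached buffer per simulated thread slot.
   Must be zero-initialized before use, e.g. toy_pool p = {{0}, {0}}; */
typedef struct {
    void  *buf[TOY_MAX_THREADS];
    size_t size[TOY_MAX_THREADS];
} toy_pool;

/* Give thread slot `tid` a cached buffer of `bytes` bytes, replacing any old one.
   Returns 0 on success, -1 if the allocation fails. */
int toy_alloc(toy_pool *p, int tid, size_t bytes)
{
    assert(tid >= 0 && tid < TOY_MAX_THREADS);
    free(p->buf[tid]);
    p->buf[tid] = malloc(bytes);
    p->size[tid] = p->buf[tid] ? bytes : 0;
    return p->buf[tid] ? 0 : -1;
}

/* Per-thread free: releases only the buffer owned by slot `tid`. */
void toy_thread_free(toy_pool *p, int tid)
{
    assert(tid >= 0 && tid < TOY_MAX_THREADS);
    free(p->buf[tid]);
    p->buf[tid] = NULL;
    p->size[tid] = 0;
}

/* Global free: releases the buffers of every slot. */
void toy_free_all(toy_pool *p)
{
    for (int t = 0; t < TOY_MAX_THREADS; t++)
        toy_thread_free(p, t);
}

/* Global statistics: total bytes and buffer count across ALL slots. */
size_t toy_mem_stat(const toy_pool *p, int *n_buffers)
{
    size_t total = 0;
    *n_buffers = 0;
    for (int t = 0; t < TOY_MAX_THREADS; t++) {
        if (p->buf[t]) {
            total += p->size[t];
            (*n_buffers)++;
        }
    }
    return total;
}
```

If each of the four slots holds one buffer, a per-thread free from slot 0 leaves three buffers visible in the global statistics, and only the global free brings the count to zero. That mirrors the pattern in the outputs above, under the assumption that the library keeps one buffer per thread.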
"So, 1 or 2 buffers are taken in other threads in your testcase even if current thread has no allocated buffers."
First, thank you for taking the time to look at my problem and run the code. I am VERY grateful.
It sounds like you are saying that the MKL dgemm function creates buffers in each of the threads that it creates, and that mkl_thread_free_buffers clears only the buffer in my main thread, from which it was called. However, this seems to create a problem: the buffers created by dgemm's internal threads are abandoned.
I took your fixed version of my program and made two small changes:
- expanded the outer g loop to 20000
- commented out the call to mkl_free_buffers()
AFTER mkl_free_buffers() 8539008 bytes still allocated in 9 buffers
14684703104 bytes allocated in 20541 buffers
8539008 bytes allocated in 9 buffers
14686847488 bytes allocated in 20544 buffers
8539008 bytes allocated in 9 buffers
14688991872 bytes allocated in 20547 buffers
8539008 bytes allocated in 9 buffers
The program eventually chews up all my memory and crashes. I repeated this experiment many times yesterday with all of my linking choices (libmkl_gnu_thread, libmkl_intel_thread, libiomp5, libgomp, -fopenmp, etc). Each time it would run for 30-40 minutes and then crash.
It appears that mkl_free_buffers is not safe to use with mkl blas/lapack functions. (At least in my test program).
Thanks,
Jason
Let us consider the simpler case in which your test runs sequentially and only dgemm creates internal threads. In that case, a call to mkl_thread_free_buffers will not free the buffers held by those internal threads, because they are not the current thread.
To free them, add a parallel section like the following (using the Intel compiler):
#pragma omp parallel
{
mkl_thread_free_buffers();
}
and after that check mkl_mem_stat.
With 4 threads, the output after adding this parallel section is:
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
4199936 bytes allocated in 4 buffers
0 bytes allocated in 0 buffers
AFTER mkl_free_buffers() 0 bytes still allocated in 0 buffers
However, on Linux this does not seem to be the case. Running on my desktop machine with SUSE and my laptop with Ubuntu 10.4, using gcc and mkl 10.2, mkl_thread_free_buffers never works properly. I have tried putting mkl_thread_free_buffers inside a separate omp parallel section. I have tried putting it inside an omp parallel section that also contains the call to dgemm. I have even tried these various permutations with a call to mkl_set_num_threads(1) before this code. In all cases, mkl_thread_free_buffers works incorrectly.
It seems to actually free the buffer, but the internal memory manager keeps counting up (2 buffers, 3 buffers, and so on). The program behaves fine until the memory manager thinks it has more than 8192 buffers. At that point it begins to rapidly consume all the memory on my machine and crashes. For example, it will consistently use 0.3 percent of my available memory (8 gigs in all cases) according to the command line utility "top". However, once it crosses the threshold of 8192 buffers, it rapidly consumes my remaining memory: according to top it jumps from 0.3 to 3 to 12 percent in seconds and keeps growing like this until it crashes.
I downloaded a trial version of the Intel compiler. I was thinking of purchasing it anyway, so good timing. However, using the latest version of the Intel compiler with my 10.2 MKL libraries performs no better. I still have the same problems.
Before I post another example with output:
1) Is the current version of icc compatible with MKL version 10.2?
2) Should my mkl_thread_free_buffers statement be inside an omp parallel section containing the dgemm call, or can I just surround only the mkl_thread_free_buffers call with #pragma omp parallel { }?
Much thanks for all of your help!
Jason
Jason,
It is very strange behaviour on your SUSE and Ubuntu machines :(
In my experiments I used MKL 10.3.3 on LDE EL5S.U4: Linux 2.6.18-164.el5 #1 SMP x86_64 GNU/Linux
with the Intel compiler, and the test passed.
To see what is going on, please try setting the environment variable KMP_AFFINITY=verbose when using the Intel compiler.
As to your questions:
- the latest icc should be compatible with MKL 10.2
- mkl_thread_free_buffers should be used in the following way:
dgemm(&transa,&transb,&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);
bAllocated = mkl_mem_stat(&nBuffers);
printf("%ld bytes allocated in %d buffers\n",bAllocated,nBuffers);
#pragma omp parallel
{
mkl_thread_free_buffers();
}