Intel® oneAPI Math Kernel Library

Memory issue on multithreaded application using threaded Intel MKL

Massimiliano_B_1 — Mon, 14 Sep 2015 10:25:18 GMT

Hi everybody!

I have an issue using Intel MKL in a multithreaded application. I am using MKL 10.3.10 and POCO 1.6.1 with Visual Studio 2010 on Windows 7 Professional, Intel Core i7-3770 CPU @ 3.40 GHz, 6 GB of RAM.

Basically, a class method is called iteratively. This method launches, on a POCO thread, another method of the same class that computes a matrix-matrix multiplication via sgemm. The processing class is built as a static library; here's the header code

#include <vector>
#include "Poco\RunnableAdapter.h"
#include "Poco\Thread.h"

class Dummy
{
public:
	Dummy();
	~Dummy();
	void start();
	void compute();

	std::vector<float> mat;
	Poco::Thread* pocoThread;
};

and the source code

#include "myLib.hpp"
#include "mkl.h"

Dummy::Dummy():pocoThread(NULL) {}

Dummy::~Dummy()
{
	if(pocoThread != NULL)
	{
		delete pocoThread;
		pocoThread = NULL;
	}
}

void Dummy::start()
{
	mkl_set_num_threads(1);
	Poco::RunnableAdapter<Dummy> runnable(*this,&Dummy::compute);
	if(pocoThread==NULL) pocoThread = new Poco::Thread();
	pocoThread->setPriority(Poco::Thread::PRIO_HIGHEST);
	pocoThread->start(runnable);
	pocoThread->join();
	//compute();		// without thread
}

void Dummy::compute()
{
	int rows = 500;
	int cols = 500;
	mat.resize(rows*cols);
	for( int i = 0; i < rows*cols; ++i) mat[i] = i;
	std::vector<float> resultr(rows*cols);

	char transpose = 'N';
	float alphar = 1.0f;
	float betar = 1.0f;
	sgemm(&transpose,
			&transpose,
			&rows,
			&rows,
			&cols,
			&alphar,
			&mat[0],
			&rows,
			&mat[0],
			&rows,
			&betar,
			&resultr[0],
			&rows);
}

This static library is linked to a simple main:

#include "myLib\myLib.hpp"

int main()
{
	Dummy dummy;
	for( int i = 0; i < 10000; ++i)
	{
		dummy.start();
	}
	return 0;
}

Linked libraries (in order):

mkl_intel_lp64.lib
mkl_core.lib
mkl_intel_thread.lib
libiomp5md.lib
PocoFoundationmd.lib

The point is that I see a small but uninterrupted memory growth, which is definitely unexpected. If I call the same method without launching it on a thread, no memory issue shows up. If I replace the sgemm call with some basic non-MKL code and still run the method on a thread, I don't see any problem either. So the issue seems to be related to using POCO threading and an MKL routine together.

The following images are the memory load without and with calling Dummy::compute() through POCO thread.

By the way, as you can see in the attached images, there is a big difference in memory load between calling Dummy::compute() on a POCO thread and calling it directly. I don't know whether that is expected.

I have also tried adding calls to mkl_free_buffers and mkl_thread_free_buffers, but that doesn't fix the problem.

Hope somebody can help me.
Thanks in advance

Massimiliano

Evgueni_P_Intel — Tue, 15 Sep 2015 03:55:20 GMT

Hi Massimiliano,

This issue is caused by the MKL memory manager in MKL 10.3.

It has been fixed in MKL 11.3.

To avoid this issue with MKL 10.3, you need to limit the number of threads that call MKL, for example with a software workaround such as a thread pool.

Evgueni.

Massimiliano_B_1 — Tue, 15 Sep 2015 07:29:00 GMT

Hi Evgueni,

thank you for your answer.

Actually, I allocate only one POCO thread for the lifetime of the program and run Dummy::compute() on it many times, after calling mkl_set_num_threads(1). Does the MKL 10.3 memory manager issue you describe still apply in this case?

Do I have to create a thread pool whose size matches the number of threads set with mkl_set_num_threads()?

Regards,

Massimiliano

Evgueni_P_Intel — Tue, 15 Sep 2015 08:10:16 GMT

For each thread (ever) calling MKL, MKL allocates some memory for tracking.

Until MKL 11.3, this memory was not freed on thread exit.

Starting with MKL 11.3, this memory is freed on thread exit under Linux and OS X; that was the point of my previous post.

Under Windows, however, this cleanup is not implemented even in MKL 11.3 because of limitations of the Windows API (see the documentation for RegisterWaitForSingleObject).

So, currently the only workaround for you is to limit the number of threads that ever call MKL -- the size can be arbitrary and does not need to correlate with MKL_NUM_THREADS.
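
A minimal sketch of that workaround in standard C++11, independent of POCO and MKL: one long-lived worker thread runs every job, so a library that keeps per-thread bookkeeping only ever sees a single thread. The names SingleWorker and count_threads_used are hypothetical and used only for illustration, and the job body is a stand-in for the sgemm call. The sketch rests on the assumption that the original code starts a fresh OS thread on each Poco::Thread::start() call, which would explain the growth even though only one Poco::Thread object exists.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <utility>

// Hypothetical helper: a single persistent worker thread that executes
// submitted jobs one at a time. The OS thread is created once, in the
// constructor, and reused for every job.
class SingleWorker {
public:
    SingleWorker() : stop_(false), pending_(0), thread_(&SingleWorker::loop, this) {}

    ~SingleWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_all();
        thread_.join();
    }

    // Queue a job and block until it has finished (similar to start() + join()).
    void run(std::function<void()> job) {
        std::unique_lock<std::mutex> lock(mutex_);
        jobs_.push(std::move(job));
        ++pending_;
        wake_.notify_one();
        done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void loop() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
            if (jobs_.empty()) return;  // stop requested and nothing left to do
            std::function<void()> job = std::move(jobs_.front());
            jobs_.pop();
            lock.unlock();              // run the job without holding the lock
            job();
            lock.lock();
            --pending_;
            done_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable done_;
    std::queue<std::function<void()> > jobs_;
    bool stop_;
    std::size_t pending_;
    std::thread thread_;  // declared last so the other members exist before it starts
};

// Runs `iterations` jobs and reports how many distinct threads executed them.
// In a real program the lambda body would be the sgemm call instead.
std::size_t count_threads_used(int iterations) {
    SingleWorker worker;
    std::set<std::thread::id> ids;
    for (int i = 0; i < iterations; ++i) {
        worker.run([&ids] { ids.insert(std::this_thread::get_id()); });
    }
    return ids.size();
}
```

Calling count_threads_used(10000) mirrors the loop in main() above: every iteration is served by the same worker thread, so the number of threads that ever reach the library stays fixed regardless of the iteration count.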

Massimiliano_B_1 — Tue, 15 Sep 2015 09:27:52 GMT

As suggested by Evgueni, the thread pool workaround seems to work.

Here's the header:

#include <vector>
#include "Poco\RunnableAdapter.h"
#include "Poco\ThreadPool.h"

class Dummy
{
public:
	Dummy();
	~Dummy();
	void start();
	void compute();

	std::vector<float> mat;
	Poco::ThreadPool* pocoThreadPool;
};

and the source:

#include "myLib.hpp"
#include "mkl.h"

Dummy::Dummy():pocoThreadPool(NULL) {}

Dummy::~Dummy()
{
	if(pocoThreadPool != NULL)
	{
		delete pocoThreadPool;
		pocoThreadPool = NULL;
	}
}

void Dummy::start()
{
	mkl_set_num_threads(2);
	Poco::RunnableAdapter<Dummy> runnable(*this,&Dummy::compute);
	if(pocoThreadPool==NULL) pocoThreadPool = new Poco::ThreadPool();
	pocoThreadPool->startWithPriority(Poco::Thread::PRIO_HIGHEST,runnable);
	pocoThreadPool->joinAll();
}

void Dummy::compute()
{
	int rows = 500;
	int cols = 500;
	mat.resize(rows*cols);
	for( int i = 0; i < rows*cols; ++i) mat[i] = i;
	std::vector<float> resultr(rows*cols);

	char transpose = 'N';
	float alphar = 1.0f;
	float betar = 1.0f;
	sgemm(&transpose,
			&transpose,
			&rows,
			&rows,
			&cols,
			&alphar,
			&mat[0],
			&rows,
			&mat[0],
			&rows,
			&betar,
			&resultr[0],
			&rows);
}

Thank you again!