Hi everybody!
I have an issue using Intel MKL in a multithreaded application. I am using MKL 10.3.10 and POCO 1.6.1 with Visual Studio 2010 on Windows 7 Professional, with an Intel Core i7-3770 CPU @ 3.40 GHz and 6 GB of RAM.
Basically, a class method is called iteratively. This method launches, on a POCO thread, another method of the same class that computes a matrix-matrix multiplication via sgemm. The processing class is built as a static library; here's the header code
#include <vector>
#include "Poco\RunnableAdapter.h"
#include "Poco\Thread.h"

class Dummy
{
public:
    Dummy();
    ~Dummy();
    void start();
    void compute();

    std::vector<float> mat;
    Poco::Thread* pocoThread;
};
and the source code
#include "myLib.hpp"
#include "mkl.h"

Dummy::Dummy() : pocoThread(NULL) {}

Dummy::~Dummy()
{
    if (pocoThread != NULL)
    {
        delete pocoThread;
        pocoThread = NULL;
    }
}

void Dummy::start()
{
    mkl_set_num_threads(1);
    Poco::RunnableAdapter<Dummy> runnable(*this, &Dummy::compute);
    if (pocoThread == NULL)
        pocoThread = new Poco::Thread();
    pocoThread->setPriority(Poco::Thread::PRIO_HIGHEST);
    pocoThread->start(runnable);
    pocoThread->join();
    //compute(); // without thread
}

void Dummy::compute()
{
    int rows = 500;
    int cols = 500;
    mat.resize(rows * cols);
    // Fill the input matrix element by element.
    for (int i = 0; i < rows * cols; ++i)
        mat[i] = static_cast<float>(i);
    std::vector<float> resultr(rows * cols);
    char transpose = 'N';
    float alphar = 1.0f;
    float betar = 1.0f;
    // resultr = alphar * mat * mat + betar * resultr
    sgemm(&transpose, &transpose, &rows, &rows, &cols, &alphar,
          &mat[0], &rows, &mat[0], &rows, &betar, &resultr[0], &rows);
}
This static library is linked to a simple main:
#include "myLib\myLib.hpp"

int main()
{
    Dummy dummy;
    for (int i = 0; i < 10000; ++i)
    {
        dummy.start();
    }
    return 0;
}
Linked libraries (in order):
mkl_intel_lp64.lib mkl_core.lib mkl_intel_thread.lib libiomp5md.lib PocoFoundationmd.lib
The point is that I see a small but uninterrupted memory growth, which is definitely unexpected. When I call the same method directly, without launching it on a thread, no memory issue appears. When I replace the sgemm call with some basic non-MKL code and still run the method on a thread, I can't see any problem either. The issue seems to be related to using POCO threading and an MKL routine together.
The following images are the memory load without and with calling Dummy::compute() through POCO thread.
By the way, as you can see in the attached images there's a big difference in memory load between calling Dummy::compute() on a POCO thread or not. I don't know whether it's expected or not.
I have also tried to add calls to mkl_free_buffers and mkl_thread_free_buffers but it doesn't fix the problem.
Hope somebody can help me.
Thanks in advance
Massimiliano
Hi Massimiliano,
This issue is caused by the MKL memory manager in MKL 10.3.
It has been fixed in MKL 11.3.
To avoid this issue with MKL 10.3, you need to limit the number of threads that call MKL with a software workaround such as a thread pool.
Evgueni.
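For readers who want to see what "a thread pool" means here without pulling in POCO, the following is a minimal sketch using only the C++11 standard library. `FixedPool` and `distinct_threads_used` are hypothetical names invented for this illustration; they are not part of MKL or POCO. The point is only that the worker threads are created once and then reused, so the number of threads that ever run a task stays bounded no matter how many calls are made.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: a fixed set of worker threads, created once and
// reused for every submitted task.
class FixedPool {
public:
    explicit FixedPool(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    ~FixedPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    // Queue a task; one of the existing workers will pick it up.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_.notify_one();
    }

    // Block until every submitted task has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // in the thread's scenario, the code that calls sgemm
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::queue<std::function<void()>> tasks_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Runs `calls` tasks on a pool of `workers` threads and reports how many
// distinct threads actually executed them.
std::size_t distinct_threads_used(std::size_t workers, int calls) {
    std::mutex ids_mutex;
    std::set<std::thread::id> ids;
    FixedPool pool(workers);
    for (int i = 0; i < calls; ++i) {
        pool.submit([&] {
            std::lock_guard<std::mutex> lock(ids_mutex);
            ids.insert(std::this_thread::get_id());
        });
    }
    pool.wait();
    return ids.size();
}
```

With this sketch, `distinct_threads_used(2, 10000)` can never report more than two threads, whereas starting a fresh thread for each of the 10000 calls would involve 10000 different threads over the program's lifetime.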
Hi Evgueni,
thank you for your answer.
Actually, I allocate only one POCO thread, once in the program's lifetime, and I run Dummy::compute() on it several times after calling mkl_set_num_threads(1). Is this still the MKL 10.3 memory manager issue you are describing?
Do I have to create a thread pool sized to the number of threads set with mkl_set_num_threads()?
Regards,
Massimiliano
For each thread (ever) calling MKL, MKL allocates some memory for tracking.
Until MKL 11.3, this memory was not freed on thread exit.
Starting with MKL 11.3, this memory is freed on thread exit under Linux and OS X; that is what my previous post referred to.
However, under Windows this cleanup is not implemented even in MKL 11.3 because of limitations of the Windows API; see the documentation for RegisterWaitForSingleObject.
So, currently the only workaround for you is to limit the number of threads that ever call MKL. The size of that set of threads can be arbitrary and does not need to match MKL_NUM_THREADS.
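The difference between "one thread object" and "one thread" may be easier to see in a small model. The sketch below uses only the C++11 standard library and assumes, as the reply implies, that restarting a thread object launches a fresh OS thread each time. `library_call`, `ThreadTracking` and `g_tracking_blocks` are hypothetical stand-ins for an MKL routine and its per-thread bookkeeping; this models the behaviour described above and is not MKL's actual implementation. Note that the counter records how many tracking blocks were ever created, which is the quantity that grows when they are never freed.

```cpp
#include <atomic>
#include <thread>

// Counts how many threads have ever called the (hypothetical) library.
std::atomic<int> g_tracking_blocks{0};

// Stand-in for per-thread bookkeeping: constructed the first time a given
// thread calls into the library.
struct ThreadTracking {
    ThreadTracking() { ++g_tracking_blocks; }
    void touch() {}
};

// Hypothetical placeholder for an MKL routine such as sgemm.
void library_call() {
    thread_local ThreadTracking tracking;
    tracking.touch();
}

// One std::thread object, restarted for every call: each restart is a new
// thread, so a new tracking block is created every time.
int blocks_with_restarted_thread(int calls) {
    g_tracking_blocks = 0;
    std::thread worker;
    for (int i = 0; i < calls; ++i) {
        worker = std::thread(library_call);
        worker.join();
    }
    return g_tracking_blocks;
}

// One thread that stays alive and makes all the calls itself.
int blocks_with_persistent_thread(int calls) {
    g_tracking_blocks = 0;
    std::thread worker([calls] {
        for (int i = 0; i < calls; ++i) library_call();
    });
    worker.join();
    return g_tracking_blocks;
}
```

In this model, 100 calls through a restarted thread object create 100 tracking blocks, while 100 calls from a single long-lived thread create one. Under the stated assumption, that matches the pattern in the original code, where the same Poco::Thread object is started and joined on every iteration.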
As suggested by Evgueni, the thread pool workaround seems to work.
Here's the header:
#include <vector>
#include "Poco\RunnableAdapter.h"
#include "Poco\ThreadPool.h"

class Dummy
{
public:
    Dummy();
    ~Dummy();
    void start();
    void compute();

    std::vector<float> mat;
    Poco::ThreadPool* pocoThreadPool;
};
and the source:
#include "myLib.hpp"
#include "mkl.h"

Dummy::Dummy() : pocoThreadPool(NULL) {}

Dummy::~Dummy()
{
    if (pocoThreadPool != NULL)
    {
        delete pocoThreadPool;
        pocoThreadPool = NULL;
    }
}

void Dummy::start()
{
    mkl_set_num_threads(2);
    Poco::RunnableAdapter<Dummy> runnable(*this, &Dummy::compute);
    // The pool is created once and its threads are reused on later calls.
    if (pocoThreadPool == NULL)
        pocoThreadPool = new Poco::ThreadPool();
    pocoThreadPool->startWithPriority(Poco::Thread::PRIO_HIGHEST, runnable);
    pocoThreadPool->joinAll();
}

void Dummy::compute()
{
    int rows = 500;
    int cols = 500;
    mat.resize(rows * cols);
    // Fill the input matrix element by element.
    for (int i = 0; i < rows * cols; ++i)
        mat[i] = static_cast<float>(i);
    std::vector<float> resultr(rows * cols);
    char transpose = 'N';
    float alphar = 1.0f;
    float betar = 1.0f;
    // resultr = alphar * mat * mat + betar * resultr
    sgemm(&transpose, &transpose, &rows, &rows, &cols, &alphar,
          &mat[0], &rows, &mat[0], &rows, &betar, &resultr[0], &rows);
}
Thank you again!