<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Evgueni, in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024920#M19847</link>
    <description>&lt;P&gt;Hi Evgueni,&lt;/P&gt;

&lt;P&gt;thank you for your answer.&lt;/P&gt;

&lt;P&gt;Actually, I allocate only one POCO thread for the whole program lifetime and run &lt;CODE class="plain"&gt;Dummy::compute()&lt;/CODE&gt; on it several times, calling &lt;CODE class="plain"&gt;mkl_set_num_threads(1)&lt;/CODE&gt; each time. Does the MKL 10.3 memory manager issue you describe still apply in this case?&lt;/P&gt;

&lt;P&gt;Do I have to create a thread pool whose size matches the number of threads set with &lt;CODE class="plain"&gt;mkl_set_num_threads()&lt;/CODE&gt;?&lt;/P&gt;

&lt;P&gt;Regards,&lt;/P&gt;

&lt;P&gt;Massimiliano&lt;/P&gt;</description>
    <pubDate>Tue, 15 Sep 2015 07:29:00 GMT</pubDate>
    <dc:creator>Massimiliano_B_1</dc:creator>
    <dc:date>2015-09-15T07:29:00Z</dc:date>
    <item>
      <title>Memory issue on multithreaded application using threaded Intel MKL</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024918#M19845</link>
      <description>&lt;P&gt;Hi everybody!&lt;/P&gt;

&lt;P&gt;I have an issue using Intel MKL in a multithreaded application. I am using MKL 10.3.10 and POCO 1.6.1 with Visual Studio 2010 on Windows 7 Professional, Intel Core i7-3770 CPU @ 3.40 GHz, 6 GB RAM.&lt;/P&gt;

&lt;P&gt;Basically, a class method is called iteratively; this method launches another method of the same class on a POCO thread to compute a matrix-matrix multiplication via sgemm. This processing class is built as a static library. Here is the header code:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;vector&amp;gt;
#include "Poco/RunnableAdapter.h"
#include "Poco/Thread.h"

class Dummy
{
public:
	Dummy();
	~Dummy();
	void start();
	void compute();

	std::vector&amp;lt;float&amp;gt; mat;
	Poco::Thread* pocoThread;
};&lt;/PRE&gt;

&lt;P&gt;and the source code&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include "myLib.hpp"
#include "mkl.h"

Dummy::Dummy():pocoThread(NULL) {}

Dummy::~Dummy()
{
	if(pocoThread != NULL)
	{
		delete pocoThread;
		pocoThread = NULL;
	}
}

void Dummy::start()
{
	mkl_set_num_threads(1);
	Poco::RunnableAdapter&amp;lt;Dummy&amp;gt; runnable(*this,&amp;amp;Dummy::compute);
	if(pocoThread==NULL) pocoThread = new Poco::Thread();
	pocoThread-&amp;gt;setPriority(Poco::Thread::PRIO_HIGHEST);
	pocoThread-&amp;gt;start(runnable);
	pocoThread-&amp;gt;join();
	//compute();		// without thread
}

void Dummy::compute()
{
	int rows = 500;
	int cols = 500;
	mat.resize(rows*cols);
	for( int i = 0; i &amp;lt; rows*cols; ++i) mat[i] = i;
	std::vector&amp;lt;float&amp;gt; resultr(rows*cols);

	char transpose = 'N';
	float alphar = 1.0f;
	float betar = 1.0f;
	sgemm(&amp;amp;transpose,
			&amp;amp;transpose,
			&amp;amp;rows,
			&amp;amp;rows,
			&amp;amp;cols,
			&amp;amp;alphar,
			&amp;amp;mat[0],
			&amp;amp;rows,
			&amp;amp;mat[0],
			&amp;amp;rows,
			&amp;amp;betar,
			&amp;amp;resultr[0],
			&amp;amp;rows);
}&lt;/PRE&gt;

&lt;P&gt;This static library is linked to a simple main:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include "myLib/myLib.hpp"

int main()
{
	Dummy dummy;
	for( int i = 0; i &amp;lt; 10000; ++i)
	{
		dummy.start();
	}
	return 0;
}
&lt;/PRE&gt;

&lt;P&gt;Linked libraries (in order):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;mkl_intel_lp64.lib
mkl_core.lib
mkl_intel_thread.lib
libiomp5md.lib
PocoFoundationmd.lib&lt;/PRE&gt;

&lt;P&gt;The point is that I see a &lt;STRONG&gt;small but uninterrupted memory growth&lt;/STRONG&gt;. This behaviour is definitely unexpected. When I call the same method directly, without launching it on a thread, memory does not grow. When I replace the sgemm call with some basic non-MKL code and still run the method on a thread, I do not see any problem either. &lt;STRONG&gt;It seems the issue is related to the combined use of POCO threading and MKL routines&lt;/STRONG&gt;.&lt;/P&gt;

&lt;P&gt;The following images are the memory load without and with calling Dummy::compute() through POCO thread.&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="no_thread.jpg"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/7986i1703827B61A7D1BC/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="no_thread.jpg" alt="no_thread.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="thread.jpg"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/7987i99B01B5803B90869/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="thread.jpg" alt="thread.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;By the way, as you can see in the attached images there's a big difference in memory load between calling Dummy::compute() on a POCO thread or not. I don't know whether it's expected or not.&lt;/P&gt;

&lt;P&gt;I have also tried to add calls to mkl_free_buffers and mkl_thread_free_buffers but it doesn't fix the problem.&lt;/P&gt;

&lt;P&gt;Hope somebody can help me.&lt;BR /&gt;
	Thanks in advance&lt;/P&gt;

&lt;P&gt;Massimiliano&lt;/P&gt;</description>
      <pubDate>Mon, 14 Sep 2015 10:25:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024918#M19845</guid>
      <dc:creator>Massimiliano_B_1</dc:creator>
      <dc:date>2015-09-14T10:25:18Z</dc:date>
    </item>
    <item>
      <title>Hi Massimiliano,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024919#M19846</link>
      <description>&lt;P&gt;Hi Massimiliano,&lt;/P&gt;

&lt;P&gt;This issue is caused by the MKL memory manager in MKL 10.3.&lt;/P&gt;

&lt;P&gt;It has been fixed in MKL 11.3.&lt;/P&gt;

&lt;P&gt;To avoid this issue with MKL 10.3, you need to limit the number of threads that call MKL, using a software workaround such as a thread pool.&lt;/P&gt;

&lt;P&gt;Evgueni.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Sep 2015 03:55:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024919#M19846</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2015-09-15T03:55:20Z</dc:date>
    </item>
    <item>
      <title>Hi Evgueni,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024920#M19847</link>
      <description>&lt;P&gt;Hi Evgueni,&lt;/P&gt;

&lt;P&gt;thank you for your answer.&lt;/P&gt;

&lt;P&gt;Actually, I allocate only one POCO thread for the whole program lifetime and run &lt;CODE class="plain"&gt;Dummy::compute()&lt;/CODE&gt; on it several times, calling &lt;CODE class="plain"&gt;mkl_set_num_threads(1)&lt;/CODE&gt; each time. Does the MKL 10.3 memory manager issue you describe still apply in this case?&lt;/P&gt;

&lt;P&gt;Do I have to create a thread pool whose size matches the number of threads set with &lt;CODE class="plain"&gt;mkl_set_num_threads()&lt;/CODE&gt;?&lt;/P&gt;

&lt;P&gt;Regards,&lt;/P&gt;

&lt;P&gt;Massimiliano&lt;/P&gt;</description>
      <pubDate>Tue, 15 Sep 2015 07:29:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024920#M19847</guid>
      <dc:creator>Massimiliano_B_1</dc:creator>
      <dc:date>2015-09-15T07:29:00Z</dc:date>
    </item>
    <item>
      <title>For each thread (ever)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024921#M19848</link>
      <description>&lt;P&gt;For each thread (ever) calling MKL, MKL allocates some memory for tracking.&lt;/P&gt;

&lt;P&gt;Until MKL 11.3, this memory was not freed on thread exit.&lt;/P&gt;

&lt;P&gt;Starting with MKL 11.3, this memory is freed on thread exit under Linux and OS X -- that is what I meant in my previous post.&lt;/P&gt;

&lt;P&gt;However, under Windows this cleanup is not implemented even in MKL 11.3 because of limitations of the Windows API -- see the documentation for RegisterWaitForSingleObject.&lt;/P&gt;

&lt;P&gt;So, currently the only workaround for you is to limit the number of threads that ever call MKL -- the pool size can be arbitrary and does not need to correlate with MKL_NUM_THREADS.&lt;/P&gt;
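
&lt;P&gt;The workaround above can be sketched as follows. This is only a minimal illustration, assuming a C++11 compiler (&lt;CODE class="plain"&gt;std::thread&lt;/CODE&gt; is not available in Visual Studio 2010); &lt;CODE class="plain"&gt;SingleWorker&lt;/CODE&gt; is a hypothetical helper, not part of MKL or POCO. One long-lived worker thread runs every job, so only one thread ever calls MKL. Reusing a single &lt;CODE class="plain"&gt;Poco::Thread&lt;/CODE&gt; object is probably not enough on its own, because each &lt;CODE class="plain"&gt;start()&lt;/CODE&gt; call is expected to launch a new OS thread.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;condition_variable&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;mutex&amp;gt;
#include &amp;lt;queue&amp;gt;
#include &amp;lt;thread&amp;gt;
#include &amp;lt;utility&amp;gt;

// Hypothetical helper: one long-lived worker thread executes every job,
// so the number of threads that ever call MKL stays at one.
class SingleWorker
{
public:
	SingleWorker() : pending(0), done(false), worker(&amp;amp;SingleWorker::loop, this) {}

	~SingleWorker()
	{
		{
			std::lock_guard&amp;lt;std::mutex&amp;gt; lock(mtx);
			done = true;
		}
		cv.notify_all();
		worker.join();
	}

	// Queues the job and blocks until all queued jobs have finished.
	void run(std::function&amp;lt;void()&amp;gt; job)
	{
		std::unique_lock&amp;lt;std::mutex&amp;gt; lock(mtx);
		jobs.push(std::move(job));
		++pending;
		cv.notify_all();
		cv.wait(lock, [this] { return pending == 0; });
	}

private:
	void loop()
	{
		std::unique_lock&amp;lt;std::mutex&amp;gt; lock(mtx);
		for (;;)
		{
			cv.wait(lock, [this] { return done || !jobs.empty(); });
			if (jobs.empty()) return;	// shutdown requested and nothing left to run
			std::function&amp;lt;void()&amp;gt; job = std::move(jobs.front());
			jobs.pop();
			lock.unlock();
			job();				// the MKL call (e.g. sgemm) would go here
			lock.lock();
			--pending;
			cv.notify_all();
		}
	}

	std::mutex mtx;
	std::condition_variable cv;
	std::queue&amp;lt;std::function&amp;lt;void()&amp;gt; &amp;gt; jobs;
	int pending;
	bool done;
	std::thread worker;	// declared last so it starts after the other members exist
};

int main()
{
	SingleWorker worker;
	for (int i = 0; i &amp;lt; 10000; ++i)
	{
		worker.run([] { /* call sgemm here */ });
	}
	return 0;
}&lt;/PRE&gt;

&lt;P&gt;With this layout the same OS thread serves all 10000 iterations, so the per-thread tracking memory should be allocated only once. A fixed-size pool with more than one worker follows the same principle.&lt;/P&gt;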

</description>
      <pubDate>Tue, 15 Sep 2015 08:10:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024921#M19848</guid>
      <dc:creator>Evgueni_P_Intel</dc:creator>
      <dc:date>2015-09-15T08:10:16Z</dc:date>
    </item>
    <item>
      <title>As suggested by Evgueni the</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024922#M19849</link>
      <description>&lt;P&gt;As suggested by Evgueni, the thread pool workaround seems to work.&lt;/P&gt;

&lt;P&gt;Here's the header:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;vector&amp;gt;
#include "Poco/RunnableAdapter.h"
#include "Poco/ThreadPool.h"

class Dummy
{
public:
	Dummy();
	~Dummy();
	void start();
	void compute();

	std::vector&amp;lt;float&amp;gt; mat;
	Poco::ThreadPool* pocoThreadPool;
};&lt;/PRE&gt;

&lt;P&gt;and the source:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include "myLib.hpp"
#include "mkl.h"

Dummy::Dummy():pocoThreadPool(NULL) {}

Dummy::~Dummy()
{
	if(pocoThreadPool != NULL)
	{
		delete pocoThreadPool;
		pocoThreadPool = NULL;
	}
}

void Dummy::start()
{
	mkl_set_num_threads(2);
	Poco::RunnableAdapter&amp;lt;Dummy&amp;gt; runnable(*this,&amp;amp;Dummy::compute);
	if(pocoThreadPool==NULL) pocoThreadPool = new Poco::ThreadPool();
	pocoThreadPool-&amp;gt;startWithPriority(Poco::Thread::PRIO_HIGHEST,runnable);
	pocoThreadPool-&amp;gt;joinAll();
}

void Dummy::compute()
{
	int rows = 500;
	int cols = 500;
	mat.resize(rows*cols);
	for( int i = 0; i &amp;lt; rows*cols; ++i) mat[i] = i;
	std::vector&amp;lt;float&amp;gt; resultr(rows*cols);

	char transpose = 'N';
	float alphar = 1.0f;
	float betar = 1.0f;
	sgemm(&amp;amp;transpose,
			&amp;amp;transpose,
			&amp;amp;rows,
			&amp;amp;rows,
			&amp;amp;cols,
			&amp;amp;alphar,
			&amp;amp;mat[0],
			&amp;amp;rows,
			&amp;amp;mat[0],
			&amp;amp;rows,
			&amp;amp;betar,
			&amp;amp;resultr[0],
			&amp;amp;rows);
}&lt;/PRE&gt;

&lt;P&gt;Thank you again!&lt;/P&gt;</description>
      <pubDate>Tue, 15 Sep 2015 09:27:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Memory-issue-on-multithreaded-application-using-threaded-Intel/m-p/1024922#M19849</guid>
      <dc:creator>Massimiliano_B_1</dc:creator>
      <dc:date>2015-09-15T09:27:52Z</dc:date>
    </item>
  </channel>
</rss>

