<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Sergey, in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997847#M23003</link>
    <description>&lt;P&gt;Sergey,&lt;/P&gt;

&lt;P&gt;I ran additional tests to check your suggestion.&lt;/P&gt;

&lt;P&gt;For the read-only function ippiSum_32f_C1R I got the following results:&lt;/P&gt;

&lt;P&gt;Intel Core i7-3770 CPU @ 3.40GHz&lt;BR /&gt;
	threads=1 time=596&lt;BR /&gt;
	threads=2 time=451&lt;BR /&gt;
	threads=3 time=429&lt;BR /&gt;
	threads=4 time=417&lt;/P&gt;

&lt;P&gt;Intel Xeon CPU E5-1660 0 @ 3.30GHz&lt;BR /&gt;
	threads=1 time=760&lt;BR /&gt;
	threads=2 time=440&lt;BR /&gt;
	threads=3 time=320&lt;BR /&gt;
	threads=4 time=270&lt;BR /&gt;
	threads=5 time=260&lt;BR /&gt;
	threads=6 time=250&lt;/P&gt;

&lt;P&gt;For the write-only function ippiSet_32f_C1R I got:&lt;/P&gt;

&lt;P&gt;Intel Core i7-3770 CPU @ 3.40GHz&lt;BR /&gt;
	threads=1 time=539&lt;BR /&gt;
	threads=2 time=417&lt;BR /&gt;
	threads=3 time=419&lt;BR /&gt;
	threads=4 time=415&lt;/P&gt;

&lt;P&gt;Intel Xeon CPU E5-1660 0 @ 3.30GHz&lt;BR /&gt;
	threads=1 time=680&lt;BR /&gt;
	threads=2 time=360&lt;BR /&gt;
	threads=3 time=270&lt;BR /&gt;
	threads=4 time=230&lt;BR /&gt;
	threads=5 time=240&lt;BR /&gt;
	threads=6 time=240&lt;/P&gt;

&lt;P&gt;All results, including the speed-up ratio (single-threaded time / multi-threaded time), are shown in the following image:&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="performance.png"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/7665i5BE596B3ED8A8D59/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="performance.png" alt="performance.png" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;Now I understand that the bottleneck of the whole process is access to data stored in RAM. Am I right?&lt;BR /&gt;
	If so, is there any known technique that could be used to work around the memory transfer limit and improve parallel scaling?&lt;/P&gt;

&lt;P&gt;Thank you in advance for your comments.&lt;/P&gt;

&lt;P&gt;Greetings Krzysztof Piotrowski.&lt;/P&gt;</description>
    <pubDate>Mon, 06 Jul 2015 11:56:21 GMT</pubDate>
    <dc:creator>krzysztofpiotrowski</dc:creator>
    <dc:date>2015-07-06T11:56:21Z</dc:date>
    <item>
      <title>Single threaded IPP and external parallelization</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997841#M22997</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I am implementing an application which uses single threaded IPP and external parallelization via MS OpenMP.&lt;/P&gt;

&lt;P&gt;Below you can find a piece of the source code which I used for some tests (the full code is attached to the post).&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;for (auto t = 1; t &amp;lt;= maxThreads; t++)
{
	auto start = clock();
	#pragma omp parallel default(shared) num_threads(t)
	{
		auto id = omp_get_thread_num();
		auto buffer = buffers[id];
		auto step = steps[id];

		#pragma omp for schedule(dynamic, 1)
		for (auto i = 0; i &amp;lt; count; i++)
			ippiDivC_32f_C1IR(1.0f, buffer, step, roi);
	}
	auto stop = clock();
	cout &amp;lt;&amp;lt; "threads=" &amp;lt;&amp;lt; t &amp;lt;&amp;lt; " time=" &amp;lt;&amp;lt; (stop - start) &amp;lt;&amp;lt; endl;
}
&lt;/PRE&gt;

&lt;P&gt;The application is very simple. It just measures how the execution time of an IPP computation depends on the number of threads used for the processing.&lt;/P&gt;

&lt;P&gt;For width=5000, height=5000 and count=100 I obtained the following results:&lt;/P&gt;

&lt;P&gt;Intel Core i7-3770 CPU @ 3.40GHz&lt;BR /&gt;
	version=7.0 build 205.58 name=ippie9_l.lib&lt;BR /&gt;
	threads=1 time=982&lt;BR /&gt;
	threads=2 time=947&lt;BR /&gt;
	threads=3 time=945&lt;BR /&gt;
	threads=4 time=957&lt;/P&gt;

&lt;P&gt;Intel Xeon CPU E5-1660 0 @ 3.30GHz&lt;BR /&gt;
	version=7.0 build 205.58 name=ippie9_l.lib&lt;BR /&gt;
	threads=1 time=988&lt;BR /&gt;
	threads=2 time=698&lt;BR /&gt;
	threads=3 time=679&lt;BR /&gt;
	threads=4 time=678&lt;BR /&gt;
	threads=5 time=678&lt;BR /&gt;
	threads=6 time=699&lt;/P&gt;

&lt;P&gt;As you can see, it is very difficult to get any significant speed-up from multiple threads. What is the reason for this behavior? Could you please tell me what the bottleneck of the described solution is?&lt;/P&gt;

&lt;P&gt;Thank you in advance for your help.&lt;/P&gt;

&lt;P&gt;Krzysztof Piotrowski.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Jul 2015 08:48:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997841#M22997</guid>
      <dc:creator>krzysztofpiotrowski</dc:creator>
      <dc:date>2015-07-03T08:48:48Z</dc:date>
    </item>
    <item>
      <title>Hi Krzysztof Piotrowski,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997842#M22998</link>
      <description>&lt;P&gt;Hi Krzysztof Piotrowski,&lt;/P&gt;

&lt;P&gt;It seems the main work (&lt;CODE class="cpp plain"&gt;ippiDivC_32f_C1IR(1.0f, buffer, step, roi);&lt;/CODE&gt;) is not distributed across the threads but duplicated, so the amount of work grows with the number of threads.&lt;/P&gt;

&lt;P&gt;Please refer to the section 'Resizing a Tiled Image with One Prior Initialization' on this page (&lt;A href="https://software.intel.com/en-us/node/504353#TILING_IMAGE_WITH_ONE_INIT"&gt;https://software.intel.com/en-us/node/504353#TILING_IMAGE_WITH_ONE_INIT&lt;/A&gt;), which explains the idea of 'parallelization in one direction'. The example shows how to multithread a resize operation using OpenMP with parallelization in the Y direction.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;Thank you.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jul 2015 05:41:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997842#M22998</guid>
      <dc:creator>Jonghak_K_Intel</dc:creator>
      <dc:date>2015-07-06T05:41:19Z</dc:date>
    </item>
    <item>
      <title>Hi Jon J K.</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997843#M22999</link>
      <description>&lt;P&gt;Hi Jon J K.&lt;/P&gt;

&lt;P&gt;Thank you very much for answering my question.&lt;/P&gt;

&lt;P&gt;I understand the tiling approach in image processing, but that is not the case I described. I am not interested in parallelization at such a deep level, where multiple threads cooperate on a single image. In my example the parallelization is done one level higher: multiple threads work on different data and process them independently.&lt;/P&gt;

&lt;P&gt;Maybe my example was not clear enough, but I wanted to keep it as simple as possible. Please imagine that the number of buffers equals count and that they are indexed in the loop by i (not by thread id). Also, the ippiDivC operation may be replaced by any sequence of IPP functions that work on independent buffers.&lt;/P&gt;

&lt;P&gt;You wrote that in my example the amount of work grows as the number of threads increases. I do not agree: in my opinion the amount of work is constant, independent of the number of threads, and equal to count. The OpenMP parallel for just splits the whole work among the threads.&lt;/P&gt;

&lt;P&gt;The question is still open: what is the reason for the poor parallel scaling with a larger number of threads?&lt;/P&gt;

&lt;P&gt;Thank you in advance for your support.&lt;/P&gt;

&lt;P&gt;Best regards, Krzysztof Piotrowski.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jul 2015 07:19:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997843#M22999</guid>
      <dc:creator>krzysztofpiotrowski</dc:creator>
      <dc:date>2015-07-06T07:19:42Z</dc:date>
    </item>
    <item>
      <title>Hi Krzysztof,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997844#M23000</link>
      <description>&lt;P&gt;Hi Krzysztof, In opposite, I see almost 100%-effective speedup. In between start= and stop= for 4 threads you processes 4 buffers instead of 1 for 1 thread.&lt;/P&gt;

&lt;P&gt;Execution time is almost the same ~950 clocks (Core i7), but the amount of work is 4 times bigger.&lt;/P&gt;

&lt;P&gt;If you add processed byte count, you could see something like (just an example):&lt;/P&gt;

&lt;P&gt;threads=1 time=4753, bytes processed 2.5e+009&lt;BR /&gt;
	threads=2 time=2761, bytes processed 5e+009&lt;BR /&gt;
	threads=3 time=2781, bytes processed 7.5e+009&lt;BR /&gt;
	threads=4 time=2778, bytes processed 1e+010&lt;BR /&gt;
	threads=5 time=2779, bytes processed 1.25e+010&lt;BR /&gt;
	threads=6 time=2788, bytes processed 1.5e+010&lt;BR /&gt;
	threads=7 time=2785, bytes processed 1.75e+010&lt;BR /&gt;
	threads=8 time=2792, bytes processed 2e+010&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jul 2015 08:41:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997844#M23000</guid>
      <dc:creator>Sergey_K_Intel</dc:creator>
      <dc:date>2015-07-06T08:41:00Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997845#M23001</link>
      <description>&lt;P&gt;Hi Sergey.&lt;/P&gt;

&lt;P&gt;I cannot agree with you: the number of processed pixels is always the same, width*height*count. That is because the total number of iterations in the loop equals count, and OpenMP just distributes the whole work among the threads (see the directive &lt;CODE class="preprocessor"&gt;#pragma omp for&lt;/CODE&gt;).&lt;/P&gt;

&lt;P&gt;When the number of threads equals 1:&lt;BR /&gt;
	- thread 0 computes width*height*count pixels.&lt;/P&gt;

&lt;P&gt;When the number of threads equals 2 (*):&lt;BR /&gt;
	- thread 0 computes width*height*(count/2) pixels,&lt;BR /&gt;
	- thread 1 computes width*height*(count/2) pixels.&lt;/P&gt;

&lt;P&gt;When the number of threads equals n (*):&lt;BR /&gt;
	- thread 0 computes width*height*(count/n) pixels,&lt;BR /&gt;
	- thread 1 computes width*height*(count/n) pixels,&lt;BR /&gt;
	...&lt;BR /&gt;
	- thread n-1 computes width*height*(count/n) pixels.&lt;/P&gt;

&lt;P&gt;(*) This is an idealized situation in which all threads are equally utilized and count % n == 0.&lt;/P&gt;
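
&lt;P&gt;The accounting above can be sketched in a few lines. This is only an illustration, written in Python for brevity rather than in the C++ of the test program. The function name and the round-robin split are assumptions; the real schedule(dynamic, 1) hands out iterations on demand, but the total number of iterations stays the same.&lt;/P&gt;

```python
# Hypothetical model of how `#pragma omp for` shares `count` iterations
# among n threads. A round-robin split is assumed for illustration only;
# schedule(dynamic, 1) assigns iterations on demand, but the total is equal.
def pixels_per_thread(width, height, count, n):
    # Iterations i = t, t + n, t + 2n, ... are given to thread t.
    iterations = [len(range(t, count, n)) for t in range(n)]
    return [width * height * k for k in iterations]

for n in (1, 2, 4, 6):
    shares = pixels_per_thread(5000, 5000, 100, n)
    # The sum over all threads does not depend on n.
    print(n, shares, sum(shares))
```

&lt;P&gt;Whatever n is, the per-thread shares add up to width*height*count, which is the point of the argument above.&lt;/P&gt;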

&lt;P&gt;I hope that now you will agree with me.&lt;/P&gt;

&lt;P&gt;Greetings, Krzysztof Piotrowski&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jul 2015 09:22:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997845#M23001</guid>
      <dc:creator>krzysztofpiotrowski</dc:creator>
      <dc:date>2015-07-06T09:22:41Z</dc:date>
    </item>
    <item>
      <title>Krzysztof,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997846#M23002</link>
      <description>&lt;P&gt;Krzysztof,&lt;/P&gt;

&lt;P&gt;You're right, sorry.&lt;/P&gt;

&lt;P&gt;So the only explanation I can think of is a data race, where different threads write to the same locations.&lt;/P&gt;

&lt;P&gt;Change the IPP read/write function to a read-only one, like:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Ipp64f sum = 0;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ippiSum_32f_C1R(buffer, step, roi, &amp;amp;sum, ippAlgHintAccurate);
&lt;/PRE&gt;

&lt;P&gt;and check.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jul 2015 10:09:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997846#M23002</guid>
      <dc:creator>Sergey_K_Intel</dc:creator>
      <dc:date>2015-07-06T10:09:37Z</dc:date>
    </item>
    <item>
      <title>Sergey,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997847#M23003</link>
      <description>&lt;P&gt;Sergey,&lt;/P&gt;

&lt;P&gt;I ran additional tests to check your suggestion.&lt;/P&gt;

&lt;P&gt;For the read-only function ippiSum_32f_C1R I got the following results:&lt;/P&gt;

&lt;P&gt;Intel Core i7-3770 CPU @ 3.40GHz&lt;BR /&gt;
	threads=1 time=596&lt;BR /&gt;
	threads=2 time=451&lt;BR /&gt;
	threads=3 time=429&lt;BR /&gt;
	threads=4 time=417&lt;/P&gt;

&lt;P&gt;Intel Xeon CPU E5-1660 0 @ 3.30GHz&lt;BR /&gt;
	threads=1 time=760&lt;BR /&gt;
	threads=2 time=440&lt;BR /&gt;
	threads=3 time=320&lt;BR /&gt;
	threads=4 time=270&lt;BR /&gt;
	threads=5 time=260&lt;BR /&gt;
	threads=6 time=250&lt;/P&gt;

&lt;P&gt;For the write-only function ippiSet_32f_C1R I got:&lt;/P&gt;

&lt;P&gt;Intel Core i7-3770 CPU @ 3.40GHz&lt;BR /&gt;
	threads=1 time=539&lt;BR /&gt;
	threads=2 time=417&lt;BR /&gt;
	threads=3 time=419&lt;BR /&gt;
	threads=4 time=415&lt;/P&gt;

&lt;P&gt;Intel Xeon CPU E5-1660 0 @ 3.30GHz&lt;BR /&gt;
	threads=1 time=680&lt;BR /&gt;
	threads=2 time=360&lt;BR /&gt;
	threads=3 time=270&lt;BR /&gt;
	threads=4 time=230&lt;BR /&gt;
	threads=5 time=240&lt;BR /&gt;
	threads=6 time=240&lt;/P&gt;

&lt;P&gt;All results, including the speed-up ratio (single-threaded time / multi-threaded time), are shown in the following image:&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="performance.png"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/7665i5BE596B3ED8A8D59/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="performance.png" alt="performance.png" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;Now I understand that the bottleneck of the whole process is access to data stored in RAM. Am I right?&lt;BR /&gt;
	If so, is there any known technique that could be used to work around the memory transfer limit and improve parallel scaling?&lt;/P&gt;

&lt;P&gt;Thank you in advance for your comments.&lt;/P&gt;

&lt;P&gt;Greetings Krzysztof Piotrowski.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jul 2015 11:56:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997847#M23003</guid>
      <dc:creator>krzysztofpiotrowski</dc:creator>
      <dc:date>2015-07-06T11:56:21Z</dc:date>
    </item>
    <item>
      <title>Krzysztof,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997848#M23004</link>
      <description>&lt;P&gt;Krzysztof,&lt;/P&gt;

&lt;P&gt;The best technique for successful parallelization is to make the threads work on totally different data. Only then will there be no locks. It is even better if each thread stays in its own sandbox for a long time.&lt;/P&gt;

&lt;P&gt;The next important topic is the amount of data each thread needs to work with. I am not an expert in parallelization, but I think it is better if the working set of a particular thread (the area it accesses most often) is no greater than L2 cache size / number of HW cores.&lt;/P&gt;

&lt;P&gt;There is something related to thread affinity in the Intel OpenMP docs. Because your case uses a huge image size, there are a lot of cache misses and data transfers between cache and main memory. When a thread migrates from one CPU core to another (which can happen if you haven't "nailed" the thread to a CPU core), the memory system has to refill the L1 cache of the target core, and so on... I don't even know exactly what happens.))&lt;/P&gt;
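
&lt;P&gt;A rough back-of-the-envelope check of the memory-traffic idea, as a Python sketch. It rests on assumptions that are not stated in the thread: that the clock() ticks in the first post are milliseconds, and that every in-place ippiDivC_32f_C1IR call reads and writes the whole 5000x5000 float image through main memory. The function name is hypothetical.&lt;/P&gt;

```python
# Rough effective-bandwidth estimate for the first ippiDivC_32f_C1IR test.
# Assumptions (not stated in the thread): clock() ticks are milliseconds,
# and every in-place call reads and writes the whole image through RAM.
WIDTH, HEIGHT, COUNT = 5000, 5000, 100
BYTES_PER_PIXEL = 4  # 32-bit float

def effective_gb_per_s(time_ms, passes_per_pixel=2):
    # passes_per_pixel=2 models one read plus one write of each pixel.
    traffic_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL * COUNT * passes_per_pixel
    return traffic_bytes / (time_ms / 1000) / 1e9

# Core i7-3770 timings from the first post, threads = 1..4.
for threads, time_ms in enumerate([982, 947, 945, 957], start=1):
    print(f"threads={threads} ~{effective_gb_per_s(time_ms):.1f} GB/s")
```

&lt;P&gt;Under these assumptions the estimate stays close to 20 GB/s for every thread count. A figure that barely moves as threads are added would be consistent with all cores waiting on the same memory bus rather than on computation.&lt;/P&gt;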

&lt;P&gt;Working with big images is faster and scales better with slices. But for 5Kx5K images of float data, even slices might not help. Very roughly, if you have an 8-core CPU with 8 MB of L2 cache, the best slice size for 5K-wide data is about 8 MB / 8 cores / 5000 / 4 (sizeof float) = 50 lines. Very roughly))&lt;/P&gt;
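
&lt;P&gt;The slice-size arithmetic above can be written as a tiny helper. This is a sketch in Python with a hypothetical function name; the cache size and core count are the illustrative values from the estimate, not measurements of a particular CPU.&lt;/P&gt;

```python
def slice_rows(cache_bytes, cores, width, bytes_per_pixel=4):
    # Rows of one image that fit into one core's share of the cache.
    per_core_budget = cache_bytes // cores
    return per_core_budget // (width * bytes_per_pixel)

# Decimal megabytes, matching the rough estimate in the text.
print(slice_rows(8_000_000, 8, 5000))
# Binary megabytes give a slightly larger slice.
print(slice_rows(8 * 2**20, 8, 5000))
```

&lt;P&gt;Plugging in the real cache size and core count of the target machine gives a starting point for the slice height, which would still need to be tuned by measurement.&lt;/P&gt;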

&lt;P&gt;Again, this is not exactly your case if you want to parallelize at a higher level than simple slicing, but nevertheless keep the working data independent and compact.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jul 2015 13:16:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Single-threaded-IPP-and-external-parallelization/m-p/997848#M23004</guid>
      <dc:creator>Sergey_K_Intel</dc:creator>
      <dc:date>2015-07-06T13:16:36Z</dc:date>
    </item>
  </channel>
</rss>

