<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cache wouldn't matter here I in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112708#M25470</link>
    <description>&lt;P&gt;I believe cache wouldn't matter here if your concern is multithreading.&lt;/P&gt;

&lt;P&gt;If the cache behaves the same way in all of those experiments, fetching data from cache wouldn't change the comparison.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I tested with 'for (i = 0; i &amp;lt; 1; i++)', and the results (in seconds) were:&lt;/P&gt;

&lt;P&gt;4 threads -&amp;gt; 0.016, 2 threads -&amp;gt; 0.031, 1 thread -&amp;gt; 0.063.&lt;/P&gt;

&lt;P&gt;Did you try it yourself?&lt;/P&gt;</description>
    <pubDate>Tue, 10 May 2016 06:24:58 GMT</pubDate>
    <dc:creator>Jonghak_K_Intel</dc:creator>
    <dc:date>2016-05-10T06:24:58Z</dc:date>
    <item>
      <title>Multi Threading Performance in Multiplication of 2 Arrays / Images - Intel IPP</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112700#M25462</link>
      <description>&lt;P&gt;I'm using Intel IPP for multiplication of 2 Images (Arrays).&lt;BR /&gt;
	I'm using Intel IPP 8.2 which comes with Intel Composer 2015 Update 6.&lt;/P&gt;

&lt;P&gt;I created a simple function to multiply two large images (the whole project is attached, see below).&lt;BR /&gt;
	I wanted to see the gains from using the Intel IPP Multi Threaded Library.&lt;/P&gt;

&lt;P&gt;Here is the simple project (I also attached the complete project from Visual Studio):&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include &amp;lt;ctime&amp;gt;
#include &amp;lt;iostream&amp;gt;

using namespace std;

const int height = 6000;
const int width  = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    for (int i = 0; i &amp;lt; 200; i++)
        ippiMul_32f_C1R(mInput_image, 6000 * 4, mInput_image, 6000 * 4, mOutput_image, 6000 * 4, size); 

    double end = clock();
    double duration = (end - start) / static_cast&amp;lt;double&amp;gt;(CLOCKS_PER_SEC);

    cout &amp;lt;&amp;lt; duration &amp;lt;&amp;lt; endl;
    cin.get();

    return 0;
}&lt;/PRE&gt;

&lt;P&gt;I compiled this project once using Intel IPP Single Threaded and once using Intel IPP Multi Threaded.&lt;/P&gt;

&lt;P&gt;I tried different sizes of arrays and in all of them the Multi Threaded version yields no gains (Sometimes it is even slower).&lt;/P&gt;

&lt;P&gt;Why is there no gain from multithreading in this task?&lt;BR /&gt;
	I know Intel IPP uses AVX, so I thought the task might be memory bound.&lt;/P&gt;

&lt;P&gt;I tried another approach: using OpenMP manually to run the Intel IPP Single Threaded implementation on multiple threads.&lt;BR /&gt;
	This is the code:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include &amp;lt;ctime&amp;gt;
#include &amp;lt;iostream&amp;gt;

using namespace std;

#include &amp;lt;omp.h&amp;gt;

const int height = 5000;
const int width  = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    const int NUM_BLOCK = 4;
    omp_set_num_threads(NUM_BLOCK);

    // One horizontal strip per thread; assumes height is divisible by NUM_BLOCK.
    IppiSize blockSize = {width, height / NUM_BLOCK};

    Ipp32f*  in;
    Ipp32f*  out;

    //  ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);

    #pragma omp parallel            \
    shared(mInput_image, mOutput_image, blockSize) \
    private(in, out)
    {
        int id   = omp_get_thread_num();
        int step = blockSize.width * blockSize.height * id;
        in       = mInput_image  + step;
        out      = mOutput_image + step;
        ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
    }

    double end = clock();
    double duration = (end - start) / static_cast&amp;lt;double&amp;gt;(CLOCKS_PER_SEC);

    cout &amp;lt;&amp;lt; duration &amp;lt;&amp;lt; endl;
    cin.get();

    return 0;
}&lt;/PRE&gt;

&lt;P&gt;The results were the same: again, no performance gain.&lt;/P&gt;

&lt;P&gt;Is there a way to benefit from multithreading in this kind of task?&lt;BR /&gt;
	How can I check whether a task has become memory bound, so that parallelizing it brings no benefit? Is there any benefit to parallelizing the multiplication of 2 arrays on a CPU with AVX?&lt;/P&gt;

&lt;P&gt;The computers I tried it on are based on the Core i7 4770K (Haswell).&lt;/P&gt;

&lt;P&gt;Here is a link to the &lt;A href="https://onedrive.live.com/redir?resid=D34B85E597BC6CBE!5945&amp;amp;authkey=!ADBAPUeN0wDh9-M&amp;amp;ithint=file%2czip"&gt;Project in Visual Studio 2013&lt;/A&gt;.&lt;/P&gt;

&lt;P&gt;Thank You.&lt;/P&gt;</description>
      <pubDate>Mon, 02 May 2016 14:47:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112700#M25462</guid>
      <dc:creator>Royi</dc:creator>
      <dc:date>2016-05-02T14:47:35Z</dc:date>
    </item>
    <item>
      <title>Anyone?</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112701#M25463</link>
      <description>&lt;P&gt;Anyone?&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;By the way, it happens with the boxFilter implementation as well.&lt;BR /&gt;
	What's the point in the Multi Threaded implementation if there are no gains?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thank You.&lt;/P&gt;</description>
      <pubDate>Wed, 04 May 2016 06:20:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112701#M25463</guid>
      <dc:creator>Royi</dc:creator>
      <dc:date>2016-05-04T06:20:49Z</dc:date>
    </item>
    <item>
      <title>Hi Royi,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112702#M25464</link>
      <description>&lt;P&gt;Hi Royi,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I assume you have read that IPP's internal multithreading has been deprecated.&lt;/P&gt;

&lt;P&gt;Your OpenMP example works fine on my machine and shows the benefit of using multiple threads. (I changed it slightly to match your first example: a 'for' loop was added to run 200 iterations.)&lt;/P&gt;

&lt;P&gt;You can change "NUM_BLOCK = 4;" to see how the number of threads affects the results.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;
#include "stdafx.h"
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include &amp;lt;ctime&amp;gt;
#include &amp;lt;iostream&amp;gt;

using namespace std;

#include &amp;lt;omp.h&amp;gt;

const int height = 5000;
const int width = 5000;
Ipp32f mInput_image[1 * width * height];
Ipp32f mOutput_image[1 * width * height] = { 0 };

int main()
{
    IppiSize size = { width, height };

    double start = clock();

    const int NUM_BLOCK = 4;

    IppiSize blockSize = { width, height / NUM_BLOCK };

    omp_set_num_threads(NUM_BLOCK);

    Ipp32f*  in;
    Ipp32f*  out;

    //  ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);
    int i;
    for (i = 0; i &amp;lt; 200; i++) {
        #pragma omp parallel            \
        shared(mInput_image, mOutput_image, blockSize) \
        private(in, out)
        {
            // Each thread multiplies its own horizontal strip of the image.
            int id = omp_get_thread_num();
            int step = blockSize.width * blockSize.height * id;
            in = mInput_image + step;
            out = mOutput_image + step;
            ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
        }
    }

    double end = clock();
    double duration = (end - start) / static_cast&amp;lt;double&amp;gt;(CLOCKS_PER_SEC);

    cout &amp;lt;&amp;lt; duration &amp;lt;&amp;lt; endl;
    cin.get();

    return 0;
}&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 09 May 2016 06:57:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112702#M25464</guid>
      <dc:creator>Jonghak_K_Intel</dc:creator>
      <dc:date>2016-05-09T06:57:55Z</dc:date>
    </item>
    <item>
      <title>Hi Jon,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112703#M25465</link>
      <description>&lt;P&gt;Hi Jon,&lt;/P&gt;

&lt;P&gt;Yes, we are aware that the Multi Threaded libraries are deprecated.&lt;BR /&gt;
	I wish you would reverse that decision; many of us need low-level Multi Threaded functions.&lt;/P&gt;

&lt;P&gt;What run time gain do you see?&lt;BR /&gt;
	Are you on a fast desktop CPU?&lt;BR /&gt;
	Could you try the same using filterBox?&lt;/P&gt;

&lt;P&gt;On our Haswell we have zero gains.&lt;BR /&gt;
	We'll try your code again to verify.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Thank You.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 09 May 2016 21:50:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112703#M25465</guid>
      <dc:creator>Royi</dc:creator>
      <dc:date>2016-05-09T21:50:39Z</dc:date>
    </item>
    <item>
      <title>Hi  Royi,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112704#M25466</link>
      <description>&lt;P&gt;Hi&amp;nbsp; Royi,&lt;/P&gt;

&lt;P&gt;with "NUM_BLOCK = 1" I get about 5.5xx&lt;/P&gt;

&lt;P&gt;with "NUM_BLOCK = 2" it is 2.6xx&lt;/P&gt;

&lt;P&gt;with "NUM_BLOCK = 4" it is 1.3xx&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;It is a Haswell with 4 logical CPUs.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Could you let me know what filterBox is?&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 02:23:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112704#M25466</guid>
      <dc:creator>Jonghak_K_Intel</dc:creator>
      <dc:date>2016-05-10T02:23:41Z</dc:date>
    </item>
    <item>
      <title>Hi Jon,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112705#M25467</link>
      <description>&lt;P&gt;Hi Jon,&lt;/P&gt;

&lt;P&gt;First, what's the image size you're using?&lt;BR /&gt;
	Could you use 4000 x 4000 or bigger?&lt;/P&gt;

&lt;P&gt;Moreover, I don't understand your gains.&lt;BR /&gt;
	Is it the fastest for "NUM_BLOCK = 1"?&lt;BR /&gt;
	This means the slowest is "NUM_BLOCK = 4"?&lt;/P&gt;

&lt;P&gt;Box Filter is given by Intel IPP "FilterBox" -&amp;nbsp;https://software.intel.com/en-us/node/504124.&lt;/P&gt;

&lt;P&gt;Thank You.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 05:09:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112705#M25467</guid>
      <dc:creator>Royi</dc:creator>
      <dc:date>2016-05-10T05:09:20Z</dc:date>
    </item>
    <item>
      <title>Hi Royi,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112706#M25468</link>
      <description>&lt;P&gt;Hi Royi,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Sorry if that was confusing; the numbers I posted were the outputs of your example.&lt;/P&gt;

&lt;P&gt;So with NUM_BLOCK = 1 I got about 5.5xx as the output, and with NUM_BLOCK = 4 I got about 1.3xx.&lt;/P&gt;

&lt;P&gt;The output is the elapsed time in seconds, so lower is faster, which means NUM_BLOCK = 4 was the fastest.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I also used a 5000 x 5000 image with 200 loop iterations, as written in the code above.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thanks !&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 05:19:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112706#M25468</guid>
      <dc:creator>Jonghak_K_Intel</dc:creator>
      <dc:date>2016-05-10T05:19:13Z</dc:date>
    </item>
    <item>
      <title>Hi Jon,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112707#M25469</link>
      <description>&lt;P&gt;Hi Jon,&lt;/P&gt;

&lt;P&gt;Could you try the code in its original form?&lt;BR /&gt;
	I suspect the loop makes the data available in cache.&lt;/P&gt;

&lt;P&gt;Could you try without the loop you added?&lt;/P&gt;

&lt;P&gt;Thank You.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 05:38:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112707#M25469</guid>
      <dc:creator>Royi</dc:creator>
      <dc:date>2016-05-10T05:38:22Z</dc:date>
    </item>
    <item>
      <title>Cache wouldn't matter here I</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112708#M25470</link>
      <description>&lt;P&gt;I believe cache wouldn't matter here if your concern is multithreading.&lt;/P&gt;

&lt;P&gt;If the cache behaves the same way in all of those experiments, fetching data from cache wouldn't change the comparison.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I tested with 'for (i = 0; i &amp;lt; 1; i++)', and the results (in seconds) were:&lt;/P&gt;

&lt;P&gt;4 threads -&amp;gt; 0.016, 2 threads -&amp;gt; 0.031, 1 thread -&amp;gt; 0.063.&lt;/P&gt;

&lt;P&gt;Did you try it yourself?&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 06:24:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Multi-Threading-Performance-in-Multiplication-of-2-Arrays-Images/m-p/1112708#M25470</guid>
      <dc:creator>Jonghak_K_Intel</dc:creator>
      <dc:date>2016-05-10T06:24:58Z</dc:date>
    </item>
  </channel>
</rss>

