<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic External Multi-threading not working for IPPI functions in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765927#M210</link>
    <description>Igor,&lt;BR /&gt;&lt;BR /&gt;Yes, I agree that disabling the IPP threading and doing the threading in our own application is the better approach for us with our big images. When I do that for one of our functions I get a minor speedup (5%) with the two processors in my test system. Since both external threads are calling the same IPPI functions to process different slices of the same image, I think the cache limitations come into play.&lt;BR /&gt;&lt;BR /&gt;Thanks for your help.&lt;BR /&gt;&lt;BR /&gt;Paul</description>
    <pubDate>Thu, 12 Jan 2012 18:02:03 GMT</pubDate>
    <dc:creator>paulsgauthier</dc:creator>
    <dc:date>2012-01-12T18:02:03Z</dc:date>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765914#M197</link>
      <description>I'm trying to speed up our image processing software that uses some image arithmetic functions from IPPI. I first attempted to upgrade to the latest IPP (7.0) and let it do the OpenMP threading internally. That did not work. On a Core 2 Duo processor (Win7) I got no speedup at all for two threads over one (although Task Manager showed that both hardware processors were pegged at 100%). I followed all the suggestions that I could find on this forum but nothing worked.&lt;BR /&gt;&lt;BR /&gt;So now I've called ippSetNumThreads(1) to disable OpenMP and created two threads of my own that process either the top half of a 1280x960 image (thread 1) or the lower half (thread 2). I do this by simply giving the second thread an offset into the image and processing 960/2 or 480 lines.&lt;BR /&gt;&lt;BR /&gt;This also does not work and I can't imagine why not. The total execution time on this machine for a series of arithmetic functions is about 16 msec per loop whether I use a single thread to process the full image or two threads to process each half of the image.&lt;BR /&gt;&lt;BR /&gt;Can someone suggest what might be going on here?&lt;BR /&gt;&lt;BR /&gt;Paul Gauthier</description>
      <pubDate>Fri, 06 Jan 2012 19:09:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765914#M197</guid>
      <dc:creator>paulsgauthier</dc:creator>
      <dc:date>2012-01-06T19:09:57Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765915#M198</link>
      <description>&lt;P&gt;Besides the image processing itself, is your application running other threads, such as a C# GUI or a control task? If so, those tasks might be sharing a core with IPP's image processing threads and blocking each other (it all depends on your application architecture and flow).&lt;BR /&gt;In that case you can only "feel" the speedup on a quad core or above, where you run IPP on cores separate from those of your other tasks.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Jan 2012 20:43:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765915#M198</guid>
      <dc:creator>OKohl</dc:creator>
      <dc:date>2012-01-06T20:43:07Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765916#M199</link>
      <description>Thanks for the reply. Yes, it's true the two hardware processors are used by many other threads, in our application as well as in the operating system. But while the image processing is going on there are no CPU-intensive activities happening. The GUI thread is waiting for the user to press a button. According to Task Manager there are no other applications popping up to steal time.</description>
      <pubDate>Fri, 06 Jan 2012 20:56:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765916#M199</guid>
      <dc:creator>paulsgauthier</dc:creator>
      <dc:date>2012-01-06T20:56:18Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765917#M200</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1325900732765="55" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=304812" href="https://community.intel.com/en-us/profile/304812/" class="basic"&gt;paulsgauthier&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;EM&gt;...&lt;BR /&gt;So now I've called ippSetNumThreads(1) to disable OpenMP and created two threads of my own that process either the top half of a 1280x960 image (thread 1) or the lower half (thread 2). I do this by simply giving the second thread an offset into the image and processing 960/2 or 480 lines.&lt;BR /&gt;...&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;[SergeyK] Did you try to call the &lt;SPAN style="text-decoration: underline;"&gt;SetThreadAffinityMask&lt;/SPAN&gt; Win32 API function for both threads? If Thread1 works on CPU1 and Thread2 works on CPU2, there must be a performance improvement.&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;This also does not work and I can't imagine why not. The total execution time on this machine for a series of arithmetic functions is about 16 msec per loop whether I use a single thread to process the full image or two threads to process each half of the image.&lt;BR /&gt;&lt;BR /&gt;Can someone suggest what might be going on here?&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;A simple test case would help to identify the problem.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Sat, 07 Jan 2012 01:54:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765917#M200</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-01-07T01:54:06Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765918#M201</link>
      <description>Two more things I would check:&lt;BR /&gt;&lt;BR /&gt;a. Image supplier thread: are the images supplied by a camera, and is the camera using a callback that might consume CPU time? (Remember it's the same application, so it's hard to detect in Task Manager.)&lt;BR /&gt;&lt;BR /&gt;b. Not all IPPI functions are multithreaded; I would recheck the documentation.&lt;BR /&gt;&lt;BR /&gt;Good luck</description>
      <pubDate>Sat, 07 Jan 2012 15:48:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765918#M201</guid>
      <dc:creator>OKohl</dc:creator>
      <dc:date>2012-01-07T15:48:37Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765919#M202</link>
      <description>Thanks OKohl, the images are already in memory. There's no camera callback involved. Also, the functions I'm calling are listed in the threaded function list.&lt;BR /&gt;&lt;BR /&gt;Sergey, I've not called &lt;B&gt;SetThreadAffinityMask&lt;/B&gt; but Task Manager is telling me both cores are fully engaged when my loop is running so I didn't think it necessary. I'll try it anyway.&lt;BR /&gt;&lt;BR /&gt;It's acting as if the IPPI functions have an Enter/LeaveCriticalSection in them preventing simultaneous execution.&lt;BR /&gt;&lt;BR /&gt;Could there be a problem with the two threads operating on the same memory image (one top-half, the other bottom-half)?&lt;BR /&gt;&lt;BR /&gt;Paul G.&lt;BR /&gt;</description>
      <pubDate>Sat, 07 Jan 2012 16:02:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765919#M202</guid>
      <dc:creator>paulsgauthier</dc:creator>
      <dc:date>2012-01-07T16:02:02Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765920#M203</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1325963124046="55" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=304812" href="https://community.intel.com/en-us/profile/304812/" class="basic"&gt;paulsgauthier&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;EM&gt;Thanks OKohl, the images are already in memory. There's no camera callback involved. Also, the functions I'm calling are listed in the threaded function list.&lt;BR /&gt;&lt;BR /&gt;Sergey, I've not called &lt;B&gt;SetThreadAffinityMask&lt;/B&gt; but Task Manager is telling me both cores are fully engaged when my loop is running so I didn't think it necessary. I'll try it anyway.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;[SergeyK] Don't forget to call '&lt;SPAN style="text-decoration: underline;"&gt;Sleep(0)&lt;/SPAN&gt;' &lt;SPAN style="text-decoration: underline;"&gt;right after a call&lt;/SPAN&gt; to 'SetThreadAffinityMask(...)'&lt;BR /&gt;because a CPU needs some time.&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;It's acting as if the IPPI functions have an Enter/LeaveCriticalSection in them preventing simultaneous execution.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;[SergeyK] It would be nice to hear some technical details from IPP's software developers.&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;Could there be a problem with the two threads operating on the same memory image (one top-half, the other bottom-half)?&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;[SergeyK] I don't think so. I used the &lt;SPAN style="text-decoration: underline;"&gt;same technique&lt;/SPAN&gt; to do linear algebra processing for a matrix on two CPUs.&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;Paul G.&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Sat, 07 Jan 2012 19:14:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765920#M203</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-01-07T19:14:37Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765921#M204</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;1) IPP functions don't use Enter/LeaveCriticalSection&lt;BR /&gt;2) could you provide a list of functions you use - not all IPP functions have internal threading&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Igor</description>
      <pubDate>Sun, 08 Jan 2012 10:23:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765921#M204</guid>
      <dc:creator>igorastakhov</dc:creator>
      <dc:date>2012-01-08T10:23:26Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765922#M205</link>
      <description>Igor,&lt;BR /&gt;&lt;BR /&gt;Here's the info about the IPPI lib I'm using:&lt;BR /&gt;&lt;BR /&gt;CPU : v8&lt;BR /&gt;Name : ippiv8-7.0.dll&lt;BR /&gt;Version : 7.0 build 205.85&lt;BR /&gt;Build date: Nov 26 2011&lt;BR /&gt;&lt;BR /&gt;I'm using the following IPP calls:&lt;BR /&gt;&lt;BR /&gt;ippiSub_32f_C1R&lt;BR /&gt;ippiSqr_32f_C1R&lt;BR /&gt;ippiAdd_32f_C1IR&lt;BR /&gt;ippiSqrt_32f_C1R&lt;BR /&gt;
&lt;BR /&gt;I've included my little test program below. When I set the number of cores to use to 2 (on my 2-core system) I get the following output:&lt;BR /&gt;&lt;BR /&gt;Ipp init: ippStsNoErr: No error, it's OK&lt;BR /&gt;Number of cores = 2&lt;BR /&gt;Testing with 2 processors&lt;BR /&gt;2000 Iterations in 30.086 seconds (15.043 msec per iteration)&lt;BR /&gt;&lt;BR /&gt;When I set the number of cores to use to 1, I get the following output:&lt;BR /&gt;&lt;BR /&gt;Ipp init: ippStsNoErr: No error, it's OK&lt;BR /&gt;Number of cores = 2&lt;BR /&gt;Testing with 1 processors&lt;BR /&gt;2000 Iterations in 30.251 seconds (15.1255 msec per iteration)&lt;BR /&gt;&lt;BR /&gt;As you can see, in both cases the average time per loop is about the same.&lt;BR /&gt;&lt;BR /&gt;__________________________________&lt;BR /&gt;&lt;BR /&gt;Here's the entire test program. It just allocates some 1280x960 images and then calculates:&lt;BR /&gt;&lt;BR /&gt;Result = sqrt( sqr(A - B) + sqr(C - D) )&lt;BR /&gt;&lt;BR /&gt;__________________________________________________________&lt;BR /&gt;&lt;BR /&gt;int _tmain(int argc, _TCHAR* argv[])&lt;BR /&gt;{&lt;BR /&gt; int Width = 1280, Height = 960;&lt;BR /&gt; IppiSize iSize; iSize.width = Width; iSize.height = Height;&lt;BR /&gt; // 9 buffers: the loops below use indices 0..8&lt;BR /&gt; float* pIm[9];&lt;BR /&gt; for (int i = 0 ; i &amp;lt; 9 ; i++)&lt;BR /&gt; {&lt;BR /&gt;  pIm[i] = (float*)new float[Width * Height];&lt;BR /&gt;  ippiSet_32f_C1R((float)100, pIm[i], Width*sizeof(float), iSize);&lt;BR /&gt; }&lt;BR /&gt; libInfo();&lt;BR /&gt; ippInit();&lt;BR /&gt; IppStatus sts;&lt;BR /&gt; sts = ippInitCpu(ippCpuC2D);&lt;BR /&gt; cout &amp;lt;&amp;lt; "Ipp init: " &amp;lt;&amp;lt; ippGetStatusString( sts ) &amp;lt;&amp;lt; endl;&lt;BR /&gt; cout &amp;lt;&amp;lt; "Number of cores = " &amp;lt;&amp;lt; ippGetNumCoresOnDie() &amp;lt;&amp;lt; endl;&lt;BR /&gt; int NumProcessors = 1;&lt;BR /&gt; ippSetNumThreads(NumProcessors);&lt;BR /&gt; ippGetNumThreads(&amp;amp;NumProcessors);&lt;BR /&gt;
 cout &amp;lt;&amp;lt; "Testing with " &amp;lt;&amp;lt; NumProcessors &amp;lt;&amp;lt; " processors" &amp;lt;&amp;lt; endl;&lt;BR /&gt; clock_t tStart = clock();&lt;BR /&gt; int NumRepeats = 2000;&lt;BR /&gt; for (int i = 0 ; i &amp;lt; NumRepeats ; i++)&lt;BR /&gt; {&lt;BR /&gt;  // Ref1 = A - B&lt;BR /&gt;  ippiSub_32f_C1R(pIm[0], Width*sizeof(float),pIm[1], Width*sizeof(float),pIm[2], Width*sizeof(float), iSize);&lt;BR /&gt;  // Ref2 = C - D&lt;BR /&gt;  ippiSub_32f_C1R(pIm[3], Width*sizeof(float),pIm[4], Width*sizeof(float),pIm[5], Width*sizeof(float), iSize);&lt;BR /&gt;  // Temp1 = Sqr(Ref1), Temp2 = Sqr(Ref2)&lt;BR /&gt;  ippiSqr_32f_C1R(pIm[2], Width*sizeof(float), pIm[6], Width*sizeof(float), iSize);&lt;BR /&gt;  ippiSqr_32f_C1R(pIm[5], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);&lt;BR /&gt;  // Result = Sqrt(Temp1 + Temp2)&lt;BR /&gt;  ippiAdd_32f_C1IR(pIm[6], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);&lt;BR /&gt;  ippiSqrt_32f_C1R(pIm[7], Width*sizeof(float), pIm[8], Width*sizeof(float), iSize);&lt;BR /&gt; }&lt;BR /&gt; clock_t tEnd = clock();&lt;BR /&gt; double tSec = double(tEnd-tStart)/CLOCKS_PER_SEC;&lt;BR /&gt; double tMsec = 1000.0*tSec / NumRepeats;&lt;BR /&gt; cout &amp;lt;&amp;lt; NumRepeats &amp;lt;&amp;lt;" Iterations in " &amp;lt;&amp;lt; tSec &amp;lt;&amp;lt; " seconds (" &amp;lt;&amp;lt; tMsec &amp;lt;&amp;lt; " msec per iteration)" &amp;lt;&amp;lt; endl; &lt;BR /&gt; getchar();&lt;BR /&gt; for (int i = 0 ; i &amp;lt; 9 ; i++)&lt;BR /&gt;  delete[] pIm[i];&lt;BR /&gt; return 0;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;Paul G.</description>
      <pubDate>Sun, 08 Jan 2012 17:58:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765922#M205</guid>
      <dc:creator>paulsgauthier</dc:creator>
      <dc:date>2012-01-08T17:58:43Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765923#M206</link>
      <description>Thank you, Paul.&lt;BR /&gt;&lt;BR /&gt;I'll test your example and come back soon.&lt;BR /&gt;Below is part of the list of threaded functions - all functions you use are threaded - so I'll try to find the problem.&lt;BR /&gt;&lt;BR /&gt;(Sorry, I don't know how to attach a txt file, so here are only a few lines.)&lt;BR /&gt;...............&lt;BR /&gt;&lt;P&gt;ippiAddC_16s_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiAddC_16s_C1RSfs&lt;/P&gt;&lt;P&gt;ippiAddC_16s_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiAddC_16s_C3RSfs&lt;/P&gt;&lt;P&gt;ippiAddC_32f_AC4IR&lt;/P&gt;&lt;P&gt;ippiAddC_32f_C1R&lt;/P&gt;&lt;P&gt;ippiAddC_32f_C3IR&lt;/P&gt;&lt;P&gt;ippiAddC_32f_C3R&lt;/P&gt;&lt;P&gt;ippiAddC_8u_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiAddC_8u_C1RSfs&lt;/P&gt;&lt;P&gt;ippiAddC_8u_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiAddC_8u_C3RSfs&lt;/P&gt;&lt;P&gt;ippiAddProduct_16u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddProduct_16u32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddProduct_32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddProduct_32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddProduct_8s32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddProduct_8s32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddProduct_8u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddProduct_8u32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddSquare_16u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddSquare_16u32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddSquare_32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddSquare_32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddSquare_8s32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddSquare_8s32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddSquare_8u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddSquare_8u32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_16u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_16u32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_8s32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_8s32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_8u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAddWeighted_8u32f_C1IR&lt;/P&gt;
&lt;P&gt;ippiAdd_16s_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiAdd_16s_C1RSfs&lt;/P&gt;&lt;P&gt;ippiAdd_16s_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiAdd_16s_C3RSfs&lt;/P&gt;&lt;P&gt;ippiAdd_16s_C4IRSfs&lt;/P&gt;&lt;P&gt;ippiAdd_16s_C4RSfs&lt;/P&gt;&lt;P&gt;ippiAdd_16u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAdd_16u32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C1R&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C3IR&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C3R&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C4IR&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C4R&lt;/P&gt;&lt;P&gt;ippiAdd_8s32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAdd_8s32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAdd_8u32f_C1IMR&lt;/P&gt;&lt;P&gt;ippiAdd_8u32f_C1IR&lt;/P&gt;&lt;P&gt;ippiAdd_8u_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiAdd_8u_C1RSfs&lt;/P&gt;&lt;P&gt;ippiAdd_8u_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiAdd_8u_C3RSfs&lt;/P&gt;&lt;P&gt;ippiAdd_8u_C4IRSfs&lt;/P&gt;&lt;P&gt;ippiAdd_8u_C4RSfs&lt;/P&gt;&lt;BR /&gt;.............&lt;BR /&gt;
&lt;P&gt;ippiSqrt_16s_AC4IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16s_AC4RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16s_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16s_C1RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16s_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16s_C3RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16u_AC4IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16u_AC4RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16u_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16u_C1RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16u_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_16u_C3RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_AC4IR&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_AC4R&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_C1IR&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_C1R&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_C3IR&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_C3R&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_C4IR&lt;/P&gt;&lt;P&gt;ippiSqrt_8u_AC4IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_8u_AC4RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_8u_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_8u_C1RSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_8u_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiSqrt_8u_C3RSfs&lt;/P&gt;&lt;P&gt;ippiSub128_JPEG_8u16s_C1R&lt;/P&gt;&lt;P&gt;ippiSubC_16s_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiSubC_16s_C1RSfs&lt;/P&gt;&lt;P&gt;ippiSubC_16s_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiSubC_16s_C3RSfs&lt;/P&gt;&lt;P&gt;ippiSubC_32f_AC4IR&lt;/P&gt;&lt;P&gt;
ippiSubC_32f_C1R&lt;/P&gt;&lt;P&gt;ippiSubC_32f_C3IR&lt;/P&gt;&lt;P&gt;ippiSubC_32f_C3R&lt;/P&gt;&lt;P&gt;ippiSubC_8u_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiSubC_8u_C1RSfs&lt;/P&gt;&lt;P&gt;ippiSubC_8u_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiSubC_8u_C3RSfs&lt;/P&gt;&lt;P&gt;ippiSub_16s_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiSub_16s_C1RSfs&lt;/P&gt;&lt;P&gt;ippiSub_16s_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiSub_16s_C3RSfs&lt;/P&gt;&lt;P&gt;ippiSub_16s_C4IRSfs&lt;/P&gt;&lt;P&gt;ippiSub_16s_C4RSfs&lt;/P&gt;&lt;P&gt;ippiSub_32f_C1IR&lt;/P&gt;&lt;P&gt;ippiSub_32f_C1R&lt;/P&gt;&lt;P&gt;ippiSub_32f_C3IR&lt;/P&gt;&lt;P&gt;ippiSub_32f_C3R&lt;/P&gt;&lt;P&gt;ippiSub_32f_C4IR&lt;/P&gt;&lt;P&gt;ippiSub_32f_C4R&lt;/P&gt;&lt;P&gt;ippiSub_8u_C1IRSfs&lt;/P&gt;&lt;P&gt;ippiSub_8u_C1RSfs&lt;/P&gt;&lt;P&gt;ippiSub_8u_C3IRSfs&lt;/P&gt;&lt;P&gt;ippiSub_8u_C3RSfs&lt;/P&gt;&lt;P&gt;ippiSub_8u_C4IRSfs&lt;/P&gt;&lt;P&gt;ippiSub_8u_C4RSfs&lt;/P&gt;..............................&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Igor</description>
      <pubDate>Mon, 09 Jan 2012 18:46:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765923#M206</guid>
      <dc:creator>igorastakhov</dc:creator>
      <dc:date>2012-01-09T18:46:26Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765924#M207</link>
      <description>Hi!&lt;BR /&gt;&lt;P&gt;The example processes 9 float images of 1280 * 960 pixels (about 4.9 MB each, roughly 44 MB in total). That is far too much data for all of these images to fit in the cache at the same time. On the other hand, arithmetic functions don't need much computation. So the bottleneck for this example is memory access, and parallelizing doesn't improve performance.&lt;/P&gt;&lt;P&gt;However, if we process the images sequentially in small pieces, block by block, we can succeed.&lt;/P&gt;&lt;P&gt;I decreased the image size and the number of images, and changed the code slightly.&lt;/P&gt;&lt;P&gt;After that we can see a performance improvement as the number of threads increases.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Ipp init: ippStsNoErr: No error, it's OK&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Number of cores = 2&lt;/P&gt;&lt;P&gt;Testing with 1 processors&lt;/P&gt;&lt;P&gt;2000 Iterations in 0.38 seconds (0.19 msec per iteration)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Ipp init: ippStsNoErr: No error, it's OK&lt;/P&gt;&lt;P&gt;Number of cores = 2&lt;/P&gt;&lt;P&gt;Testing with 2 processors&lt;/P&gt;&lt;P&gt;2000 Iterations in 0.22 seconds (0.11 msec per iteration)&lt;BR /&gt;&lt;BR /&gt;The code with changes:&lt;/P&gt;&lt;P&gt;int _tmain(int argc, _TCHAR* argv[])&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;// int Width = 1280, Height = 960;&lt;/P&gt;&lt;P&gt;int Width = 4096, Height = 16;&lt;/P&gt;&lt;P&gt;IppiSize iSize; iSize.width = Width; iSize.height = Height;&lt;/P&gt;&lt;P&gt;// 9 buffers: the loops below use indices 0..8&lt;/P&gt;&lt;P&gt;float* pIm[9];&lt;/P&gt;&lt;P&gt;for (int i = 0 ; i &amp;lt; 9 ; i++)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;pIm[i] = (float*)new float[Width * Height];&lt;/P&gt;&lt;P&gt;ippiSet_32f_C1R((float)(100-i), pIm[i], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;// libInfo();&lt;/P&gt;&lt;P&gt;ippInit();&lt;/P&gt;&lt;P&gt;IppStatus sts;&lt;/P&gt;&lt;P&gt;sts = ippInitCpu(ippCpuC2D);&lt;/P&gt;&lt;P&gt;cout 
&amp;lt;&amp;lt; "Ipp init: " &amp;lt;&amp;lt; ippGetStatusString( sts ) &amp;lt;&amp;lt; endl;&lt;/P&gt;&lt;P&gt;cout &amp;lt;&amp;lt; "Number of cores = " &amp;lt;&amp;lt; ippGetNumCoresOnDie() &amp;lt;&amp;lt; endl;&lt;/P&gt;&lt;P&gt;int NumProcessors = 1; /*=2;*/&lt;/P&gt;&lt;P&gt;ippSetNumThreads(NumProcessors);&lt;/P&gt;&lt;P&gt;ippGetNumThreads(&amp;amp;NumProcessors);&lt;/P&gt;&lt;P&gt;cout &amp;lt;&amp;lt; "Testing with " &amp;lt;&amp;lt; NumProcessors &amp;lt;&amp;lt; " processors" &amp;lt;&amp;lt; endl;&lt;/P&gt;&lt;P&gt;clock_t tStart = clock();&lt;/P&gt;&lt;P&gt;int NumRepeats = 2000;&lt;/P&gt;&lt;P&gt;/*&lt;/P&gt;&lt;P&gt;for (int i = 0 ; i &amp;lt; NumRepeats ; i++)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;// Ref1 = A - B&lt;/P&gt;&lt;P&gt;ippiSub_32f_C1R(pIm[0], Width*sizeof(float),pIm[1], Width*sizeof(float),pIm[2], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;// Ref2 = C - D&lt;/P&gt;&lt;P&gt;ippiSub_32f_C1R(pIm[3], Width*sizeof(float),pIm[4], Width*sizeof(float),pIm[5], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;// Temp1 = Sqr(Ref1), Temp2 = Sqr(Ref2)&lt;/P&gt;&lt;P&gt;ippiSqr_32f_C1R(pIm[2], Width*sizeof(float), pIm[6], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;ippiSqr_32f_C1R(pIm[5], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;// Result = Sqrt(Temp1 + Temp2)&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C1IR(pIm[6], Width*sizeof(float), pIm[7], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;ippiSqrt_32f_C1R(pIm[7], Width*sizeof(float), pIm[8], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;*/&lt;/P&gt;&lt;P&gt;for (int i = 0 ; i &amp;lt; NumRepeats ; i++)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;// Ref1 = A - B&lt;/P&gt;&lt;P&gt;ippiSub_32f_C1R(pIm[0], Width*sizeof(float),pIm[1], Width*sizeof(float),pIm[2], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;// Ref2 = C - D&lt;/P&gt;&lt;P&gt;ippiSub_32f_C1R(pIm[3], Width*sizeof(float),pIm[4], Width*sizeof(float),pIm[5], 
Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;// Ref1 = Sqr(Ref1), Ref2 = Sqr(Ref2), both in place&lt;/P&gt;&lt;P&gt;/*&lt;/P&gt;&lt;P&gt;ippiSqr_32f_C1IR(pIm[2], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;ippiSqr_32f_C1IR(pIm[5], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;*/&lt;/P&gt;&lt;P&gt;ippiMul_32f_C1IR(pIm[2], Width*sizeof(float), pIm[2], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;ippiMul_32f_C1IR(pIm[5], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;// Result = Sqrt(Ref1 + Ref2), computed in place in pIm[5]&lt;/P&gt;&lt;P&gt;ippiAdd_32f_C1IR(pIm[2], Width*sizeof(float), pIm[5], Width*sizeof(float), iSize);&lt;/P&gt;&lt;P&gt;ippsSqrt_32f_I(pIm[5],Width*Height);&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;clock_t tEnd = clock();&lt;/P&gt;&lt;P&gt;double tSec = double(tEnd-tStart)/CLOCKS_PER_SEC;&lt;/P&gt;&lt;P&gt;double tMsec = 1000.0*tSec / NumRepeats;&lt;/P&gt;&lt;P&gt;cout &amp;lt;&amp;lt; NumRepeats &amp;lt;&amp;lt;" Iterations in " &amp;lt;&amp;lt; tSec &amp;lt;&amp;lt; " seconds (" &amp;lt;&amp;lt; tMsec &amp;lt;&amp;lt; " msec per iteration)" &amp;lt;&amp;lt; endl; &lt;/P&gt;&lt;P&gt;getchar();&lt;/P&gt;&lt;P&gt;for (int i = 0 ; i &amp;lt; 9 ; i++)&lt;/P&gt;&lt;P&gt;delete[] pIm[i];&lt;/P&gt;&lt;P&gt;return 0;&lt;/P&gt;&lt;P&gt;}&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Ivan&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2012 09:21:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765924#M207</guid>
      <dc:creator>Ivan_Z_Intel</dc:creator>
      <dc:date>2012-01-11T09:21:30Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765925#M208</link>
      <description>Thanks for your effort Ivan, this is very interesting. It shows that the speedup from multiple cores is highly dependent on the use of the processor's memory cache.&lt;BR /&gt;&lt;BR /&gt;It also showed me that different IPP functions that do basically the same thing can take significantly less time to execute (ippsSqrt_32f_I() compared to ippiSqrt_32f_C1R(), for example).&lt;BR /&gt;&lt;BR /&gt;I'll experiment with processing our images in small blocks that can fit in the cache.&lt;BR /&gt;&lt;BR /&gt;Paul G.&lt;BR /&gt;</description>
      <pubDate>Wed, 11 Jan 2012 18:56:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765925#M208</guid>
      <dc:creator>paulsgauthier</dc:creator>
      <dc:date>2012-01-11T18:56:39Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765926#M209</link>
      <description>Paul,&lt;BR /&gt;&lt;BR /&gt;There are several problems:&lt;BR /&gt;&lt;BR /&gt;1) Your code uses 9 images (memory buffers) of about 5 MB each. The operations (add, sub, sqr, sqrt) are very simple, but for each you need to perform 1 load and 1 store, so all the load is concentrated on the memory bus. As you know, you can't speed up a "copy" operation with multiple threads, because you have only one memory bus. This means that you should optimize your code for cache size and data reuse, and therefore you should process the data in rather small slices.&lt;BR /&gt;&lt;BR /&gt;2) ippiSqr is not threaded, which is why Ivan used Mul (which is threaded) instead of Sqr. Imagine that Sub is threaded, so the work is divided between 2 CPUs and the data between their caches. Then you call Sqr: it is not threaded, so all the data is processed by 1 CPU, and all the data in the cache of the 2nd CPU must be transferred to the cache of the 1st CPU. Then you perform the Add operation: it is threaded, which means that all the data must again be spread between the 2 caches.&lt;BR /&gt;&lt;BR /&gt;3) ippiSqrt is marked as threaded, but it is not fully so. The 2D Sqrt is based on the 1D Sqrt (row by row) and has no special 2D threading. Only the 1D Sqrt is threaded, and it has an internal criterion of 4K, so it is threaded only for vectors &amp;gt;= 4K. This is why Ivan called the 1D Sqrt directly: to guarantee that the threaded code runs.&lt;BR /&gt;&lt;BR /&gt;4) For your case the best approach (from the performance point of view) is to restructure your code as a row-by-row loop, link with the non-threaded static IPP lib, and put #pragma omp parallel for before the loop. Threading at the primitive level is not as efficient as threading at the application level. This is why we are promoting DMIP and are going to remove (deprecate) threading at the primitive level in IPP 8.0.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Igor</description>
      <pubDate>Thu, 12 Jan 2012 08:42:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765926#M209</guid>
      <dc:creator>igorastakhov</dc:creator>
      <dc:date>2012-01-12T08:42:16Z</dc:date>
    </item>
    <item>
      <title>External Multi-threading not working for IPPI functions</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765927#M210</link>
      <description>Igor,&lt;BR /&gt;&lt;BR /&gt;Yes, I agree that disabling the IPP threading and doing the threading in our own application is the better approach for us with our big images. When I do that for one of our functions I get a minor speedup (5%) with the two processors in my test system. Since both external threads are calling the same IPPI functions to process different slices of the same image, I think the cache limitations come into play.&lt;BR /&gt;&lt;BR /&gt;Thanks for your help.&lt;BR /&gt;&lt;BR /&gt;Paul</description>
      <pubDate>Thu, 12 Jan 2012 18:02:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/External-Multi-threading-not-working-for-IPPI-functions/m-p/765927#M210</guid>
      <dc:creator>paulsgauthier</dc:creator>
      <dc:date>2012-01-12T18:02:03Z</dc:date>
    </item>
  </channel>
</rss>

