<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Slow down when running multiple threads with exact same algori in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814771#M1148</link>
    <description>640x480 (x2 for shorts) = 614,400 bytes for the image, plus your NxN and (N+M+1)x(N+M+1) w/ N=M=5&lt;BR /&gt;&lt;BR /&gt;~600KB&lt;BR /&gt;&lt;BR /&gt;L2 cache size is 256KB&lt;BR /&gt;&lt;BR /&gt;The 4-tile 640x480 spills out of L2&lt;BR /&gt;&lt;BR /&gt;Try 16-tile 320x240 (x2 for shorts) = ~150KB&lt;BR /&gt;This leaves ~100KB for other data.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
    <pubDate>Thu, 09 Feb 2012 18:37:12 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2012-02-09T18:37:12Z</dc:date>
    <item>
      <title>Slow down when running multiple threads with exact same algorithm</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814764#M1141</link>
      <description>I am using a dual-CPU Xeon X5680 (3.33 GHz, 6 cores per CPU) with Windows 7 64-bit and 12 GB RAM.&lt;BR /&gt;&lt;BR /&gt;I am running a filter on an image of size 640x480.&lt;BR /&gt;Applying the filter to the image with a single thread takes 6.5 ms.&lt;BR /&gt;With 2 threads, each with its own image (same size, 640x480) and data structures, I get 7.4 ms.&lt;BR /&gt;The performance keeps getting worse as I increase the number of threads: 20 ms for 6 threads.&lt;BR /&gt;The same goes for the performance when using 7-12 threads.&lt;BR /&gt;In the case of 7 threads, the threads bound to the first CPU (each to its own core) take 20 ms, and the single one bound to core 7 (2nd CPU) runs at 7 ms.&lt;BR /&gt;&lt;BR /&gt;Overall, running 6/12 threads I get a 3x slowdown.&lt;BR /&gt;I know there should be a certain slowdown, but 3x is a huge slowdown...</description>
      <pubDate>Tue, 07 Feb 2012 11:17:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814764#M1141</guid>
      <dc:creator>gilgil</dc:creator>
      <dc:date>2012-02-07T11:17:19Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814765#M1142</link>
      <description>Hello,&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;What kind of parallelization technique do you use (OpenMP, TBB, ...)?&lt;/DIV&gt;&lt;DIV&gt;You say that single-thread processing takes 6.5 ms. Isn't that already fast (joke)?&lt;/DIV&gt;&lt;DIV&gt;Seriously, parallelization usually has overhead introduced by its runtime and/or the operating system, and small tasks do not benefit from it even if they are perfectly parallelizable. This is probably your case.&lt;/DIV&gt;&lt;DIV&gt;Could you make the data bigger to check this?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Alex&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 07 Feb 2012 12:49:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814765#M1142</guid>
      <dc:creator>Alexander_C_Intel</dc:creator>
      <dc:date>2012-02-07T12:49:32Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814766#M1143</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1328623247593="55" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=328035" href="https://community.intel.com/en-us/profile/328035/" class="basic"&gt;gilgil&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;I am using xeon x5680 3.33GHZ dual cpu with 6 cores each, windows 7 64-bit 12GB ram.&lt;BR /&gt;&lt;BR /&gt;I am running a &lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;filter on an image of size 640X480&lt;/SPAN&gt;&lt;/STRONG&gt;.&lt;BR /&gt;Using single thread to apply the filter on the image results 6.5 ms run time. &lt;BR /&gt;Moving to 2 threads each has its own image (same size - 640X480) and data structures I get 7.4ms.&lt;BR /&gt;...&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;What kind of filtering are you doing? Are you using &lt;STRONG&gt;IPP&lt;/STRONG&gt;?&lt;BR /&gt;&lt;BR /&gt;An image of size &lt;STRONG&gt;640x480&lt;/STRONG&gt; is so small that it is hard to believe in any performance gain from switching to more than one thread to process it. Don't forget about thread context switches, because they don't happen instantly.&lt;BR /&gt;&lt;BR /&gt;I would use a different technique to increase processing performance, namely a &lt;STRONG&gt;priority boost&lt;/STRONG&gt; to &lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;high&lt;/SPAN&gt;&lt;/STRONG&gt; for the thread that does the image processing.&lt;BR /&gt;&lt;BR /&gt;You need to use &lt;STRONG&gt;VTune&lt;/STRONG&gt; to analyze why the slowdown happens.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Feb 2012 14:20:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814766#M1143</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-02-07T14:20:19Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814767#M1144</link>
      <description>&amp;gt;&amp;gt;Using single thread to apply the filter on the image results 6.5 ms run time&lt;BR /&gt;&lt;BR /&gt;6.5 ms/frame&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;Moving to 2 threads each has its own image (same size - 640X480) and data structures I get 7.4ms&lt;BR /&gt;&lt;BR /&gt;3.7 ms/frame (7.4/2)&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;20ms for 6 threads&lt;BR /&gt;&lt;BR /&gt;3.3 ms/frame (20/6)&lt;BR /&gt;&lt;BR /&gt;This shows you have a scaling problem when running with more than 2 threads.&lt;BR /&gt;&lt;BR /&gt;The scaling issue may be due to one or more of a few possibilities:&lt;BR /&gt;&lt;BR /&gt;1) Your frames are stored in a file and processing is: Read, filter, Write (total time == frame rate)&lt;BR /&gt; The correction for this is to pipeline the process:&lt;BR /&gt; Read, filter, Write&lt;BR /&gt; Read, filter, Write&lt;BR /&gt; ...&lt;BR /&gt; If possible, try to place the input and output files on different drives (to eliminate some seeks)&lt;BR /&gt;&lt;BR /&gt;3) Your algorithm is not L1/L2 cache friendly. The correction is to rework your code so that it uses the L1 and L2 caches more effectively. L1/L2 are private per core (or per die on older CPUs); the last-level cache (L3) is shared. You can also rework your code so that each thread's working set fits within (L3 cache size / number of hardware threads sharing the L3).&lt;BR /&gt;&lt;BR /&gt;4) Consider setting up the system to run as NUMA. This will reduce some of the memory access latencies when re-reading the filter data.&lt;BR /&gt;&lt;BR /&gt;5) Your filter code is not effectively using SSE.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
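      <!--
The per-frame figures in the post above are wall-clock time divided by the number of frames processed at once. A minimal Python sketch of that arithmetic, assuming the timings first reported in the thread (they are corrected in a later post); the helper names `per_frame_ms` and `scaling_table` are hypothetical and only illustrate the calculation:

```python
def per_frame_ms(wall_ms, frames_at_once):
    """Average wall-clock time per frame when several frames are filtered concurrently."""
    return wall_ms / frames_at_once


def scaling_table(reported, baseline_ms):
    """Return (threads, ms_per_frame, speedup, efficiency) for each measurement."""
    rows = []
    for threads, wall_ms in reported:
        frame_ms = per_frame_ms(wall_ms, threads)
        speedup = baseline_ms / frame_ms      # relative to the single-thread run
        efficiency = speedup / threads        # 1.0 would be ideal linear scaling
        rows.append((threads, frame_ms, speedup, efficiency))
    return rows


# (threads, wall-clock ms per iteration) as first reported by the original poster
reported = [(1, 6.5), (2, 7.4), (6, 20.0)]

for threads, frame_ms, speedup, efficiency in scaling_table(reported, baseline_ms=6.5):
    print(f"{threads} threads: {frame_ms:.2f} ms/frame, "
          f"speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

An efficiency close to 1.0 means the extra threads are paying off; a value that drops sharply as threads are added points to a shared resource (cache, memory bandwidth, I/O) becoming the bottleneck.
      -->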
      <pubDate>Tue, 07 Feb 2012 16:33:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814767#M1144</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2012-02-07T16:33:25Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814768#M1145</link>
      <description>Thanks for all the replies.&lt;BR /&gt;I mistakenly wrote that I use 6/12 cores and get 20 ms per iteration.&lt;BR /&gt;Those numbers are actually for using 4 of the 6 cores in each CPU.&lt;BR /&gt;For 6/12 cores the performance is even worse: 29 ms.&lt;BR /&gt;&lt;BR /&gt;The filter is a variation on non-local means (NLM) and, besides the input image, it uses 7 additional buffers. All the buffers are of type short and are of the same size.&lt;BR /&gt;&lt;BR /&gt;I do not use IPP but SSE code (SSE4.2), and it is highly optimized.&lt;BR /&gt;&lt;BR /&gt;I do not perform any I/O operations besides the initial read. Since the image changes on each iteration (it is blurred more and more), I use it as both the input and the output of the algorithm.&lt;BR /&gt;The NLM algorithm requires many reads per pixel: NxN for the kernel and (M+N+1)x(M+N+1) for the search area.&lt;BR /&gt;In my case both N and M are 5. Reducing both of them to 3 gives a small improvement in single thread (5.9 ms), yet running on 4 cores per CPU gives the same result: 20 ms.&lt;BR /&gt;&lt;BR /&gt;I am going to check it with smaller images to verify that it is a cache problem.&lt;BR /&gt;</description>
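      <!--
To get a feel for "many reads per pixel", the two window sizes in the post above can be tabulated for the N and M values mentioned. This is only a sketch: the function name `nlm_window_sizes` is hypothetical, and the combined figure assumes a naive implementation that re-reads a full NxN patch at every search position, which the post does not confirm.

```python
def nlm_window_sizes(n, m):
    """Return (kernel_reads, search_positions, naive_total) for one output pixel.

    kernel_reads:     N x N patch, as stated in the post
    search_positions: (M + N + 1) x (M + N + 1) search area, as stated in the post
    naive_total:      ASSUMPTION, one full patch read per search position
    """
    kernel_reads = n * n
    search_positions = (m + n + 1) ** 2
    return kernel_reads, search_positions, kernel_reads * search_positions


for n, m in [(5, 5), (3, 3)]:
    kernel, search, total = nlm_window_sizes(n, m)
    print(f"N={n}, M={m}: kernel {kernel}, search area {search}, naive total {total}")
```

An optimized SSE implementation will typically reuse loaded data across neighbouring positions, so the real number of memory accesses can be far lower than the naive total.
      -->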
      <pubDate>Wed, 08 Feb 2012 08:12:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814768#M1145</guid>
      <dc:creator>gilgil</dc:creator>
      <dc:date>2012-02-08T08:12:39Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814769#M1146</link>
      <description>Array of shorts, N=5 or 3, M=5 or 3&lt;BR /&gt;&lt;BR /&gt;Consider using 8 or 4.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;single thread -5.9ms, yet running on 4 cores per cpu gives the same results - 20ms&lt;BR /&gt;&lt;BR /&gt;Does each thread do all the work or 1/4 of the work?&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Wed, 08 Feb 2012 14:48:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814769#M1146</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2012-02-08T14:48:52Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814770#M1147</link>
      <description>The whole work...&lt;BR /&gt;The image is originally 4 times bigger (2x in width and 2x in height). I divide it into 4 quarters of 640x480.</description>
      <pubDate>Thu, 09 Feb 2012 08:26:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814770#M1147</guid>
      <dc:creator>gilgil</dc:creator>
      <dc:date>2012-02-09T08:26:47Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814771#M1148</link>
      <description>640x480 (x2 for shorts) = 614,400 bytes for the image, plus your NxN and (N+M+1)x(N+M+1) w/ N=M=5&lt;BR /&gt;&lt;BR /&gt;~600KB&lt;BR /&gt;&lt;BR /&gt;L2 cache size is 256KB&lt;BR /&gt;&lt;BR /&gt;The 4-tile 640x480 spills out of L2&lt;BR /&gt;&lt;BR /&gt;Try 16-tile 320x240 (x2 for shorts) = ~150KB&lt;BR /&gt;This leaves ~100KB for other data.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
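      <!--
The cache-sizing arithmetic in the post above can be checked with a few lines of Python. This is a sketch under the post's own figures (256 KB of L2 per core, 2 bytes per pixel for shorts); the names `buffer_bytes` and `l2_headroom` are hypothetical helpers, not part of any library.

```python
L2_BYTES = 256 * 1024  # per-core L2 size quoted in the post


def buffer_bytes(width, height, bytes_per_pixel=2):
    """Size of one image buffer; 2 bytes per pixel for 16-bit shorts."""
    return width * height * bytes_per_pixel


def l2_headroom(width, height, buffers=1):
    """Bytes of L2 left after `buffers` tile-sized buffers (negative means it spills)."""
    return L2_BYTES - buffers * buffer_bytes(width, height)


for w, h in [(640, 480), (320, 240)]:
    size = buffer_bytes(w, h)
    print(f"{w}x{h}: {size} bytes ({size / 1024:.0f} KiB), "
          f"L2 headroom {l2_headroom(w, h) / 1024:.0f} KiB")
```

Passing `buffers=8` (the input image plus the 7 additional same-size buffers mentioned earlier in the thread) shows how much the working set grows when every buffer is counted, which may call for tiles smaller than 320x240.
      -->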
      <pubDate>Thu, 09 Feb 2012 18:37:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814771#M1148</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2012-02-09T18:37:12Z</dc:date>
    </item>
    <item>
      <title>Slow down when running multiple threads with exact same algori</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814772#M1149</link>
      <description>&lt;DIV&gt;&lt;SPAN style="font-family: Verdana, Arial, Helvetica, sans-serif;"&gt;It looks like either your threads write to memory within the same cache line, or the computation chunks are too small and there is a large synchronization overhead.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="font-family: Verdana, Arial, Helvetica, sans-serif;"&gt;Which technique did you use to parallelize?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="font-family: Verdana, Arial, Helvetica, sans-serif;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="font-family: Verdana, Arial, Helvetica, sans-serif;"&gt;--Vladimir&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="font-family: Verdana, Arial, Helvetica, sans-serif;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 13 Feb 2012 07:21:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Slow-down-when-runnning-multiple-threads-with-exact-same/m-p/814772#M1149</guid>
      <dc:creator>Vladimir_P_1234567890</dc:creator>
      <dc:date>2012-02-13T07:21:35Z</dc:date>
    </item>
  </channel>
</rss>

