<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Jim, in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977196#M5635</link>
    <description>&lt;P&gt;&amp;nbsp;Jim,&lt;/P&gt;
&lt;P&gt;I am also interested in understanding the difference in performance, but I am doubtful about the local-cache/local-NUMA-pages explanation because:&lt;/P&gt;
&lt;P&gt;1) The amount of data is 512*512*1024 elements * 8 bytes * 4 arrays = 8 GB (four real(8) arrays), which is much greater than the combined L3 cache of two 8-core Xeons (~40 MB)&lt;/P&gt;
&lt;P&gt;2) When I modified the code to run the processing loop (line 35) twice, the run time was identical for both runs. That holds with either parallel or serial initialization. If the cache hit ratio were the issue, the second run would have been faster than the first.&lt;/P&gt;
&lt;P&gt;3) Also, I eliminated the NUMA hypothesis by using 16 threads and KMP_AFFINITY=compact (my system is 2-socket and has 32 logical cores). With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket. However, when I run this code with multithreaded initialization, I get faster processing than with serial initialization.&lt;/P&gt;
&lt;P&gt;Andrey&lt;/P&gt;
&lt;P&gt;&amp;nbsp;P.S.: Ronglin, if you do not declare the loop index "idx" as PRIVATE, the overall performance increases&lt;/P&gt;</description>
    <pubDate>Fri, 16 Aug 2013 19:21:47 GMT</pubDate>
    <dc:creator>Andrey_Vladimirov</dc:creator>
    <dc:date>2013-08-16T19:21:47Z</dc:date>
    <item>
      <title>Poor openmp performance</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977194#M5633</link>
      <description>&lt;P&gt;We have E5-2670 * 2, 16 cores in total.&lt;BR /&gt;We get the OpenMP performance as follows (the code is also attached below):&lt;/P&gt;
&lt;P&gt;NUM THREADS: 1, Time: 1.53331303596497&lt;BR /&gt;NUM THREADS: 2, Time: 0.793078899383545&lt;BR /&gt;NUM THREADS: 4, Time: 0.475617885589600&lt;BR /&gt;NUM THREADS: 8, Time: 0.478277921676636&lt;BR /&gt;NUM THREADS: 14, Time: 0.479882955551147&lt;BR /&gt;NUM THREADS: 16, Time: 0.499575138092041&lt;/P&gt;
&lt;P&gt;OK, this scaling is very poor when the thread number is larger than 4. But if I uncomment lines 17 and 24, so that the initialization is also done by OpenMP, the results are different:&lt;/P&gt;
&lt;P&gt;NUM THREADS: 1, Time: 1.41038393974304&lt;BR /&gt;NUM THREADS: 2, Time: 0.723496913909912&lt;BR /&gt;NUM THREADS: 4, Time: 0.386450052261353&lt;BR /&gt;NUM THREADS: 8, Time: 0.211269855499268&lt;BR /&gt;NUM THREADS: 14, Time: 0.185739994049072&lt;BR /&gt;NUM THREADS: 16, Time: 0.214301824569702&lt;/P&gt;
&lt;P&gt;Why are the performances so different?&lt;/P&gt;
&lt;P&gt;Some information:&lt;BR /&gt;ifort version 13.1.0&lt;BR /&gt;ifort -warn -openmp -vec-report=4 openmp.f90&lt;/P&gt;
&lt;P&gt;[fortran]&lt;/P&gt;
&lt;PRE&gt;PROGRAM OMPTEST
    use omp_lib
    !use mpi
    implicit none
    integer(4), parameter :: nx = 512, ny = 512, nz = 1024
    integer(4) :: ip, np, idx, nTotal = nx * ny * nz
    real(8) :: time, dx, dy, dz, bstore
    real(8), dimension(:), allocatable :: bx, ey, ez, hx
!------------------------------------------------------------------------------|
!   initial
!------------------------------------------------------------------------------|
    dx = 0.3; dy = 0.4; dz = 0.5
    allocate(bx(nTotal))
    allocate(ey(nTotal))
    allocate(ez(nTotal))
    allocate(hx(nTotal))
!   !$OMP PARALLEL DO PRIVATE(idx)
    do idx = 1, nTotal
       bx(idx) = idx
       ey(idx) = idx * 2
       ez(idx) = idx / 2
       hx(idx) = idx + 1
    enddo
!   !$OMP END PARALLEL DO
!------------------------------------------------------------------------------|
!   start
!------------------------------------------------------------------------------|
    time = omp_get_wtime()
    !$OMP PARALLEL PRIVATE(ip, bstore, idx)
    !$OMP MASTER
    np = omp_get_num_threads()
    !$OMP END MASTER
    ip = omp_get_thread_num()
    !$OMP DO
    do idx = 1, nTotal - 1
        bstore = bx(idx)
        bx(idx) = 2.0 * ((ey(idx + 1) - ey(idx)) / dz -                       &amp;amp;
                  (ez(idx + 1) - ez(idx)) / dy)
        bx(idx) = 1.0 * bx(idx) + 2.0 * ((ey(idx + 1) - ey(idx)) / dz -      &amp;amp;
                  (ez(idx + 1) - ez(idx)) / dy)
        hx(idx) = 3.0 * hx(idx) + 4.0 * (5.0 * bx(idx) - 6.0 * bstore)
    end do
    !$OMP END DO
    !$OMP END PARALLEL
!------------------------------------------------------------------------------|
!   end
!------------------------------------------------------------------------------|
    print*, "NUM THREADS:", np
    print*, "Time: ", omp_get_wtime() - time
    print*, "Result:", sum(hx)
    deallocate(bx, ey, ez, hx)
end&lt;/PRE&gt;
&lt;P&gt;[/fortran]&lt;/P&gt;</description>
      <pubDate>Fri, 16 Aug 2013 03:53:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977194#M5633</guid>
      <dc:creator>Ronglin__J_</dc:creator>
      <dc:date>2013-08-16T03:53:34Z</dc:date>
    </item>
    <item>
      <title>The reason for the difference</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977195#M5634</link>
      <description>&lt;P&gt;The reason for the difference is that when the first loop is parallelized, the iteration space 1:nTotal is partitioned among the threads in the thread team. The same is true for the second loop's iteration space 1:nTotal-1. In the first loop, the sub-ranges of bx(idx), ey(idx), ez(idx) and hx(idx) written by a given thread are written not only to the RAM locations of the arrays but also into the caches of the corresponding thread, and are read from there by the second loop (due to the same partitioning). IOW, the second loop has a higher probability of cache hits. Also, if your system BIOS is configured as NUMA, and if the runtime is set up for "first touch" placement, then at page-level granularity the pages "touched" (written) by the first loop will reside in the RAM attached (nearer) to the socket of the thread that first touches a given page. Locations subsequently referenced by the second loop that are not in a cache will then have faster RAM access (because they reside in the RAM directly attached to the CPU on which the thread runs).&lt;/P&gt;
&lt;P&gt;Your program is an excellent example of why one should parallelize the initialization of data in the same manner as the subsequent processing of the data.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 16 Aug 2013 14:58:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977195#M5634</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2013-08-16T14:58:00Z</dc:date>
    </item>
    <item>
      <title> Jim,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977196#M5635</link>
      <description>&lt;P&gt;&amp;nbsp;Jim,&lt;/P&gt;
&lt;P&gt;I am also interested in understanding the difference in performance, but I am doubtful about the local-cache/local-NUMA-pages explanation because:&lt;/P&gt;
&lt;P&gt;1) The amount of data is 512*512*1024 elements * 8 bytes * 4 arrays = 8 GB (four real(8) arrays), which is much greater than the combined L3 cache of two 8-core Xeons (~40 MB)&lt;/P&gt;
&lt;P&gt;2) When I modified the code to run the processing loop (line 35) twice, the run time was identical for both runs. That holds with either parallel or serial initialization. If the cache hit ratio were the issue, the second run would have been faster than the first.&lt;/P&gt;
&lt;P&gt;3) Also, I eliminated the NUMA hypothesis by using 16 threads and KMP_AFFINITY=compact (my system is 2-socket and has 32 logical cores). With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket. However, when I run this code with multithreaded initialization, I get faster processing than with serial initialization.&lt;/P&gt;
&lt;P&gt;Andrey&lt;/P&gt;
&lt;P&gt;&amp;nbsp;P.S.: Ronglin, if you do not declare the loop index "idx" as PRIVATE, the overall performance increases&lt;/P&gt;</description>
      <pubDate>Fri, 16 Aug 2013 19:21:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977196#M5635</guid>
      <dc:creator>Andrey_Vladimirov</dc:creator>
      <dc:date>2013-08-16T19:21:47Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;With OMP_NUM_THREADS=16 and</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977197#M5636</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket&lt;/P&gt;
&lt;P&gt;Have you verified this? The behavior seems contradictory. The "only" difference, assuming the same socket for all threads, would be whether the non-master threads began the timed region in an expired KMP_BLOCKTIME state.&lt;/P&gt;
&lt;P&gt;Have you run the timed loop several times under VTune to see what is going on? (Set the loop count so the timed region runs for about 15-30 seconds, to get a meaningful statistical sample.)&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 16 Aug 2013 21:09:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977197#M5636</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2013-08-16T21:09:53Z</dc:date>
    </item>
    <item>
      <title>openmp makes the parallel</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977198#M5637</link>
      <description>&lt;P&gt;OpenMP makes the parallel loop index private by default. To take advantage of first-touch locality you will need thread affinity set. For one thread per core with HyperThreading enabled, you might set KMP_AFFINITY=compact,1,1.&lt;/P&gt;</description>
      <pubDate>Sat, 17 Aug 2013 12:34:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977198#M5637</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-08-17T12:34:15Z</dc:date>
    </item>
    <item>
      <title>To Jim: I realized that the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977199#M5638</link>
      <description>&lt;P&gt;To Jim: I realized that the initialization is also important for improving the efficiency of OMP parallelization. Thank you.&lt;/P&gt;
&lt;P&gt;To Andrey and TimP: I set KMP_AFFINITY=compact, but the results seem even worse.&lt;/P&gt;
&lt;P&gt;Thank you all for your replies.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2013 03:17:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Poor-openmp-performance/m-p/977199#M5638</guid>
      <dc:creator>Ronglin__J_</dc:creator>
      <dc:date>2013-08-19T03:17:10Z</dc:date>
    </item>
  </channel>
</rss>

