<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Can you attach your test in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961658#M21948</link>
    <description>&lt;P&gt;Can you attach your test program? (include any environment variables relating to KMP_... and OMP_.)&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Sat, 29 Mar 2014 13:13:07 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2014-03-29T13:13:07Z</dc:date>
    <item>
      <title>Achieving peak on Xeon Phi</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961651#M21941</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I am on a Core i7 quad-core machine with an ASUS P9X79 WS motherboard and a Xeon Phi 3120A card installed.&lt;/P&gt;

&lt;P&gt;The operating system is RHEL 6.4, with MPSS 3.1 for the Phi and Parallel Studio 2013 SP1 installed.&lt;/P&gt;

&lt;P&gt;Just for detail, the Phi card has 57 cores, with a double-precision peak of about 1003 GFlops.&lt;/P&gt;

&lt;P&gt;I am seeing some performance issues that I don't understand.&lt;/P&gt;

&lt;P&gt;When I time MKL's parallel DGEMM on the Phi card, it gets about 300 GFlops, which is roughly 30% of peak.&lt;/P&gt;

&lt;P&gt;Note that I am doing native execution.&lt;/P&gt;

&lt;P&gt;Now this performance is not matching what is posted here &lt;A href="http://software.intel.com/en-us/intel-mkl/" target="_blank"&gt;http://software.intel.com/en-us/intel-mkl/&lt;/A&gt; (achieving about 80% of peak).&lt;/P&gt;

&lt;P&gt;So, my first question is: is this difference solely because I am using a low-end Phi card, so there are limitations?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;After seeing this, I wrote a test program that tries to achieve peak with assembly language.&lt;/P&gt;

&lt;P&gt;The function is simple. It runs a loop that iterates 25,000,000 times, and in each iteration I do 30 independent FMA instructions, unrolled 8 times. So the total flop count in each iteration is 30 x 8 x 2 x 8 = 3840 (30 FMAs, 8-way unroll, 2 flops per FMA, 8 double-precision lanes). Note that this means I am doing 25,000,000 x 3840 floating point operations without accessing any memory.&lt;/P&gt;

&lt;P&gt;Now if I run this code serially, I get 8.74 Gflops which is basically the serial peak (8.8 Gflops).&lt;/P&gt;

&lt;P&gt;If I run this code in parallel with 2 threads on 1 core, I get 17.4 Gflops which is basically the peak for 1 core (17.6 Gflops).&lt;/P&gt;

&lt;P&gt;Now the problem is, if I run the same code in parallel with 2 threads per core on 56 cores (112 threads), I only get 89% of peak.&lt;/P&gt;

&lt;P&gt;But if I run it with 4 threads per core i.e. a total of 224 threads, I get 99% of peak, which is what I expect.&lt;/P&gt;

&lt;P&gt;So, my second question is, even when I have no memory access at all, why did I need 4 threads to achieve peak?&lt;/P&gt;

&lt;P&gt;Is there any other latency that we don't know about that gets hidden by 4 threads per core?&lt;/P&gt;

&lt;P&gt;Can someone please clarify?&lt;/P&gt;

&lt;P&gt;Sorry for the long post, and thank you for reading.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2014 22:15:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961651#M21941</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-26T22:15:55Z</dc:date>
    </item>
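The flop counting and peak figures quoted in the opening post can be sanity-checked with a little arithmetic. A minimal sketch, assuming the 3120A's published 1.1 GHz clock, 8 double-precision lanes per 512-bit vector, and an FMA counting as 2 flops:

```python
# Sanity check of the numbers quoted above (assumptions: 1.1 GHz clock,
# 8 DP lanes per 512-bit vector, FMA = 2 flops, 57 cores on a 3120A).
flops_per_iter = 30 * 8 * 2 * 8       # 30 FMAs x 8-way unroll x 2 flops x 8 lanes
total_flops = 25_000_000 * flops_per_iter

ghz = 1.1
core_peak = ghz * 8 * 2               # one full-width FMA per cycle: 17.6 GFlops/core
thread_peak = core_peak / 2           # one thread issues every other cycle: 8.8 GFlops
card_peak = 57 * core_peak            # about 1003 GFlops double precision

print(flops_per_iter, total_flops, thread_peak, card_peak)
```

These reproduce the 3840 flops per iteration, the 8.8/17.6 GFlops serial and per-core peaks, and the roughly 1003 GFlops card peak stated in the post.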
    <item>
      <title>What are the matrix sizes you</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961652#M21942</link>
      <description>&lt;P&gt;What are the matrix sizes you used in&amp;nbsp;&lt;SPAN style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 18px;"&gt;DGEMM? MKL in PHi gets modest performance for small and skinny matrices.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 18px;"&gt;Regarding your second question, I think Phi needs to run 4 threads per core to compensate the in-order instruction execution fashion.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 01:22:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961652#M21942</guid>
      <dc:creator>JS1</dc:creator>
      <dc:date>2014-03-27T01:22:00Z</dc:date>
    </item>
    <item>
      <title>Thank you for replying.</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961653#M21943</link>
      <description>&lt;P&gt;Thank you for replying.&lt;/P&gt;

&lt;P&gt;Sorry, I forgot to mention that. I ran a square DGEMM with N = 10,000.&lt;/P&gt;

&lt;P&gt;For the second question, I think they mention that you need at least 2 threads per core because 1 thread can issue an instruction every other cycle.&lt;/P&gt;

&lt;P&gt;Since I have 30 consecutive FMA instructions that are independent, shouldn't it be enough to saturate the FPU pipe?&lt;/P&gt;

&lt;P&gt;And since we have enough independent instructions, in-order execution shouldn't be an issue. Should it?&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 01:38:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961653#M21943</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-27T01:38:45Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;Since I have 30 consecutive</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961654#M21944</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;Since I have 30 consecutive FMA instructions that are independent, shouldn't it be enough to saturate the FPU pipe?&lt;/P&gt;

&lt;P&gt;Yes, but you also have the store (to register) portion of time for the intrinsic function. With two threads the FPU might not be fully utilized. What does 3 threads per core yield?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 12:39:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961654#M21944</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-03-27T12:39:29Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961655#M21945</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Please note that I am not using intrinsics. I wrote a function purely in assembly. I am calling this function from a C code where I am timing it.&lt;/P&gt;

&lt;P&gt;As for the pipeline, I verified that FMA instructions have a latency of 4 cycles.&lt;/P&gt;

&lt;P&gt;So each thread only needs 2 independent FMA instructions in flight; with 2 threads that makes 4 independent instructions, enough to fully utilize the FPU.&lt;/P&gt;

&lt;P&gt;Also note that, if I weren't fully utilizing the FPU pipe, I wouldn't get the peak for 2 threads with just 1 core.&lt;/P&gt;

&lt;P&gt;My original question is: why can we get peak with 2 threads when using 1 core, but not when using all 56 cores?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;As for your question, right now with the same code I am getting 65% of peak with 2 threads/core, 84% with 3 threads/core, and 99% with 4 threads/core.&lt;/P&gt;

&lt;P&gt;This is my other concern: sometimes I see inconsistent performance on the Phi.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 21:35:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961655#M21945</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-27T21:35:00Z</dc:date>
    </item>
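The latency-hiding argument in this exchange can be made concrete with a small calculation. A sketch, taking the 4-cycle FMA latency and the issue-every-other-cycle rule from the posts above as given:

```python
# How many independent FMAs keep one core's FPU pipe full, given the
# figures discussed above (4-cycle FMA latency, issue every other cycle).
fma_latency_cycles = 4        # cycles before an FMA result can be reused
issue_interval = 2            # one thread can issue only every other cycle

# In-flight FMAs one thread must sustain to cover the latency:
per_thread_in_flight = fma_latency_cycles // issue_interval   # 2

# With 2 threads per core, 2 x 2 = 4 independent FMAs are in flight,
# exactly covering the 4-cycle latency:
threads_per_core = 2
assert threads_per_core * per_thread_in_flight == fma_latency_cycles
```

So on paper 2 threads per core with 2 independent FMAs each should suffice, which is exactly why the 89% result on 56 cores is surprising.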
    <item>
      <title>Hi Rakib,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961656#M21946</link>
      <description>&lt;P&gt;Hi Rakib,&lt;/P&gt;

&lt;P&gt;You said that you were getting 100% of peak on 1 core using 2 threads, but that you drop to 89% when running 2 threads on each of 56 cores -- can you run a scaling test to see how many cores you have to use before you start seeing this problem?&lt;/P&gt;

&lt;P&gt;In the 112 thread case, you might be seeing some interference from the operating system.&amp;nbsp; I have seen the OS run stuff on cores where I am trying to do compute work, even when I leave other cores open.&amp;nbsp;&amp;nbsp; It is possible that using all four threads on a core makes it less likely that the OS will schedule on that core -- all the logical processors are busy.&amp;nbsp; But if you only run two threads on a core the OS might think "hey, there are two logical processors here that I can use to run this service process!".&lt;/P&gt;

&lt;P&gt;To track this down, I would time each thread independently and see if the slow-down comes from a small number of cores.&amp;nbsp; VTune might be able to identify instances where cores run something other than your executable.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 22:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961656#M21946</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-03-27T22:12:00Z</dc:date>
    </item>
    <item>
      <title>Hi John,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961657#M21947</link>
      <description>&lt;P&gt;Hi John,&lt;/P&gt;

&lt;P&gt;Thank you for your reply. That's exactly what I tried next: a scaling test. I didn't put the results here because they are not consistent. But here they are:&lt;/P&gt;

&lt;P&gt;With 2 threads per core, the following is the performance as I keep increasing the number of cores. Ellipses are used where the values are not changing much.&lt;/P&gt;

&lt;P&gt;Number of Cores&amp;nbsp;&amp;nbsp; &amp;nbsp;Peak Performance&lt;BR /&gt;
	1&amp;nbsp;&amp;nbsp; &amp;nbsp;99.38%&lt;BR /&gt;
	2&amp;nbsp;&amp;nbsp; &amp;nbsp;99.37%&lt;BR /&gt;
	3&amp;nbsp;&amp;nbsp; &amp;nbsp;99.35%&lt;BR /&gt;
	4&amp;nbsp;&amp;nbsp; &amp;nbsp;99.34%&lt;BR /&gt;
	5&amp;nbsp;&amp;nbsp; &amp;nbsp;99.31%&lt;BR /&gt;
	6&amp;nbsp;&amp;nbsp; &amp;nbsp;66.23%&lt;BR /&gt;
	...&lt;BR /&gt;
	14&amp;nbsp;&amp;nbsp; &amp;nbsp;66.17%&lt;BR /&gt;
	15&amp;nbsp;&amp;nbsp; &amp;nbsp;82.41%&lt;BR /&gt;
	16&amp;nbsp;&amp;nbsp; &amp;nbsp;81.92%&lt;BR /&gt;
	17&amp;nbsp;&amp;nbsp; &amp;nbsp;81.49%&lt;BR /&gt;
	18&amp;nbsp;&amp;nbsp; &amp;nbsp;99.13%&lt;BR /&gt;
	19&amp;nbsp;&amp;nbsp; &amp;nbsp;81.08%&lt;BR /&gt;
	20&amp;nbsp;&amp;nbsp; &amp;nbsp;80.67%&lt;BR /&gt;
	21&amp;nbsp;&amp;nbsp; &amp;nbsp;91.08%&lt;BR /&gt;
	22&amp;nbsp;&amp;nbsp; &amp;nbsp;80.51%&lt;BR /&gt;
	...&lt;BR /&gt;
	26&amp;nbsp;&amp;nbsp; &amp;nbsp;78.45%&lt;BR /&gt;
	27&amp;nbsp;&amp;nbsp; &amp;nbsp;98.98%&lt;BR /&gt;
	28&amp;nbsp;&amp;nbsp; &amp;nbsp;98.96%&lt;BR /&gt;
	29&amp;nbsp;&amp;nbsp; &amp;nbsp;98.92%&lt;BR /&gt;
	30&amp;nbsp;&amp;nbsp; &amp;nbsp;81.10%&lt;BR /&gt;
	...&lt;BR /&gt;
	55&amp;nbsp;&amp;nbsp; &amp;nbsp;82.54%&lt;BR /&gt;
	56&amp;nbsp;&amp;nbsp; &amp;nbsp;65.86%&lt;/P&gt;

&lt;P&gt;This is not repeatable. It varies randomly. But using 56 threads always gets poor performance.&lt;/P&gt;

&lt;P&gt;I will also put the per-thread timing in the next post.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Mar 2014 21:41:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961657#M21947</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-28T21:41:15Z</dc:date>
    </item>
    <item>
      <title>Can you attach your test</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961658#M21948</link>
      <description>&lt;P&gt;Can you attach your test program? (include any environment variables relating to KMP_... and OMP_.)&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 29 Mar 2014 13:13:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961658#M21948</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-03-29T13:13:07Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt; It is possible that using</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961659#M21949</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;nbsp;It is possible that using all four threads on a core makes it less likely that the OS will schedule on that core -- all the logical processors are busy.&lt;/P&gt;

&lt;P&gt;A test of this hypothesis would be to schedule all four threads per core, and then run four tests:&lt;/P&gt;

&lt;P&gt;1 thread per core FMA, 3 threads per core loop of _mm_delay32(1000000)&lt;BR /&gt;
	2&amp;nbsp;threads per core FMA,&amp;nbsp;2 threads per core loop of _mm_delay32(1000000)&lt;BR /&gt;
	3&amp;nbsp;threads per core FMA,&amp;nbsp;1 thread per core loop of _mm_delay32(1000000)&lt;BR /&gt;
	4&amp;nbsp;threads per core FMA,&amp;nbsp;0 threads per core loop of _mm_delay32(1000000)&lt;/P&gt;

&lt;P&gt;The O/S will view the delaying threads as compute bound. The delay loop will terminate upon seeing any of the compute threads complete (e.g. setting volatile flag).&lt;/P&gt;

&lt;P&gt;The permutations could be made in one run of the application, whereby an outer control loop permutes flags, one per logical processor, each flag indicating whether that thread runs the FMA kernel or the delay loop.&lt;/P&gt;

&lt;P&gt;Note, the _mm_delay32 will reduce power consumption for the core and, more importantly, nearly eliminate that thread's scheduling time slot within its core.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 29 Mar 2014 16:30:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961659#M21949</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-03-29T16:30:01Z</dc:date>
    </item>
    <item>
      <title>Hi Jim,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961660#M21950</link>
      <description>&lt;P&gt;Hi Jim,&lt;/P&gt;

&lt;P&gt;Please note that I am not using OpenMP. I am using pthreads and setting the affinity myself.&lt;/P&gt;

&lt;P&gt;Nice suggestion for the tests. I was trying to spawn the dummy threads and then just wait on a barrier. It didn't help.&lt;/P&gt;

&lt;P&gt;But with your suggestion of busy-waiting, it actually helps. I was able to achieve 98% of peak with 2 active threads and 2 dummy threads per core.&lt;/P&gt;

&lt;P&gt;So, to summarize the answer to my second question: even if 2 threads/core is best for an application in terms of cache, you still need to spawn 4 threads/core, with 2 on each core busy-waiting.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;For the first question, are you seeing similar performance (30% of peak) from MKL DGEMM on Phi?&lt;/P&gt;</description>
      <pubDate>Mon, 31 Mar 2014 19:07:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961660#M21950</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-31T19:07:44Z</dc:date>
    </item>
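The "2 compute threads + 2 busy-waiting placeholder threads per core" scheme that worked here can be sketched in Python for brevity. This is Linux-only; `core_cpus` and its simple core-to-CPU numbering are hypothetical, since on a real MIC card the logical-CPU layout (where CPU 0 traditionally belongs to the last core) should be checked against the OS:

```python
import os
import threading

def core_cpus(core):
    # Hypothetical mapping: 4 consecutive logical CPUs per physical core.
    return [core * 4 + t for t in range(4)]

done = threading.Event()

def compute(cpu):
    os.sched_setaffinity(0, {cpu})   # pin the calling thread (Linux)
    # ... run the FMA kernel here ...
    done.set()                       # signal placeholders to exit

def placeholder(cpu):
    os.sched_setaffinity(0, {cpu})
    while not done.is_set():         # busy-wait: the OS sees this CPU as busy
        pass

def spawn_for_core(core):
    # Per core: first 2 CPUs compute, last 2 run the busy-wait placeholder.
    cpus = core_cpus(core)
    threads = [threading.Thread(target=compute, args=(c,)) for c in cpus[:2]]
    threads += [threading.Thread(target=placeholder, args=(c,)) for c in cpus[2:]]
    return threads
```

In a real run the kernel would stay in C/pthreads as in the original test; Python's GIL makes it unsuitable for the compute itself, so this sketch only shows the pinning and role-splitting logic.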
    <item>
      <title>I do not typically use MKL in</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961661#M21951</link>
      <description>&lt;P&gt;I do not typically use MKL in my applications. What I do know is:&lt;/P&gt;

&lt;P&gt;When your program is single-threaded, use (link)&amp;nbsp;the multi-threaded MKL&lt;BR /&gt;
	When your program is multi-threaded, use (link) the single-threaded MKL&lt;/P&gt;

&lt;P&gt;Running a multi-threaded program with the multi-threaded MKL oversubscribes threads (each has its own thread pool).&lt;BR /&gt;
	If you do need this combination, e.g. MKL only called from outside the parallel regions of a multi-threaded application, then you may find you need to set KMP_BLOCKTIME=0.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2014 12:20:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961661#M21951</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-04-01T12:20:17Z</dc:date>
    </item>
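The linking advice above is usually paired with runtime controls. A hedged sketch of the two environment variables Jim mentions, which MKL and the Intel OpenMP runtime read before the first threaded call:

```python
import os

# Multi-threaded caller: restrict MKL to one thread per call site so each
# application thread does not spin up its own MKL worker pool.
os.environ["MKL_NUM_THREADS"] = "1"

# Threaded MKL called only outside your own parallel regions: make the
# OpenMP workers sleep immediately after each call instead of spin-waiting.
os.environ["KMP_BLOCKTIME"] = "0"
```

Setting these in a launcher script (or before the first MKL call in-process) avoids the oversubscription described in the post.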
    <item>
      <title>Hi Jim,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961662#M21952</link>
      <description>&lt;P&gt;Hi Jim,&lt;/P&gt;

&lt;P&gt;Yes. I am using a single-threaded program when timing MKL's parallel DGEMM performance.&lt;/P&gt;

&lt;P&gt;I am just curious about the mismatch with Intel's reported performance.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2014 14:08:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961662#M21952</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-04-01T14:08:17Z</dc:date>
    </item>
  </channel>
</rss>

