<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic On most Intel processors in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110835#M72138</link>
    <description>&lt;P&gt;On most Intel processors DGEMM only uses one thread per core.&amp;nbsp;&amp;nbsp; The code is very tightly constructed (with very careful cache blocking) to give the best performance in this configuration.&amp;nbsp;&amp;nbsp; There are very few avoidable stalls that could be overlapped with work in the other logical processor. Using the other logical processor would cut the available cache in half, which would reduce the block sizes, increase the cache miss rates, and decrease the overall performance.&lt;/P&gt;

&lt;P&gt;I have not looked at Intel's DGEMM implementation for Xeon Phi x200, but it is easy to believe that it has the same properties.&amp;nbsp; (The first generation Xeon Phi (Knights Corner) was an exception because a single thread could only issue instructions every other cycle, so two threads were required to reach maximum speed on compute-bound codes.&amp;nbsp; This limitation is not present in the second generation Xeon Phi (Knights Landing): one thread of execution can issue two instructions every cycle, getting reasonably close to peak performance.)&lt;/P&gt;</description>
    <pubDate>Tue, 06 Sep 2016 20:37:52 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2016-09-06T20:37:52Z</dc:date>
    <item>
      <title>MKL DGEMM Hyperthreading.</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110832#M72135</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I'm trying to call DGEMM on relatively big matrices (m=10000, n=100000, k=10000) on Knights Landing.&lt;/P&gt;

&lt;P&gt;When I profile with VTune, I can see that a call to MKL DGEMM has 68 threads working (which is the number of physical cores), but I expected it to use 272 threads (logical cores) because of hyper-threading. Other parts of my code, where I use (openmp simd) directives, use up to 272 threads. I'm wondering if there are any settings I need to change in order to get hyper-threading working for my case.&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;Ali&lt;/P&gt;</description>
      <pubDate>Wed, 24 Aug 2016 16:57:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110832#M72135</guid>
      <dc:creator>seyedalireza_y_</dc:creator>
      <dc:date>2016-08-24T16:57:53Z</dc:date>
    </item>
    <item>
      <title>Did you look into MKL_DYNAMIC</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110833#M72136</link>
      <description>&lt;P&gt;Did you look into the MKL_DYNAMIC setting?&amp;nbsp; Did you find an advantage in using all the logical threads (vs. spreading a smaller number evenly across cores) in your omp parallel (not simd alone) regions?&lt;/P&gt;</description>
      <pubDate>Thu, 25 Aug 2016 11:52:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110833#M72136</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-08-25T11:52:58Z</dc:date>
    </item>
    <item>
      <title>MKL_NUM_THREADS</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110834#M72137</link>
      <description>&lt;P&gt;MKL_NUM_THREADS&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Note that many applications perform best using fewer than 4 threads per core.&lt;/SPAN&gt;&lt;/P&gt;

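&lt;P&gt;As a rough sketch of the arithmetic behind MKL_NUM_THREADS (not an official recipe), the Python snippet below builds the environment for a chosen number of threads per core. The 68-core count is taken from the question above; the helper name mkl_thread_env is hypothetical, and setting MKL_DYNAMIC=false is an assumption that you want MKL to try to use exactly the requested count.&lt;/P&gt;

&lt;PRE&gt;
```python
import os


def mkl_thread_env(physical_cores, threads_per_core):
    """Build environment settings requesting a fixed MKL thread count.

    Hypothetical helper for illustration; not part of MKL.
    """
    # Xeon Phi x200 (Knights Landing) has 4 hardware threads per core.
    if threads_per_core not in (1, 2, 3, 4):
        raise ValueError("threads_per_core must be between 1 and 4")
    total_threads = physical_cores * threads_per_core
    return {
        "MKL_NUM_THREADS": str(total_threads),
        # Assumption: turn off dynamic adjustment so MKL tries to honor the count.
        "MKL_DYNAMIC": "false",
    }


# 68 physical cores, as in the question; one thread per core.
settings = mkl_thread_env(physical_cores=68, threads_per_core=1)
# Merge into a copy of the current environment for launching a child process.
child_env = dict(os.environ, **settings)
print(settings["MKL_NUM_THREADS"])
```
&lt;/PRE&gt;

&lt;P&gt;These variables are normally set before the MKL-linked program starts, for example by passing the merged dictionary as the environment of a child process.&lt;/P&gt;
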
&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;See this article for information about setting the number of threads per core,&amp;nbsp;&lt;A href="https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-x200"&gt;https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-x200&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 14:56:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110834#M72137</guid>
      <dc:creator>Gregg_S_Intel</dc:creator>
      <dc:date>2016-09-06T14:56:55Z</dc:date>
    </item>
    <item>
      <title>On most Intel processors</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110835#M72138</link>
      <description>&lt;P&gt;On most Intel processors DGEMM only uses one thread per core.&amp;nbsp;&amp;nbsp; The code is very tightly constructed (with very careful cache blocking) to give the best performance in this configuration.&amp;nbsp;&amp;nbsp; There are very few avoidable stalls that could be overlapped with work in the other logical processor. Using the other logical processor would cut the available cache in half, which would reduce the block sizes, increase the cache miss rates, and decrease the overall performance.&lt;/P&gt;

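&lt;P&gt;To put rough numbers on the cache argument, here is a back-of-the-envelope Python sketch. It is not how MKL actually chooses its blocks: it simply assumes that three square double-precision tiles must fit in one thread's share of a cache, and the 32 KiB cache size is an illustrative figure, not a measured one. The function name max_square_tile is hypothetical.&lt;/P&gt;

&lt;PRE&gt;
```python
import math


def max_square_tile(cache_bytes, sharers, tiles_resident=3, elem_bytes=8):
    """Largest n such that `tiles_resident` n-by-n tiles fit in one thread's cache share."""
    # Assumption: the cache is split evenly among the threads sharing it.
    share = cache_bytes // sharers
    # Each tile holds n*n elements of elem_bytes each.
    return math.isqrt(share // (tiles_resident * elem_bytes))


CACHE_BYTES = 32 * 1024  # illustrative size only

one_thread = max_square_tile(CACHE_BYTES, sharers=1)
two_threads = max_square_tile(CACHE_BYTES, sharers=2)
print(one_thread, two_threads)
```
&lt;/PRE&gt;

&lt;P&gt;Under these assumptions the tile side drops from 36 to 26 when two threads share the cache, and a smaller tile means fewer arithmetic operations per element loaded.&lt;/P&gt;
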
&lt;P&gt;I have not looked at Intel's DGEMM implementation for Xeon Phi x200, but it is easy to believe that it has the same properties.&amp;nbsp; (The first generation Xeon Phi (Knights Corner) was an exception because a single thread could only issue instructions every other cycle, so two threads were required to reach maximum speed on compute-bound codes.&amp;nbsp; This limitation is not present in the second generation Xeon Phi (Knights Landing): one thread of execution can issue two instructions every cycle, getting reasonably close to peak performance.)&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 20:37:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110835#M72138</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-09-06T20:37:52Z</dc:date>
    </item>
    <item>
      <title>Thank you everyone for your</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110836#M72139</link>
      <description>&lt;P&gt;Thank you everyone for your thorough help and explanations. I tried it and saw that, as pointed out by Dr. McCalpin, it's better not to override the setting: performance declines when using more threads per core, and the code already seems to reach peak performance.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Oct 2016 14:57:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110836#M72139</guid>
      <dc:creator>seyedalireza_y_</dc:creator>
      <dc:date>2016-10-25T14:57:00Z</dc:date>
    </item>
    <item>
      <title>1. I would also recommend to</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110837#M72140</link>
      <description>1. I would also recommend looking at the &lt;STRONG&gt;compact&lt;/STRONG&gt; and &lt;STRONG&gt;scatter&lt;/STRONG&gt; settings for the &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; environment variable.

2. My extensive experience with the xGEMM MKL functions shows that they are all highly optimized when it comes to threading, and that threading can also be controlled with the &lt;STRONG&gt;OMP_NUM_THREADS&lt;/STRONG&gt; and &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; environment variables.

3. If, for example, a CPU with 4 physical cores and 8 logical CPUs is used, then &lt;STRONG&gt;OMP_NUM_THREADS&lt;/STRONG&gt; should be set to &lt;STRONG&gt;4&lt;/STRONG&gt;; setting it to &lt;STRONG&gt;8&lt;/STRONG&gt; doesn't improve performance.

4. Note that programmatic control is very simple; this is how it could look:
...
	#ifdef _RTTHREADTOPU_BINDING_SHOWINFO
	_RTLIBAPI RTtchar g_szThreadToPU[] = RTU("&lt;STRONG&gt;KMP_AFFINITY=granularity=fine,proclist=[0,2,4,6],explicit,verbose&lt;/STRONG&gt;");
	#else
	_RTLIBAPI RTtchar g_szThreadToPU[] = RTU("&lt;STRONG&gt;KMP_AFFINITY=granularity=fine,proclist=[0,2,4,6],explicit&lt;/STRONG&gt;");
	#endif
...</description>
      <pubDate>Fri, 09 Dec 2016 20:35:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110837#M72140</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2016-12-09T20:35:02Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...On most Intel processors</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110838#M72141</link>
      <description>&amp;gt;&amp;gt;...On most Intel processors DGEMM only uses one thread per core...

Absolutely correct and I confirm that.</description>
      <pubDate>Fri, 09 Dec 2016 20:36:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110838#M72141</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2016-12-09T20:36:58Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...I'm trying to call DGEMM</title>
      <link>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110839#M72142</link>
      <description>&amp;gt;&amp;gt;...I'm trying to call DGEMM on relatively big matrices (m=10000 , n=100000, k=10000) on knights landing...

I finally completed a set of tests for &lt;STRONG&gt;100Kx100K&lt;/STRONG&gt; square dense matrices on a &lt;STRONG&gt;KNL&lt;/STRONG&gt; server with &lt;STRONG&gt;64&lt;/STRONG&gt; cores. There is no performance improvement if more than 64 threads are used. Here are the test results:
...
Matrix multiplication C=A*B where matrix A( &lt;STRONG&gt;114688x114688&lt;/STRONG&gt; ) and matrix B( &lt;STRONG&gt;114688x114688&lt;/STRONG&gt; )
Allocating memory for matrices
Intializing matrix data
Matrix multiplication started
Matrix multiplication completed at 1941.544 seconds
Deallocating memory
Processing Completed
...</description>
      <pubDate>Tue, 07 Feb 2017 00:19:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/MKL-DGEMM-Hyperthreading/m-p/1110839#M72142</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-02-07T00:19:14Z</dc:date>
    </item>
  </channel>
</rss>

