<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: MKL Performance Improvement Suggestion in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1242721#M30601</link>
    <description>&lt;P&gt;Here is a chart of percentage improvement across cores 1, 2, 3, &amp;amp; 4 t/c&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="2021-01-02_14-29-18.jpg" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/14512iBCAF0FBDCA0F7C6F/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="2021-01-02_14-29-18.jpg" alt="2021-01-02_14-29-18.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;In this test, using 2 HTs/core adds a performance boost of between 20% and 30%. YMMV&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Sat, 02 Jan 2021 20:33:56 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2021-01-02T20:33:56Z</dc:date>
    <item>
      <title>MKL Performance Improvement Suggestion</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1242717#M30600</link>
      <description>&lt;P&gt;On a &lt;EM&gt;&lt;STRONG&gt;Windows&lt;/STRONG&gt; &lt;/EM&gt;system with multiple NUMA nodes and a large number of cores, it is not unusual for an MKL function call to examine the argument dimensions and then select a reduced set of logical processors for (intended) optimal performance of the function.&lt;/P&gt;
&lt;P&gt;My observations seem to indicate that when MKL chooses a subset of the available affinities (those of the process, or the calling thread's restricted set), the subset's threads are selected from the&lt;EM&gt; first # of threads&lt;/EM&gt; of the available affinities, as opposed to using the hardware topology of the available threads.&lt;/P&gt;
&lt;P&gt;For example, take a KNL 7210 configured with 4 NUMA nodes (each one a processor group on Windows), HT enabled, 4t/c; each NUMA node has 64 HW threads, 16 cores, 8 L2's. The optimal pick order within this node (assuming the calling thread is pinned to all of the node's HW threads) would be:&lt;/P&gt;
&lt;P&gt;0,8,16,24,32,40,48,56 (1st thread of 1st core of each L2) then&lt;BR /&gt;4,12,20,28,36,44,52,60 (1st thread 2nd core of each L2) then&lt;BR /&gt;1,9,17,25,33,41,49,57 (2nd thread of 1st core of each L2) then&lt;BR /&gt;...&lt;/P&gt;
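&lt;P&gt;The pick order above can be sketched in a few lines of Python. This is only an illustration, not anything MKL does: the numbering formula (id = L2 * 8 + core-in-L2 * 4 + thread-in-core) is inferred from the three example rows, the rows beyond the third are an assumed continuation of the same pattern, and the name pick_order is hypothetical.&lt;/P&gt;

```python
# Sketch: enumerate one KNL 7210 NUMA node in the pick order described above.
# Assumed numbering (inferred from the example rows):
#   logical processor id = l2 * 8 + core_in_l2 * 4 + thread_in_core
L2_PER_NODE = 8
CORES_PER_L2 = 2
THREADS_PER_CORE = 4


def pick_order():
    """Return the node's 64 logical processors, spreading across L2s first."""
    order = []
    for thread in range(THREADS_PER_CORE):
        for core in range(CORES_PER_L2):
            for l2 in range(L2_PER_NODE):
                order.append(l2 * CORES_PER_L2 * THREADS_PER_CORE
                             + core * THREADS_PER_CORE + thread)
    return order


order = pick_order()
print(order[:8])     # first thread of the first core of each L2
print(order[8:16])   # first thread of the second core of each L2
print(order[16:24])  # second thread of the first core of each L2
```

&lt;P&gt;Taking the first N entries of this list, rather than the first N logical processor numbers, would keep a reduced thread team spread over as many L2s and cores as possible.&lt;/P&gt;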
&lt;P&gt;This observation was made on a system with a KNL 7210:&lt;/P&gt;
&lt;P&gt;OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: 64-127&lt;BR /&gt;OMP: Info #214: KMP_AFFINITY: decoding x2APIC ids.&lt;BR /&gt;OMP: Info #156: KMP_AFFINITY: 256 available OS procs&lt;BR /&gt;OMP: Info #157: KMP_AFFINITY: Uniform topology&lt;BR /&gt;OMP: Info #285: KMP_AFFINITY: topology layer "LL cache" is equivalent to "core".&lt;BR /&gt;OMP: Info #285: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".&lt;BR /&gt;OMP: Info #191: KMP_AFFINITY: 1 socket x 64 cores/socket x 4 threads/core (64 total cores)&lt;/P&gt;
&lt;P&gt;The test made repeated calls to the MKL function HEEVR using a complex double array dimensioned (500,500), with 1 calling thread pinned to the 64 HW threads of a node.&lt;/P&gt;
&lt;P&gt;MKL is likely choosing fewer than 64 threads to perform these computations.&lt;/P&gt;
&lt;P&gt;Note, the test program reads OMP_PLACES and then pins each OpenMP thread to its respective place (only 1 place in the data below). The environment variables and the run-time affinities within the processor group are listed below:&lt;/P&gt;
&lt;P&gt;My suspicion is that a subset of the 64 logical processors is taken from the first N available processors of the calling thread.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;SET OMP_NUM_THREADS=&lt;BR /&gt;SET KMP_HW_SUBSET=&lt;BR /&gt;SET OMP_PPROC_BIND=&lt;BR /&gt;SET KMP_AFFINITY=&lt;BR /&gt;SET OMP_PLACES={0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60}&lt;BR /&gt;OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60&lt;BR /&gt;1tPlaceNode(0), 8.3100&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;SET OMP_NUM_THREADS=&lt;BR /&gt;SET KMP_HW_SUBSET=&lt;BR /&gt;SET OMP_PPROC_BIND=&lt;BR /&gt;SET KMP_AFFINITY=&lt;BR /&gt;SET OMP_PLACES={0:2,4:2,8:2,12:2,16:2,20:2,24:2,28:2,32:2,36:2,40:2,44:2,48:2,52:2,56:2,60:2}&lt;BR /&gt;OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33 36 37 40 41 44 45 48 49 52 53 56 57 60 61&lt;BR /&gt;2tPlaceNode(0), 11.8300&lt;/P&gt;
&lt;P&gt;SET OMP_NUM_THREADS=&lt;BR /&gt;SET KMP_HW_SUBSET=&lt;BR /&gt;SET OMP_PPROC_BIND=&lt;BR /&gt;SET KMP_AFFINITY=&lt;BR /&gt;SET OMP_PLACES={0:3,4:3,8:3,12:3,16:3,20:3,24:3,28:3,32:3,36:3,40:3,44:3,48:3,52:3,56:3,60:3}&lt;BR /&gt;OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 4 5 6 8 9 10 12 13 14 16 17 18 20 21 22 24 25 26 28 29 30 32 33 34 36 37 38 40 41 42 44 45 46 48 49 50 52 53 54 56 57 58 60 61 62&lt;BR /&gt;3tPlaceNode(0), 16.1700&lt;/P&gt;
&lt;P&gt;SET OMP_NUM_THREADS=&lt;BR /&gt;SET KMP_HW_SUBSET=&lt;BR /&gt;SET OMP_PPROC_BIND=&lt;BR /&gt;SET KMP_AFFINITY=&lt;BR /&gt;SET OMP_PLACES={0:64}&lt;BR /&gt;OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63&lt;BR /&gt;4tPlaceNode(0), 19.6800&lt;/P&gt;
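&lt;P&gt;For readers unfamiliar with the OMP_PLACES interval notation used in the runs above, here is a minimal Python sketch that expands a single place into its logical processor numbers. It assumes the OpenMP lower:length:stride form, handles only one brace-delimited place, and the helper name expand_place is hypothetical (it is not part of the test program).&lt;/P&gt;

```python
def expand_place(place):
    """Expand one OMP_PLACES place such as '{0:2,4:2}' into processor ids.

    Each comma-separated entry is 'lower', 'lower:length', or
    'lower:length:stride'; length and stride default to 1.
    """
    ids = []
    for entry in place.strip().strip("{}").split(","):
        fields = [int(f) for f in entry.split(":")]
        # Pad missing length/stride with their defaults.
        lower, length, stride = (fields + [1, 1])[:3]
        ids.extend(lower + i * stride for i in range(length))
    return ids


# Rebuild the 2t/c place from the second run and expand it.
two_per_core = "{" + ",".join(f"{c}:2" for c in range(0, 64, 4)) + "}"
print(expand_place(two_per_core))
print(expand_place("{0:64}") == list(range(64)))
```

&lt;P&gt;The expansion of the 2t/c place matches the "Pinned:" line reported for that run (0 1 4 5 8 9 ...).&lt;/P&gt;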
&lt;P&gt;NOTE&lt;/P&gt;
&lt;P&gt;The slowdown is NOT a case of HyperThreading being slower than no HyperThreading; rather, it is a case of poor thread selection in MKL when it subsets the threads for the function (given the size of the problem). To wit:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="2021-01-02_14-00-58.jpg" style="width: 862px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/14511i1AE01390122AD507/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="2021-01-02_14-00-58.jpg" alt="2021-01-02_14-00-58.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The bottom labels are the # of cores used; cores are spread across NUMA nodes, then within node. In nearly all cases using 2, 3 or 4 HT/core was faster. It is undetermined why 4t/c, and to a lesser extent 3t/c, is so jagged. Lack of knowledge of what is going on inside MKL hinders further investigation.&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 02 Jan 2021 20:12:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1242717#M30600</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2021-01-02T20:12:48Z</dc:date>
    </item>
    <item>
      <title>Re: MKL Performance Improvement Suggestion</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1242721#M30601</link>
      <description>&lt;P&gt;Here is a chart of percentage improvement across cores 1, 2, 3, &amp;amp; 4 t/c&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="2021-01-02_14-29-18.jpg" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/14512iBCAF0FBDCA0F7C6F/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="2021-01-02_14-29-18.jpg" alt="2021-01-02_14-29-18.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;In this test, using 2 HTs/core adds a performance boost of between 20% and 30%. YMMV&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 02 Jan 2021 20:33:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1242721#M30601</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2021-01-02T20:33:56Z</dc:date>
    </item>
    <item>
      <title>Re:MKL Performance Improvement Suggestion</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1243797#M30608</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for providing your suggestions. We are forwarding this query to the concerned team.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Rahul&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 06 Jan 2021 10:23:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1243797#M30608</guid>
      <dc:creator>RahulV_intel</dc:creator>
      <dc:date>2021-01-06T10:23:15Z</dc:date>
    </item>
    <item>
      <title>Re: Re:MKL Performance Improvement Suggestion</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1244298#M30616</link>
      <description>&lt;P&gt;There is a better diagnosis of the issue &lt;A href="https://community.intel.com/t5/Intel-Fortran-Compiler/COARRAY-process-pinning-bug/m-p/1244239/highlight/false#M153769" target="_self"&gt;here&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;It seems that MKL is using the calling Process affinity mask instead of the calling Thread affinity mask. (On Windows)&lt;/P&gt;
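&lt;P&gt;To illustrate why the choice of mask matters, here is a small Python sketch. It does not reproduce MKL's logic; it only assumes a "take the lowest N processors" selection, a process mask covering the whole node, a thread mask with one HW thread per core, and a hypothetical team size of 16. The helper names are made up for the example.&lt;/P&gt;

```python
def first_n(processors, n):
    """Take the n lowest-numbered processors from an affinity set."""
    return sorted(processors)[:n]


def cores_used(processors, threads_per_core=4):
    """Count distinct cores, assuming 4 consecutive ids per core."""
    return len({p // threads_per_core for p in processors})


process_affinity = set(range(64))       # whole node (assumed process mask)
thread_affinity = set(range(0, 64, 4))  # one HW thread per core (assumed thread mask)
n = 16                                  # hypothetical worker count

from_process = first_n(process_affinity, n)
from_thread = first_n(thread_affinity, n)
print(cores_used(from_process))  # 4 cores, each with 4 threads
print(cores_used(from_thread))   # 16 cores, one thread each
```

&lt;P&gt;With these assumptions, the same selection rule packs 16 workers onto 4 cores when fed the process mask, but spreads them over 16 cores when fed the thread mask.&lt;/P&gt;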
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 07 Jan 2021 20:41:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1244298#M30616</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2021-01-07T20:41:29Z</dc:date>
    </item>
    <item>
      <title>Re:MKL Performance Improvement Suggestion</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1284327#M31374</link>
      <description>&lt;P&gt;Hi Jim,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We no longer have access to KNL systems in order to validate your finding.&lt;/P&gt;&lt;P&gt;I apologize for the inconvenience.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Khang&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 25 May 2021 05:06:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1284327#M31374</guid>
      <dc:creator>Khang_N_Intel</dc:creator>
      <dc:date>2021-05-25T05:06:32Z</dc:date>
    </item>
    <item>
      <title>Re:MKL Performance Improvement Suggestion</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1292842#M31596</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Can you please let us know which oneAPI version was used?&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 24 Jun 2021 08:33:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1292842#M31596</guid>
      <dc:creator>MRajesh_intel</dc:creator>
      <dc:date>2021-06-24T08:33:54Z</dc:date>
    </item>
    <item>
      <title>Re:MKL Performance Improvement Suggestion</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1294197#M31617</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We are closing this thread as we no longer support KNL machines. Please see the system requirements for further information. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Link: &lt;A href="https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html" target="_blank"&gt;https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Have a good day.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Rajesh&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 29 Jun 2021 05:48:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Performance-Improvement-Suggestion/m-p/1294197#M31617</guid>
      <dc:creator>MRajesh_intel</dc:creator>
      <dc:date>2021-06-29T05:48:13Z</dc:date>
    </item>
  </channel>
</rss>

