<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Option number 2 is the one I in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-sgemm-with-multiple-cores/m-p/1121169#M24956</link>
    <description>&lt;P&gt;Option number 2 is the one I believe I wanted. Thank you very much for your help.&lt;/P&gt;</description>
    <pubDate>Mon, 11 Jul 2016 15:52:18 GMT</pubDate>
    <dc:creator>Brandon_R_</dc:creator>
    <dc:date>2016-07-11T15:52:18Z</dc:date>
    <item>
      <title>Using sgemm with multiple cores</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-sgemm-with-multiple-cores/m-p/1121167#M24954</link>
      <description>&lt;P&gt;Hello! I am trying to run sgemm matrix multiplication on multiple physical cores, and I am a little confused about how to do so.&lt;/P&gt;

&lt;P&gt;Say I have obtained 9 physical cores from an HPC system and I want sgemm to use all of them for the matrix multiplication. I do not want additional multithreading within each core; I want the 9 cores to work together as a whole, so in a way the 9 cores are the threads used by sgemm. Below is some code that I believe implements this. Is this implementation correct?&lt;/P&gt;

&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;program&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;mkl_matrixmul&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p1"&gt;&lt;SPAN class="s3" style="font-size: 1em; line-height: 1.5;"&gt;use&lt;/SPAN&gt;&lt;SPAN class="s2" style="font-size: 1em; line-height: 1.5;"&gt; &lt;/SPAN&gt;&lt;SPAN class="s4" style="font-size: 1em; line-height: 1.5;"&gt;mpi&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p2"&gt;&lt;SPAN class="s3"&gt;implicit&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s5"&gt;none&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p4"&gt;&lt;SPAN class="s5"&gt;integer&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; ::&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; N,max_threads,mkl_get_max_threads&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p2"&gt;&lt;SPAN class="s5"&gt;real&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;, &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;allocatable&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;, &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;dimension&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;(:,:) ::&lt;/SPAN&gt;&lt;SPAN class="s6"&gt; A,B,C&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p4"&gt;&lt;SPAN class="s5"&gt;integer&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; ::&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; ierror,num_cores,my_rank&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p5"&gt;&lt;SPAN class="s3"&gt;double precision&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; ::&lt;/SPAN&gt;&lt;SPAN class="s6"&gt; time1,time2&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s1"&gt;CALL&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s4"&gt;MPI_Init&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;(ierror) &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;!error flag&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s1"&gt;CALL&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s4"&gt;MPI_Comm_size&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;(MPI_COMM_WORLD,num_cores,ierror) &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;!stores the number of MPI processes in num_cores&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s1"&gt;CALL&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s4"&gt;MPI_Comm_rank&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;(MPI_COMM_WORLD,my_rank,ierror) &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;!stores the rank of this process in my_rank&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s1"&gt;CALL&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; &lt;/SPAN&gt;&lt;SPAN class="s4"&gt;MPI_BARRIER&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;(MPI_COMM_WORLD,ierror)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s1"&gt;if&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;(my_rank == 0)&lt;/SPAN&gt;&lt;SPAN class="s1"&gt;then&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s2"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;!starting the timer&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s3"&gt;&amp;nbsp;&amp;nbsp; time1 = MPI_Wtime()&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p2"&gt;&lt;SPAN class="s3"&gt;end if&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s3"&gt;N = 61740&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s1"&gt;Allocate&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;(A(N,N),B(N,N),C(N,N))&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s3"&gt;A = 1.0&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s3"&gt;B = 2.0&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s3"&gt;C = 0.0&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;call&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;mkl_set_num_threads&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;(num_cores)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s1"&gt;call&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; &lt;/SPAN&gt;&lt;SPAN class="s4"&gt;sgemm&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;(&lt;/SPAN&gt;&lt;SPAN class="s7"&gt;'N'&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;,&lt;/SPAN&gt;&lt;SPAN class="s7"&gt;'N'&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;,N,N,N,1.0,A,N,B,N,1.0,C,N)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s1"&gt;CALL&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; &lt;/SPAN&gt;&lt;SPAN class="s4"&gt;MPI_BARRIER&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;(MPI_COMM_WORLD,ierror)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s1"&gt;if&lt;/SPAN&gt;&lt;SPAN class="s3"&gt;(my_rank == 0)&lt;/SPAN&gt;&lt;SPAN class="s1"&gt;then&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p6"&gt;&lt;SPAN class="s2"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;!printing the elapsed time&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s3"&gt;&amp;nbsp;&amp;nbsp; time2 = MPI_Wtime()&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s3"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN class="s1"&gt;print&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; *, &lt;/SPAN&gt;&lt;SPAN class="s7"&gt;'elapsed time'&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; , time2 - time1&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p7"&gt;&lt;SPAN class="s3"&gt;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN class="s1"&gt;print&lt;/SPAN&gt;&lt;SPAN class="s3"&gt; *, C(1,2)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p2"&gt;&lt;SPAN class="s3"&gt;end if&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;CALL&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;MPI_Finalize&lt;/SPAN&gt;&lt;SPAN class="s2"&gt;(ierror)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;end program&lt;/SPAN&gt;&lt;SPAN class="s2"&gt; &lt;/SPAN&gt;&lt;SPAN class="s3"&gt;mkl_matrixmul&lt;/SPAN&gt;&lt;/P&gt;

&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P class="p1"&gt;Also, if it helps, I am using a Sandy Bridge node with 256 GB of memory.&lt;/P&gt;

&lt;P class="p1"&gt;Thank you,&amp;nbsp;&lt;/P&gt;

&lt;P class="p1"&gt;Brandon&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2016 17:15:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-sgemm-with-multiple-cores/m-p/1121167#M24954</guid>
      <dc:creator>Brandon_R_</dc:creator>
      <dc:date>2016-07-08T17:15:41Z</dc:date>
    </item>
    <item>
      <title>Hi Brandon,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-sgemm-with-multiple-cores/m-p/1121168#M24955</link>
      <description>&lt;P&gt;Hi Brandon,&lt;/P&gt;

&lt;P&gt;I'm not sure I understand your question correctly.&lt;/P&gt;

&lt;P&gt;1) In general, MKL is multithreaded through the OpenMP run-time library. If you call MKL sgemm directly, as in the MKL Fortran samples, &lt;A href="https://community.intel.com/legacyfs/online/drupal_files/mkl_fortran_samples_05162016.zip"&gt;https://software.intel.com/sites/default/files/mkl_fortran_samples_05162016.zip&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;and compile it with ifort your.for -mkl, sgemm can run on the 9 physical cores of one node automatically. &lt;STRONG&gt;You don't need to write any threading code.&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;2) You can use MPI processes and let process 0 call sgemm (please note that MKL also provides an MPI version of sgemm, psgemm()). Then let sgemm use 9 OpenMP threads, for example with mkl_set_num_threads(9).&lt;/P&gt;
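
&lt;P&gt;As a rough sketch of option 2 (not an official MKL sample), the program below lets only rank 0 allocate the matrices and call sgemm with 9 threads. The matrix size N = 2000, the program name, and the thread count are illustrative assumptions; the MPI calls, mkl_set_num_threads, and sgemm are the same routines used in the question.&lt;/P&gt;

&lt;PRE&gt;
program sgemm_rank0
use mpi
implicit none
integer :: N, ierror, my_rank
real, allocatable, dimension(:,:) :: A, B, C
double precision :: time1, time2

call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)

! Only rank 0 allocates memory and does the multiplication
if (my_rank == 0) then
   N = 2000                       ! illustrative size (assumption)
   allocate(A(N,N), B(N,N), C(N,N))
   A = 1.0
   B = 2.0
   C = 0.0
   call mkl_set_num_threads(9)    ! ask MKL for 9 OpenMP threads
   time1 = MPI_Wtime()
   call sgemm('N', 'N', N, N, N, 1.0, A, N, B, N, 0.0, C, N)
   time2 = MPI_Wtime()
   print *, 'elapsed time', time2 - time1
   print *, C(1,2)                ! every entry of C should equal 2*N
   deallocate(A, B, C)
end if

! The other ranks simply wait here until rank 0 has finished
call MPI_Barrier(MPI_COMM_WORLD, ierror)
call MPI_Finalize(ierror)
end program sgemm_rank0
&lt;/PRE&gt;

&lt;P&gt;Depending on the MPI library, ranks waiting in MPI_Barrier may spin and keep their cores busy, so launching a single MPI process on the node is probably the safer way to run this sketch.&lt;/P&gt;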

&lt;P&gt;I think both approaches should work, but both still use 9 threads on 9 cores. In your case, do you perhaps want to coordinate the OpenMP threads with the MPI processes (what we call OpenMP affinity)?&lt;/P&gt;

&lt;P&gt;If so, consider the OpenMP places and affinity settings described in the OpenMP documentation, and see the MKL user guide, for example:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/node/599522"&gt;https://software.intel.com/en-us/node/599522&lt;/A&gt;&amp;nbsp; =&amp;gt; controlling the number of MPI processes and OpenMP threads&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/node/528552"&gt;https://software.intel.com/en-us/node/528552&lt;/A&gt;&amp;nbsp; =&amp;gt; controlling OpenMP thread affinity to cores.&lt;/P&gt;
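
&lt;P&gt;Before tuning these settings, it may help to check what each MPI process actually sees. The small program below is only a sketch with a hypothetical program name; it prints, for every rank, the maximum number of threads MKL is prepared to use, as reported by mkl_get_max_threads.&lt;/P&gt;

&lt;PRE&gt;
program report_threads
use mpi
implicit none
integer :: ierror, my_rank, num_procs
integer, external :: mkl_get_max_threads

call MPI_Init(ierror)
call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)

! Each rank reports the thread count MKL would use inside sgemm
print *, 'rank', my_rank, 'of', num_procs, 'MKL max threads:', mkl_get_max_threads()

call MPI_Finalize(ierror)
end program report_threads
&lt;/PRE&gt;

&lt;P&gt;If 9 ranks each report 9 threads on a 9-core allocation, the cores are likely oversubscribed, which is the situation the two links above help to avoid.&lt;/P&gt;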

&lt;P&gt;Best Regards,&lt;BR /&gt;
	Ying&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2016 04:17:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-sgemm-with-multiple-cores/m-p/1121168#M24955</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2016-07-11T04:17:30Z</dc:date>
    </item>
    <item>
      <title>Option number 2 is the one I</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-sgemm-with-multiple-cores/m-p/1121169#M24956</link>
      <description>&lt;P&gt;Option number 2 is the one I believe I wanted. Thank you very much for your help.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2016 15:52:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-sgemm-with-multiple-cores/m-p/1121169#M24956</guid>
      <dc:creator>Brandon_R_</dc:creator>
      <dc:date>2016-07-11T15:52:18Z</dc:date>
    </item>
  </channel>
</rss>

