<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Is there any parallel version in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144658#M26604</link>
    <description>&lt;P&gt;Is there a parallel version of mkl_simatcopy?&lt;/P&gt;</description>
    <pubDate>Tue, 14 May 2019 04:17:30 GMT</pubDate>
    <dc:creator>Gupta__Shubham1</dc:creator>
    <dc:date>2019-05-14T04:17:30Z</dc:date>
    <item>
      <title>MKL Rectangular matrix Inplace transpose performance issue</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144656#M26602</link>
      <description>&lt;P&gt;I need to transpose a very large matrix in place and am using mkl_simatcopy, but the in-place transpose is slow. The machine is an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz with 72 physical cores, running Red Hat.&lt;/P&gt;&lt;P&gt;When the transpose runs, only a single core is used rather than all of them. I have tried environment variables such as MKL_NUM_THREADS and MKL_DYNAMIC="FALSE". My compilation command is:&lt;BR /&gt;&lt;BR /&gt;gcc -std=c99 -m64 -I $MKLROOT/include transpose.c ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lstdc++ -lpthread -lm -ldl -o transpose.out&lt;BR /&gt;&lt;BR /&gt;Timings obtained:&lt;/P&gt;&lt;TABLE&gt;&lt;TR&gt;&lt;TH&gt;S. No.&lt;/TH&gt;&lt;TH&gt;No. of rows&lt;/TH&gt;&lt;TH&gt;No. of columns&lt;/TH&gt;&lt;TH&gt;Time (s)&lt;/TH&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;16384&lt;/TD&gt;&lt;TD&gt;8192&lt;/TD&gt;&lt;TD&gt;16&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;2&lt;/TD&gt;&lt;TD&gt;16384&lt;/TD&gt;&lt;TD&gt;32768&lt;/TD&gt;&lt;TD&gt;68&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;TD&gt;32768&lt;/TD&gt;&lt;TD&gt;65536&lt;/TD&gt;&lt;TD&gt;233&lt;/TD&gt;&lt;/TR&gt;&lt;/TABLE&gt;&lt;P&gt;The data type is float. Please let me know whether there is a more efficient way to transpose in place, how to make the transpose use multiple cores, or how else to reduce the execution time.&lt;/P&gt;&lt;P&gt;Below is a code snippet from transpose.c:&lt;/P&gt;&lt;PRE&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;mkl.h&amp;gt;

/* initalizeData() and cpuSecond() are helpers defined elsewhere in transpose.c (not shown). */

int main(int argc, char *argv[])
{
    if (argc != 3)
    {
        printf("Usage : exe NoofScan and NoofPix \n");
        exit(1);
    }
    unsigned long noOfScan = strtoul(argv[1], NULL, 10);
    unsigned long noOfPix = strtoul(argv[2], NULL, 10);
    /* %lu matches unsigned long; the original %d did not. */
    printf("-----&amp;gt;&amp;gt;&amp;gt;&amp;gt;  noOfScan = %lu and noOfPix = %lu \n", noOfScan, noOfPix);
    size_t nEle = (size_t)noOfScan * noOfPix;

    float *data = (float *)calloc(nEle, sizeof(float));
    if (data == NULL)
    {
        printf("Allocation failed \n");
        exit(1);
    }
    initalizeData(data, noOfScan, noOfPix);
    /* mkl_get_max_threads() returns int. */
    int nt = mkl_get_max_threads();
    printf("No Of threads are = %d \n", nt);
    mkl_set_num_threads_local(nt);
    //mkl_set_num_threads(nt);
    double time1 = cpuSecond();
    /* Row-major, in-place transpose: lda = noOfPix before, ldb = noOfScan after. */
    mkl_simatcopy('R', 'T', noOfScan, noOfPix, 1.0f, data, noOfPix, noOfScan);
    printf("Time elapsed is %lf \n", cpuSecond() - time1);
    memset(data, 0, nEle * sizeof(float));
    free(data);
    return 0;
}&lt;/PRE&gt;</description>
      <pubDate>Mon, 13 May 2019 09:04:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144656#M26602</guid>
      <dc:creator>Gupta__Shubham1</dc:creator>
      <dc:date>2019-05-13T09:04:11Z</dc:date>
    </item>
    <item>
      <title>The bulk of the work of</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144657#M26603</link>
      <description>&lt;P&gt;The bulk of the work of forming the transpose is performed in a library subroutine. As far as performance is concerned, what matters is whether, and how well, that library subroutine is parallelized. Changing compiler options (or trying to parallelize the code that calls the transpose routine) cannot have any effect on the performance of the library subroutine.&lt;/P&gt;&lt;P&gt;If the MKL library that you use contains a parallel version of mkl_simatcopy(), its run time can be affected by setting MKL_NUM_THREADS, etc. However, the timings that you reported indicate that the time taken by the routine is roughly proportional to the number of elements in the matrix being transposed, which is what one expects from a serial version of the routine.&lt;/P&gt;</description>
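      <content:encoded><![CDATA[
The proportionality argument in this reply can be checked with a little arithmetic on the three timings reported in the original post. The sketch below divides each reported time by the element count; the struct and function names (`run`, `ns_per_element`, `print_per_element_cost`) are illustrative only and come from neither the thread nor MKL.

```c
#include <assert.h>
#include <stdio.h>

/* Timings as reported in the original post: rows, columns, seconds. */
struct run { unsigned long long rows, cols; double seconds; };

static const struct run runs[] = {
    {16384ULL,  8192ULL,  16.0},
    {16384ULL, 32768ULL,  68.0},
    {32768ULL, 65536ULL, 233.0},
};

/* Nanoseconds spent per matrix element for one reported run. */
double ns_per_element(const struct run *r)
{
    assert(r->rows > 0 && r->cols > 0);
    return r->seconds * 1e9 / (double)(r->rows * r->cols);
}

/* Print the per-element cost of every reported run. */
void print_per_element_cost(void)
{
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; ++i)
        printf("%llu x %llu: %.1f ns per element\n",
               runs[i].rows, runs[i].cols, ns_per_element(&runs[i]));
}
```

The three runs come out at about 119, 127 and 109 ns per element, so the cost per element stays roughly constant while the matrix grows 16-fold. That is consistent with a single-threaded routine, although linear scaling alone would not rule out a threaded one; the single busy core observed in the original post is the more direct evidence.
]]></content:encoded>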
      <pubDate>Tue, 14 May 2019 00:08:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144657#M26603</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2019-05-14T00:08:25Z</dc:date>
    </item>
    <item>
      <title>Is there any parallel version</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144658#M26604</link>
      <description>&lt;P&gt;Is there a parallel version of mkl_simatcopy?&lt;/P&gt;</description>
      <pubDate>Tue, 14 May 2019 04:17:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144658#M26604</guid>
      <dc:creator>Gupta__Shubham1</dc:creator>
      <dc:date>2019-05-14T04:17:30Z</dc:date>
    </item>
    <item>
      <title>not. You may submit the</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144659#M26605</link>
      <description>&lt;P&gt;No, there is not. You may submit a feature request on this topic to the Intel Online Service Center.&lt;/P&gt;</description>
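      <content:encoded><![CDATA[
Since no parallel mkl_simatcopy is available, one possible workaround is to parallelize the transpose in user code. The sketch below is not an MKL routine and is not in place: it is a tiled, out-of-place transpose using OpenMP, so it needs a second buffer of the same size (for the 32768 x 65536 float case, 8 GiB for the source plus 8 GiB for the destination). The function name `transpose_blocked` and the tile size `TILE` are assumptions made for illustration, and the tile size is untuned.

```c
#include <assert.h>
#include <stddef.h>

/* Tile edge in elements; 64 is an untuned guess, not a measured optimum. */
#define TILE 64

/*
 * Out-of-place transpose of the row-major rows x cols matrix `src`
 * into the row-major cols x rows matrix `dst`.
 * The two buffers must not overlap.
 */
void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols)
{
    assert(src != dst);
    /* Tiles are disjoint, so no two threads ever write the same element. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ib = 0; ib < rows; ib += TILE) {
        for (size_t jb = 0; jb < cols; jb += TILE) {
            /* Clamp the tile at the matrix edges. */
            size_t iend = ib + TILE < rows ? ib + TILE : rows;
            size_t jend = jb + TILE < cols ? jb + TILE : cols;
            for (size_t i = ib; i < iend; ++i)
                for (size_t j = jb; j < jend; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

With gcc this would be built with `-fopenmp`; without that flag the pragma is ignored and the loop runs serially. A call such as `transpose_blocked(data, out, noOfScan, noOfPix)` would stand in for the mkl_simatcopy call in the original post, with `out` a separately allocated buffer of `noOfScan * noOfPix` floats. Whether this actually beats the reported timings on the 72-core machine has not been measured here.
]]></content:encoded>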
      <pubDate>Tue, 14 May 2019 07:14:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Rectangular-matrix-Inplace-transpose-performance-issue/m-p/1144659#M26605</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2019-05-14T07:14:55Z</dc:date>
    </item>
  </channel>
</rss>

