<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic "I can run the program with" in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007831#M105290</link>
    <description>&lt;P&gt;Forum thread: performance regression in ifort 14.0 when optimizing matrix multiplication with allocatable arrays, compared with ifort 13.1.&lt;/P&gt;</description>
    <pubDate>Mon, 05 May 2014 15:00:52 GMT</pubDate>
    <dc:creator>Axel_P_</dc:creator>
    <dc:date>2014-05-05T15:00:52Z</dc:date>
    <item>
      <title>Optimization problem with allocatable arrays</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007829#M105288</link>
      <description>&lt;P&gt;I detected a performance regression in ifort 14.0 when using allocatable arrays. Optimization is perfect with fixed-size arrays, but it does not work properly when auto-parallelization is combined with allocatable arrays. With ifort 13.1 this worked at full performance.&lt;/P&gt;</description>

&lt;P&gt;When single-threaded code is produced, there is a partial, but extremely significant, loss of performance when using allocatable arrays.&lt;/P&gt;

&lt;P&gt;I attached two files: ifortregression.txt, showing the compiler version, compiler parameters and execution times; and matmul.F, the source code of the test program.&lt;/P&gt;</description>
      <pubDate>Mon, 05 May 2014 11:11:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007829#M105288</guid>
      <dc:creator>Axel_P_</dc:creator>
      <dc:date>2014-05-05T11:11:39Z</dc:date>
    </item>
    <item>
      <title>Apparently, this clockx</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007830#M105289</link>
      <description>Apparently, the clockx function in my installation doesn't work with -parallel.  Generally speaking, it would be better to use the Fortran system_clock intrinsic, with integer arguments of kind selected_int_kind(12), in order to see the wall-clock time of threaded applications.
If you did collect CPU time successfully in the threaded case, it would show the total time spent by all threads, so parallelization would be expected to increase the reported time.

I notice that my compiler fails to perform the "blocked by 128" optimization on the second version with allocatable arrays.  It will be important to study the compiler reports such as those generated by -opt-report.  
The compiler fails to parallelize the first version when -parallel -DSIZEARGS is set, apparently because so much loop interchanging is required to achieve single thread optimization, and the outer loop on j is required for effective parallelization.  The compiler needs to know that the problem is large enough to benefit from parallelization at the possible cost of single thread performance.
In the allocatable array case, in the absence of loop count directives and the like, the compiler will assume much smaller arrays than what you have set in the fixed dimension version.  Apparently, this determines whether the compiler chooses cache blocking optimizations.</description>
      <pubDate>Mon, 05 May 2014 12:37:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007830#M105289</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-05T12:37:00Z</dc:date>
    </item>
    <item>
      <title>I can run the program with</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007831#M105290</link>
      <description>&lt;P&gt;I can run the program with "time", cpu and wall time output is consistent with the overall values reported by time:&lt;/P&gt;

&lt;P&gt;cp003421_matmul&amp;gt; OMP_NUM_THREADS=4 time ./a.out 6000&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.480&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.121&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 17.740&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.437&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 17.960&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.494&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 533082291842201.125&lt;BR /&gt;
	36.09user 0.16system 0:09.21elapsed 393%CPU (0avgtext+0avgdata 877168maxresident)k&lt;BR /&gt;
	0inputs+0outputs (0major+3881minor)pagefaults 0swaps&lt;/P&gt;

&lt;P&gt;-opt-report reveals that (with option -parallel)&amp;nbsp;the 14.0 compiler "forgets" to replace the matrix multiplication with the matmul intrinsic, while the 13.1 compiler does this replacement. I wonder why this feature has been dropped.&lt;/P&gt;

&lt;P&gt;Many loops require a par-threshold of 99 or less instead of the default 100 to be parallelized:&lt;/P&gt;

&lt;P&gt;&amp;gt; ifort -O3 -parallel -par-threshold99 -par-report -DARDIM=4000 -DSIZEARGS -mavx matmul.F&lt;BR /&gt;
	matmul.F(77): (col. 18) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(106): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(120): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	&amp;gt; ifort -O3 -parallel -par-threshold -par-report -DARDIM=4000 -DSIZEARGS -mavx matmul.F&amp;nbsp;&lt;BR /&gt;
	matmul.F(106): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(120): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;/P&gt;

&lt;P&gt;In parallel mode both matrix multiplications are treated the same; loops are permuted if necessary. So I wonder why loop order jki is treated differently in sequential mode. In fact, starting with compiler version 10, jki is the only one of the 6 possible loop orders to show reduced performance.&lt;/P&gt;

</description>
      <pubDate>Mon, 05 May 2014 15:00:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007831#M105290</guid>
      <dc:creator>Axel_P_</dc:creator>
      <dc:date>2014-05-05T15:00:52Z</dc:date>
    </item>
    <item>
      <title>-opt-matmul should be an</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007832#M105291</link>
      <description>-opt-matmul should be an effective way to parallelize.  I don't know why it may no longer be automatic with -parallel.</description>
      <pubDate>Mon, 05 May 2014 17:56:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007832#M105291</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-05T17:56:16Z</dc:date>
    </item>
    <item>
      <title>Unfortunately -opt-matmul is</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007833#M105292</link>
      <description>&lt;P&gt;Unfortunately -opt-matmul is ignored by the 14.0 compiler if allocatable arrays are used. My full test program (source attached) shows that even calling the matmul intrinsic directly gives a slower executable than with compiler version 13.1. Calling dgemm directly is also slower, and this is definitely a compiler feature, not a library feature, since I can run both executables with either libMKL, there is no difference in execution speed.&lt;/P&gt;

&lt;P&gt;&amp;gt; ifort14.0 -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS -DDGEMM -mavx -mkl crunchtbvar.F&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;gt; OMP_NUM_THREADS=4 time ./a.out 4000&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.210&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.056&amp;nbsp;&amp;nbsp;&amp;nbsp; init&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.520&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.379&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ijk&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.770&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.444&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.730&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.434&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jik&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.590&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.398&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.730&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.436&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; kij&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.540&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.385&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; kji&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.297&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.520&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.387&amp;nbsp; matmul&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6.840&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.050&amp;nbsp;&amp;nbsp; dgemm&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.297&lt;BR /&gt;
	74.41user 0.10system 0:19.02elapsed 391%CPU (0avgtext+0avgdata 400392maxresident)k&lt;BR /&gt;
	0inputs+0outputs (0major+5068minor)pagefaults 0swaps&lt;BR /&gt;
	&amp;gt; ifort13.1 -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS -DDGEMM -mavx -mkl crunchtbvar.F&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;gt; OMP_NUM_THREADS=4 time ./a.out 4000&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.230&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.059&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6.300&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.574&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ijk&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.300&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.329&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.380&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.346&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jik&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.390&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.348&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.420&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.355&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; kij&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.450&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.361&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; kji&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.312&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.390&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.356&amp;nbsp; matmul&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.420&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.357&amp;nbsp;&amp;nbsp; dgemm&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.297&lt;BR /&gt;
	44.27user 0.08system 0:11.22elapsed 395%CPU (0avgtext+0avgdata 400376maxresident)k&lt;BR /&gt;
	0inputs+0outputs (0major+3820minor)pagefaults 0swaps&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 06 May 2014 07:41:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007833#M105292</guid>
      <dc:creator>Axel_P_</dc:creator>
      <dc:date>2014-05-06T07:41:41Z</dc:date>
    </item>
    <item>
      <title>You have so many different</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007834#M105293</link>
      <description>&lt;P&gt;You have so many different combinations of compiler options and tests that it's hard to see if there is any regression or not.&amp;nbsp; Can we just focus on one case at a time and make sure we do apples-to-apples comparisons?&amp;nbsp; In fact, the following case shows that 14.0.2&amp;nbsp;is&amp;nbsp;about 2x&amp;nbsp;faster than&amp;nbsp;13.1.2:&lt;/P&gt;

&lt;P&gt;===================14.0.2=============&lt;/P&gt;

&lt;P&gt;$ ifort -V&lt;BR /&gt;
	Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120&lt;/P&gt;

&lt;P&gt;$ ifort -O3 -parallel -par_threshold90 -par-report -DARDIM=4000 matmul.F -o matmul-14.0.2.144-pt90.x&lt;BR /&gt;
	matmul.F(73): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED&lt;BR /&gt;
	matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED&lt;BR /&gt;
	matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED&lt;BR /&gt;
	matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED&lt;BR /&gt;
	matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED&lt;BR /&gt;
	matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED&lt;BR /&gt;
	$ export OMP_NUM_THREADS=1&lt;BR /&gt;
	$ time ./matmul-14.0.2.144-pt90.x&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.190&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.195&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.500&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.504&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.500&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.498&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.250&lt;/P&gt;

&lt;P&gt;real&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m15.850s&lt;BR /&gt;
	user&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m11.123s&lt;BR /&gt;
	sys&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m0.087s&lt;BR /&gt;
	$ export OMP_NUM_THREADS=4&lt;BR /&gt;
	$ time ./matmul-14.0.2.144-pt90.x&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.200&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.054&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.620&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.406&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.600&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.400&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.031&lt;/P&gt;

&lt;P&gt;real&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m3.173s&lt;BR /&gt;
	user&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m11.336s&lt;BR /&gt;
	sys&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m0.106s&lt;BR /&gt;
	$&lt;/P&gt;

&lt;P&gt;=========================13.1.2=====================&lt;/P&gt;

&lt;P&gt;$ ifort -V&lt;BR /&gt;
	Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.2.183 Build 20130514&lt;/P&gt;

&lt;P&gt;$ ifort -O3 -parallel -par_threshold90 -DARDIM=4000 matmul.F -par-report -o matmul-13.1.2.183-pt90.x&lt;BR /&gt;
	matmul.F(73): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.&lt;BR /&gt;
	$ export OMP_NUM_THREADS=1&lt;BR /&gt;
	$ time ./matmul-13.1.2.183-pt90.x&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.190&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.195&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 11.000&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp; 10.999&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 10.970&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp; 10.970&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.250&lt;/P&gt;

&lt;P&gt;real&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m22.492s&lt;BR /&gt;
	user&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m22.101s&lt;BR /&gt;
	sys&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m0.075s&lt;BR /&gt;
	$ export OMP_NUM_THREADS=4&lt;BR /&gt;
	$ time ./matmul-13.1.2.183-pt90.x&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.200&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.053&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 10.990&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.753&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 10.990&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.747&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.016&lt;/P&gt;

&lt;P&gt;real&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m7.385s&lt;BR /&gt;
	user&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m22.113s&lt;BR /&gt;
	sys&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0m0.088s&lt;BR /&gt;
	$&lt;/P&gt;

&lt;P&gt;Perhaps this might be related to your test machine, which you indicated was a Xeon E31240.&amp;nbsp; That's an older Sandy Bridge box with a smallish 8 MB cache.&amp;nbsp; I ran my tests on an Ivy Bridge box with a 20 MB cache.&lt;/P&gt;

&lt;P&gt;Patrick&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2014 16:29:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007834#M105293</guid>
      <dc:creator>pbkenned1</dc:creator>
      <dc:date>2014-05-07T16:29:00Z</dc:date>
    </item>
    <item>
      <title>The presentations on new</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007835#M105294</link>
      <description>The presentations on new versions of MKL raise suspicions that opt-matmul functionality is being replaced:
The following examples of code and link lines show how to partially inline Intel MKL functions in Fortran applications:

Include:
- mkl_inline_pp.fi, to be preprocessed by the Intel® Fortran Compiler preprocessor
- mkl_inline.fi, in each subroutine that calls *GEMM
#     include "mkl_inline_pp.fi"
      program   DGEMM_MAIN
      include 'mkl_inline.fi'
....
*      Call Intel MKL DGEMM
....
      call sub1()
      stop 1
      end

*     A subroutine that calls DGEMM 
      subroutine sub1
*      Need to include mkl_inline.fi for each subroutine that calls DGEMM
      include 'mkl_inline.fi'
*      Call Intel MKL DGEMM

      end
Compile with the /fpp compiler option and the MKL_INLINE preprocessor macro to use threaded Intel MKL:
ifort /DMKL_INLINE /fpp your_application.f mkl_intel_lp64.lib mkl_core.lib mkl_intel_thread.lib /Qopenmp -I%MKLROOT%/include
Compile with the /fpp compiler option and the MKL_INLINE_SEQ preprocessor macro to use Intel MKL in sequential mode:
ifort /DMKL_INLINE_SEQ /fpp your_application.f mkl_intel_lp64.lib mkl_core.lib mkl_sequential.lib -I%MKLROOT%/include

The presenter declined to discuss this.</description>
      <pubDate>Wed, 07 May 2014 16:50:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007835#M105294</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-07T16:50:16Z</dc:date>
    </item>
    <item>
      <title>-opt-matmul does not use MKL</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007836#M105295</link>
      <description>&lt;P&gt;-opt-matmul does not use MKL directly - it calls into an MKL-derived routine in the Fortran support library, since Fortran needs more than the generic xGEMM can supply.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2014 17:00:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007836#M105295</guid>
      <dc:creator>Steven_L_Intel1</dc:creator>
      <dc:date>2014-05-07T17:00:51Z</dc:date>
    </item>
    <item>
      <title>On the example originally</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007837#M105296</link>
      <description>On the example originally presented in this thread, using the mkl_inline_pp,  I get best performance at 2 threads (not significantly affected by SIZEARGS).
I am seeing a lack of optimization by opt-matmul with recent ifort versions when SIZEARGS is set, even with a MATMUL substitution in the source code.  Without SIZEARGS, Qopt-matmul gives 25% better performance than GEMM, with further improvement up to 4 threads.
So, the current opt-matmul seems to give an advantage not present in dgemm, when it works.
As Steve said, ifort opt-matmul uses its own entry point into MKL, unlike gfortran which uses gemm (but that is designed to work with MKL only on linux).
I was wondering, in view of comments about changes in opt-matmul, along with the advertising of the new include file interface for gemm, whether we should expect changes in support of opt-matmul.</description>
      <pubDate>Wed, 07 May 2014 18:06:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007837#M105296</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-07T18:06:08Z</dc:date>
    </item>
    <item>
      <title>Remarks on Patrick's comment:</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007838#M105297</link>
      <description>&lt;P&gt;Remarks on Patrick's comment:&lt;/P&gt;

&lt;P&gt;As I mentioned in ifortregression.txt, the two compiler versions have different strategies regarding the use of AVX instructions. In single-threaded mode both require the -mavx flag, and linking against MKL does not improve performance; execution speed is the same for both compiler versions. In parallel mode&amp;nbsp;the 14.0.2 compiler's matmul intrinsic uses AVX by default, while the 13.1.2 compiler ignores the -mavx flag and requires linking against MKL to use AVX. That's why I use different compiler flags for the two versions.&lt;/P&gt;

&lt;P&gt;Without SIZEARGS both compilers produce executables with the&amp;nbsp;same performance; in fact the 14.0.2 compiler is somewhat better, since both loop orders give the same speed, while with the 13.1.2 compiler loop order ikj is slower (and varies considerably when repeating the run). With SIZEARGS the 14.0.2 compiler does not replace the loops with the matmul intrinsic even when -opt-matmul is specified.&lt;/P&gt;

&lt;P&gt;I believe that allocatable arrays are the standard for production software; it is therefore regrettable that the latest compiler version gives slower code.&lt;/P&gt;

&lt;P&gt;What really worries me is that the execution time of dgemm increases significantly when it is called with allocated arrays. It looks as if the memory layout of the allocated arrays is not as optimized as it used to be with the 13.1.2 compiler.&lt;/P&gt;

&lt;P&gt;Axel&lt;/P&gt;</description>
      <pubDate>Thu, 08 May 2014 08:29:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007838#M105297</guid>
      <dc:creator>Axel_P_</dc:creator>
      <dc:date>2014-05-08T08:29:46Z</dc:date>
    </item>
    <item>
      <title>Is your point that there are</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007839#M105298</link>
      <description>&lt;P&gt;Is your point that there are too many compiler options and directives to play with?&lt;/P&gt;

&lt;P&gt;To my surprise, the option -align array32byte doesn't prove useful. &amp;nbsp;AVX and AVX2 are helpful only at -O3.&lt;/P&gt;

&lt;P&gt;As I suggested, the loop count directives make a huge difference (&amp;gt;10x, if not setting par-threshold) to your allocatable array case. &amp;nbsp;Do you consider that unacceptable? &amp;nbsp;They don't entirely eliminate the difference, presumably because the generated code must still allow for the run-time determination of array sizes and so it makes more code version branches.&lt;/P&gt;

&lt;P&gt;Auto-parallelization seems unusually effective for this case, once you learn about using loop count directives when you deny the compiler the fixed dimension information. &amp;nbsp;Given that various MKL solutions are effective, this doesn't make a strong case for auto-parallelization.&lt;/P&gt;
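
As a rough illustration only (this is not code from the thread; the subroutine name and the count of 4000 are taken from the test setup described above), the loop count directives Tim refers to are ifort's !DIR$ LOOP COUNT hints, placed ahead of each loop whose trip count the compiler cannot see when the arrays are allocatable:

```fortran
! Hypothetical sketch: hinting expected trip counts to ifort when the
! array dimensions are run-time values (allocatable arrays).
      subroutine mm(a, b, c, n)
      implicit none
      integer :: n, i, j, k
      double precision :: a(n,n), b(n,n), c(n,n)
!DIR$ LOOP COUNT (4000)
      do j = 1, n
!DIR$ LOOP COUNT (4000)
         do k = 1, n
!DIR$ LOOP COUNT (4000)
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do
      end subroutine mm
```

With fixed dimensions the compiler sees the counts itself; the directive only restores that information for the allocatable case.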

&lt;P&gt;This seems to show a weakness of the opt-matmul scheme in that the loop count directives don't work for that case.&lt;/P&gt;</description>
      <pubDate>Thu, 08 May 2014 12:37:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007839#M105298</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-08T12:37:00Z</dc:date>
    </item>
    <item>
      <title>Correction:</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007840#M105299</link>
      <description>&lt;P&gt;Correction:&lt;/P&gt;

&lt;P&gt;Some of the performance deficits I have detected look like an initialization effect of MKL. When I repeat the first nested loop (ifort13 -parallel -mkl) or the dgemm call (ifort14 -parallel -mkl) in the source, the second execution of the identical code sections runs at full speed. Sorry that I didn't notice that earlier. I do not yet understand why this initialization overhead varies so much when I repeat the runs.&lt;/P&gt;

&lt;P&gt;Two problems remain: Both compiler versions produce slow code for the jki loop order in single-threaded mode (when using -parallel, all 6 loop orders and the explicit matmul call run at the same speed). And compiler version 14.0.2 produces slower code when auto-parallelization is used in connection with allocatable arrays; this seems to be because specifying -mkl no longer offers the option of using MKL routines.&lt;/P&gt;</description>
      <pubDate>Fri, 09 May 2014 07:18:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007840#M105299</guid>
      <dc:creator>Axel_P_</dc:creator>
      <dc:date>2014-05-09T07:18:01Z</dc:date>
    </item>
    <item>
      <title>-mkl option is only a</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007841#M105300</link>
      <description>&lt;P&gt;-mkl option is only a shortcut for linking commonly used groups of MKL libraries. &amp;nbsp;It doesn't have effects such as implying -opt-matmul.&lt;/P&gt;</description>
      <pubDate>Fri, 09 May 2014 10:16:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007841#M105300</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-09T10:16:13Z</dc:date>
    </item>
    <item>
      <title>As far as compiler version 13</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007842#M105301</link>
      <description>&lt;P&gt;As far as compiler version 13.1.2 is concerned, -mkl does improve performance. I can only guess that identically named internal procedures are replaced by higher-performance MKL functions:&lt;/P&gt;

&lt;P&gt;&amp;gt; ifort --version&lt;BR /&gt;
	ifort (IFORT) 13.1.2 20130514&lt;BR /&gt;
	Copyright (C) 1985-2013 Intel Corporation.&amp;nbsp; All rights reserved.&lt;BR /&gt;
	&amp;gt; ifort -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS matmul.F&lt;BR /&gt;
	&amp;gt; OMP_NUM_THREADS=4 ./a.out&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.250&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.068&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 10.140&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.537&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 10.160&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.540&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.031&lt;BR /&gt;
	&amp;gt; ifort -O3 -parallel -par-threshold99 -mkl -DARDIM=4000 -DSIZEARGS matmul.F&lt;BR /&gt;
	&amp;gt; OMP_NUM_THREADS=4 ./a.out&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.210&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.056&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6.080&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.518&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.240&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.311&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.016&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;P&gt;With compiler version 14.0.2, -mkl has no influence on performance:&lt;/P&gt;

&lt;P&gt;&amp;gt; ifort --version&lt;BR /&gt;
	ifort (IFORT) 14.0.2 20140120&lt;BR /&gt;
	Copyright (C) 1985-2014 Intel Corporation.&amp;nbsp; All rights reserved.&lt;BR /&gt;
	&amp;gt; ifort -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS matmul.F&lt;BR /&gt;
	&amp;gt; OMP_NUM_THREADS=4 ./a.out&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.220&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.058&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 12.980&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3.246&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 12.810&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3.204&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.016&lt;BR /&gt;
	&amp;gt; ifort -O3 -parallel -par-threshold99 -mkl -DARDIM=4000 -DSIZEARGS matmul.F&lt;BR /&gt;
	&amp;gt; OMP_NUM_THREADS=4 ./a.out&lt;BR /&gt;
	&amp;nbsp;Running with array sizes&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000 by&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4000&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.220&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.058&amp;nbsp;&amp;nbsp;&amp;nbsp; init&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 12.810&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3.203&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ikj&lt;BR /&gt;
	&amp;nbsp; dtime:&amp;nbsp;&amp;nbsp;&amp;nbsp; 12.820&amp;nbsp;&amp;nbsp; real time:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3.205&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; jki&lt;BR /&gt;
	&amp;nbsp;Sum of elements:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 105312418747995.016&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 09 May 2014 12:14:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007842#M105301</guid>
      <dc:creator>Axel_P_</dc:creator>
      <dc:date>2014-05-09T12:14:03Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;I have detected look like</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007843#M105302</link>
      <description>&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;Some of the performance deficits I have detected look like an initialization effect of MKL. When I repeat the first nested loop (ifort13 -parallel -mkl) or the dgemm call (ifort14 -parallel -mkl) in the source, the second execution of the identical code sections runs at full speed. Sorry that I didn't notice that earlier. I do not yet understand why this initialization overhead varies so much when I repeat the runs.&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;The default MKL library is multi-threaded through use of OpenMP. The first call contains the overhead of creating the OpenMP thread pool (*** internal to MKL). Therefore, for timing purposes, the SOP is to discard the timing results for the first pass .OR. insert into your code, prior to the timed section, a call to MKL that you know establishes its thread pool.&lt;/P&gt;
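
A minimal sketch of that SOP (illustrative only; the dgemm call follows the standard BLAS interface, and cpu_time stands in for whatever timer the benchmark harness actually uses):

```fortran
! Warm up MKL's OpenMP thread pool with one untimed call, then time
! the call of interest.  Sizes and names here are illustrative.
      program warmup_timing
      implicit none
      integer, parameter :: n = 1000
      double precision, allocatable :: a(:,:), b(:,:), c(:,:)
      double precision :: t0, t1
      allocate (a(n,n), b(n,n), c(n,n))
      call random_number(a)
      call random_number(b)
! Warm-up call: result discarded, exists only to create the thread pool.
      call dgemm ('N', 'N', n, n, n, 1d0, a, n, b, n, 0d0, c, n)
! Timed call: measures steady-state dgemm performance.
      call cpu_time(t0)
      call dgemm ('N', 'N', n, n, n, 1d0, a, n, b, n, 0d0, c, n)
      call cpu_time(t1)
      print *, 'timed dgemm (cpu s):', t1 - t0
      end program warmup_timing
```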

&lt;P&gt;RE the ***&lt;/P&gt;

&lt;P&gt;When your application is multithreaded you might want to consider experimenting with linking against the single-threaded MKL. IOW each of your application threads can concurrently call MKL, where each call into MKL continues on the same thread. Should each of your application threads call the multithreaded&amp;nbsp;MKL concurrently, then you tend to oversubscribe: (# concurrent calls) * (number of threads spawned per MKL instance). On a 4-core/8-thread system this could explode to 64 threads if done improperly.&lt;/P&gt;
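
A sketch of the pattern Jim suggests (illustrative only; assumes the program is linked against the sequential MKL libraries, so each dgemm call stays on its caller's thread and no oversubscription occurs):

```fortran
! Each OpenMP thread calls the *sequential* MKL dgemm on its own
! problem; total thread count stays at the application's own count.
      program par_gemm
      implicit none
      integer, parameter :: n = 500, m = 8
      double precision, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
      integer :: t
      allocate (a(n,n,m), b(n,n,m), c(n,n,m))
      call random_number(a)
      call random_number(b)
!$OMP PARALLEL DO
      do t = 1, m
         call dgemm ('N', 'N', n, n, n, 1d0, a(1,1,t), n,
     &               b(1,1,t), n, 0d0, c(1,1,t), n)
      end do
!$OMP END PARALLEL DO
      end program par_gemm
```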

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 09 May 2014 12:42:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007843#M105302</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-05-09T12:42:29Z</dc:date>
    </item>
    <item>
      <title>If your application offers</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007844#M105303</link>
      <description>&lt;P&gt;If your application offers opportunity for parallelism at a higher level than individual MKL function calls, &amp;nbsp;that should be useful. &amp;nbsp;As Jim said, this might be done with the mkl sequential library. &amp;nbsp;&lt;/P&gt;

&lt;P&gt;Even if you call the threaded MKL from an OpenMP threaded region, MKL shouldn't use additional threads until you set OMP_NESTED. &amp;nbsp; &amp;nbsp;Even though the library may be able to prevent over-subscription, there aren't satisfactory methods to maintain data locality, so I agree in general with Jim's cautions.&lt;/P&gt;</description>
      <pubDate>Fri, 09 May 2014 13:15:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimization-problem-with-allocatable-arrays/m-p/1007844#M105303</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-05-09T13:15:08Z</dc:date>
    </item>
  </channel>
</rss>