<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Optimized code and intrinsic functions for dense matrix/vector multiplication in Fortran in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600351#M172282</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a dense m by n matrix A, and two vectors, x and y (with dimensions n and m, respectively).&lt;/P&gt;&lt;P&gt;The value of m is between 10K and 10 Million, while the value of n is between 100 and 1000.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to compute the products A*x and transpose(A)*y&lt;/P&gt;&lt;P&gt;What would be the best way to code this in Intel Fortran? Specifically, I have the following questions:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;I was thinking that the multiplication A*x must be something like this:&lt;/P&gt;&lt;P&gt;z = 0&amp;nbsp; &amp;nbsp; &amp;nbsp; ! z is an m-component vector&lt;/P&gt;&lt;P&gt;do ii = 1,n&lt;/P&gt;&lt;P&gt;z(:) = z(:) + A(:,ii)*x(ii)&lt;/P&gt;&lt;P&gt;end do !ii&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;Is this the best way to program the do loop, and ensure that it will be vectorized?&lt;/LI&gt;&lt;LI&gt;Also, can I use an OMP PARALLEL DO loop for the multiplication? If yes, does invoking OMP PARALLEL DO have any effect on what is the best way to program the loop? .&lt;/LI&gt;&lt;LI&gt;What is the best way to program transpose(A)*y?&lt;/LI&gt;&lt;LI&gt;Are there any MKL intrinsic functions that I could use for the above operations, so that I am sure that the multiplication is as optimized (in terms of speed) as possible?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Many thanks in advance for the help.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 23 May 2024 21:03:53 GMT</pubDate>
    <dc:creator>Ioannis_K_</dc:creator>
    <dc:date>2024-05-23T21:03:53Z</dc:date>
    <item>
      <title>Optimized code and intrinsic functions for dense matrix/vector multiplication in Fortran</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600351#M172282</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a dense m by n matrix A, and two vectors, x and y (with dimensions n and m, respectively).&lt;/P&gt;&lt;P&gt;The value of m is between 10K and 10 Million, while the value of n is between 100 and 1000.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to compute the products A*x and transpose(A)*y&lt;/P&gt;&lt;P&gt;What would be the best way to code this in Intel Fortran? Specifically, I have the following questions:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;I was thinking that the multiplication A*x must be something like this:&lt;/P&gt;&lt;P&gt;z = 0&amp;nbsp; &amp;nbsp; &amp;nbsp; ! z is an m-component vector&lt;/P&gt;&lt;P&gt;do ii = 1,n&lt;/P&gt;&lt;P&gt;z(:) = z(:) + A(:,ii)*x(ii)&lt;/P&gt;&lt;P&gt;end do !ii&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;Is this the best way to program the do loop, and ensure that it will be vectorized?&lt;/LI&gt;&lt;LI&gt;Also, can I use an OMP PARALLEL DO loop for the multiplication? If yes, does invoking OMP PARALLEL DO have any effect on what is the best way to program the loop? .&lt;/LI&gt;&lt;LI&gt;What is the best way to program transpose(A)*y?&lt;/LI&gt;&lt;LI&gt;Are there any MKL intrinsic functions that I could use for the above operations, so that I am sure that the multiplication is as optimized (in terms of speed) as possible?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Many thanks in advance for the help.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2024 21:03:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600351#M172282</guid>
      <dc:creator>Ioannis_K_</dc:creator>
      <dc:date>2024-05-23T21:03:53Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized code and intrinsic functions for dense matrix/vector multiplication in Fortran</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600376#M172285</link>
      <description>&lt;P&gt;How big is z(:)?&lt;/P&gt;&lt;P&gt;That statement should vectorize.&lt;/P&gt;&lt;P&gt;However, if z(:), The expression to the right may create a temporary, generally on stack, and if z(:) has size of 10 million, you may experience a stack overflow.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If stack overflow occurs, create an inner loop&lt;/P&gt;&lt;LI-CODE lang="fortran"&gt;do jj=1,size(z)
  z(jj) = z(jj) + A(jj,ii) * x(ii)
end do&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This should also vectorize. Also consider using OpenMP on the do ii loop.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;MKL is very good for matrix operations (matmul, transpose, etc), but for simple operations such as the loop above, it may be better to simply code it in line.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Be aware of issues of oversubscription of OpenMP (or other) threaded applications together with the threaded MKL library.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2024 22:14:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600376#M172285</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2024-05-23T22:14:59Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized code and intrinsic functions for dense matrix/vector multiplication in Fortran</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600384#M172288</link>
      <description>&lt;P&gt;Jim,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;z(:) has the same size as the number of rows in A(:,:). So yes, a worst-case scenario would have 10 million elements in z.&lt;/P&gt;&lt;P&gt;I intend to have z(:) as a module variable with the SAVE statement (as I need it in other parts of my code). The coefficient matrix A(:,:) and the vector x(:) are also module variables. That is, they are not temporary variables in the routine.&lt;/P&gt;&lt;P&gt;Since you mentioned, I have benchmarked my code for cases where I use the stack or store everything in the heap (through the "&lt;SPAN class=""&gt;-heap-arrays&lt;/SPAN&gt;&lt;SPAN class=""&gt;&lt;SPAN&gt;&amp;nbsp;0&lt;/SPAN&gt;&lt;/SPAN&gt;" option), and I did not see any difference in speed.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The only question that I have remaining is:&lt;/P&gt;&lt;P&gt;What is the optimal way to code the multiplication&amp;nbsp;transpose(A)*y?&lt;/P&gt;&lt;P&gt;I store this result in a local vector variable v() (the maximum number of components in v() is 1000, so I do not imagine there would be any problem with stack overflow. My obvious approach would be to have a "transpose version" of my algorithm for A*x:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;v = 0&amp;nbsp; &amp;nbsp; &amp;nbsp; ! v is an n-component vector&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;do ii = 1,m&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;v(:) = v(:) + A(ii,:)*y(ii)&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;end do !ii&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Will this be optimal? My main concern is that the above loop relies on a linear combination of the rows of A(:,:), and the components of a given row are not in neighboring memory locations....&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2024 22:30:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600384#M172288</guid>
      <dc:creator>Ioannis_K_</dc:creator>
      <dc:date>2024-05-23T22:30:58Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized code and intrinsic functions for dense matrix/vector multiplication in Fortran</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600590#M172292</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&lt;EM&gt;I intend to have z(:) as a module variable with the SAVE statement&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Module variables are persistent and do not need to be attributed with SAVE.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;RE: inline transpose (swapping indices)&lt;/P&gt;&lt;P&gt;If you only reference transpose(A) once, or very few times, this should be more efficient.&lt;/P&gt;&lt;P&gt;Note, &lt;EM&gt;&lt;STRONG&gt;if&lt;/STRONG&gt;&lt;/EM&gt; the target CPU will have the instruction set containing SIMD scatter/gather&lt;/P&gt;&lt;P&gt;AVX2 with FMA&lt;/P&gt;&lt;P&gt;AVX-512PF&lt;/P&gt;&lt;P&gt;AVX-512F (all of the recent CPUs supporting AVX512 include the F variant)&lt;/P&gt;&lt;P&gt;AVX-512VL&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;then either set the compiler optimization for either:&lt;/P&gt;&lt;P&gt;a) only supporting CPU with scatter/gather (-x&amp;lt;code&amp;gt; or /Qx&amp;lt;code&amp;gt;, -march=&amp;lt;processor&amp;gt; or /arch=&amp;lt;processor&amp;gt;)&lt;/P&gt;&lt;P&gt;b) alternate path with&amp;nbsp;CPU with scatter/gather (-ax&amp;lt;code&amp;gt; or /Qax&amp;lt;code&amp;gt;)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, you may need to additionally coax in the gather simd instruction with &lt;A href="https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-1/simd-directive-for-openmp.html#GUID-179A38EE-4D40-4EB6-9717-4B625DDB1C27" target="_self"&gt;!$omp simd&lt;/A&gt;/!$omp end simd directives.&lt;/P&gt;&lt;P&gt;Verify with VTune disassembly.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 24 May 2024 11:48:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Optimized-code-and-intrinsic-functions-for-dense-matrix-vector/m-p/1600590#M172292</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2024-05-24T11:48:57Z</dc:date>
    </item>
  </channel>
</rss>

