<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic From a mere visual inspection in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125498#M25219</link>
    <description>&lt;P&gt;From a mere visual inspection of the assembly code generated, it strikes me that there is an unnecessary repeated memory write that is responsible for the slowdown.&lt;/P&gt;&lt;P&gt;Look at Line-8 of the C code and the corresponding Line-18 of the assembly code. While the innermost loop is executing, x&lt;I&gt; is never used on the right-hand-side, since none of the Li&lt;J&gt; can equal i. (The compiler, unfortunately, cannot know this, and I do not know if there are pragmas that can be used to pass that hint to the compiler.)&amp;nbsp;&lt;/J&gt;&lt;/I&gt;&lt;/P&gt;&lt;P&gt;Therefore, it should suffice to do something such as&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;double S  = x&lt;I&gt;;
for (int j=Lp&lt;I&gt;; j&amp;lt;Lp[i+1]; ++j) S -= Lx&lt;J&gt;*x[Li&lt;J&gt;];
x&lt;I&gt; = S;
&lt;/I&gt;&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;or&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;for (double S = 0, int j = Lp&lt;I&gt;; j &amp;lt; Lp[i+1]; ++j) S -= Lx&lt;J&gt;*x[Li&lt;J&gt;];
x&lt;I&gt; -= S;&lt;/I&gt;&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;and the compiler would see that S can be kept in a register (xmm0, for example) while the innermost loop is executing.&lt;/P&gt;
&lt;P&gt;If you cannot get your compiler to generate faster code with this change, you can generate assembly output from your current C code, and simply move the vmovsd instruction on Line-18&amp;nbsp;of the assembly listing to the line following .L6.&lt;/P&gt;
&lt;P&gt;The changes that I have suggested do not need much effort to implement. Try them and time the modified code.&lt;/P&gt;
&lt;P&gt;By the way, your matrix has some rather special features that are being exploited: lower triangular, sparse CSC storage, and unit diagonal. I would not expect a library such as MKL to provide an efficient routine for the special case rather than for a general sparse triangular matrix, unless there is stated demand for it from a number of users.&lt;/P&gt;</description>
    <pubDate>Thu, 20 Feb 2020 09:10:00 GMT</pubDate>
    <dc:creator>mecej4</dc:creator>
    <dc:date>2020-02-20T09:10:00Z</dc:date>
    <item>
      <title>Optimizing the backward solve for a sparse lower triangular linear system</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125496#M25217</link>
      <description>&lt;P&gt;I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in&lt;/P&gt;
&lt;PRE class="brush:plain; class-name:dark;"&gt;(A + I)' * x = b&lt;/PRE&gt;

&lt;P&gt;This is the routine I have for computing this:&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;void backsolve(const int*__restrict__ Lp,
               const int*__restrict__ Li,
               const double*__restrict__ Lx,
               const int n,
               double*__restrict__ x) {
  for (int i=n-1; i&amp;gt;=0; --i) {
      for (int j=Lp&lt;I&gt;; j&amp;lt;Lp[i+1]; ++j) {
          x&lt;I&gt; -= Lx&lt;J&gt; * x[Li&lt;J&gt;];
      }
  }
}&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;Thus,&amp;nbsp;b&amp;nbsp;is passed in via the argument&amp;nbsp;x, and is overwritten by the solution.&amp;nbsp;Lp,&amp;nbsp;Li,&amp;nbsp;Lx&amp;nbsp;are respectively the row, indices, and data pointers in the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;x&lt;I&gt; -= Lx&lt;J&gt; * x[Li&lt;J&gt;];&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;being the bulk of the time spent. Compiling with&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;gcc-8.3 -O3 -mfma -mavx -mavx512f&amp;nbsp;&lt;/PRE&gt;

&lt;P&gt;gives&lt;/P&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;backsolve(int const*, int const*, double const*, int, double*):
        lea     eax, [rcx-1]
        movsx   r11, eax
        lea     r9, [r8+r11*8]
        test    eax, eax
        js      .L9
.L5:
        movsx   rax, DWORD PTR [rdi+r11*4]
        mov     r10d, DWORD PTR [rdi+4+r11*4]
        cmp     eax, r10d
        jge     .L6
        vmovsd  xmm0, QWORD PTR [r9]
.L7:
        movsx   rcx, DWORD PTR [rsi+rax*4]
        vmovsd  xmm1, QWORD PTR [rdx+rax*8]
        add     rax, 1
        vfnmadd231sd    xmm0, xmm1, QWORD PTR [r8+rcx*8]
        vmovsd  QWORD PTR [r9], xmm0
        cmp     r10d, eax
        jg      .L7
.L6:
        sub     r11, 1
        sub     r9, 8
        test    r11d, r11d
        jns     .L5
        ret
.L9:
        ret&lt;/PRE&gt;

&lt;P&gt;Vtune says that&lt;/P&gt;

&lt;PRE class="brush:plain; class-name:dark;"&gt;vmovsd  QWORD PTR [r9], xmm0&lt;/PRE&gt;

&lt;P&gt;is the slowest part.&amp;nbsp;I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked. I've also tried using mkl_sparse_d_trsv to do this, but this is significantly slower too.&lt;/P&gt;
&lt;P&gt;Processor: Xeon Skylake&lt;/P&gt;
&lt;P&gt;OS: CentOS 7.5&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Question:&lt;/STRONG&gt;&amp;nbsp;what can I do to speed up this hotspot?&lt;/P&gt;
&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Sun, 16 Feb 2020 05:23:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125496#M25217</guid>
      <dc:creator>watch_the_crab</dc:creator>
      <dc:date>2020-02-16T05:23:03Z</dc:date>
    </item>
    <item>
      <title>Hello!</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125497#M25218</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;As you noticed, Intel MKL provides&amp;nbsp;mkl_sparse_d_trsv to perform this kind of operation. The purpose of the library is to provide optimized functionality so that the end user will not need to write code to perform a particular&amp;nbsp;operation but rather use an appropriate primitive from MKL, as a building block.&lt;/P&gt;&lt;P&gt;Can you elaborate about where exactly&amp;nbsp;mkl_sparse_d_trsv is slower?&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We'd appreciate if you share with us your workload (matrix and righthand side), more details&amp;nbsp;about the setup (is it sequential?) and how you link MKL. We'd like to check if our routine is really slower than your code and if so, provide a better solution.&lt;/P&gt;&lt;P&gt;Thanks,&lt;BR /&gt;Kirill&lt;/P&gt;</description>
      <pubDate>Wed, 19 Feb 2020 22:33:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125497#M25218</guid>
      <dc:creator>Kirill_V_Intel</dc:creator>
      <dc:date>2020-02-19T22:33:58Z</dc:date>
    </item>
    <item>
      <title>From a mere visual inspection</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125498#M25219</link>
      <description>&lt;P&gt;From a mere visual inspection of the assembly code generated, it strikes me that there is an unnecessary repeated memory write that is responsible for the slowdown.&lt;/P&gt;&lt;P&gt;Look at Line-8 of the C code and the corresponding Line-18 of the assembly code. While the innermost loop is executing, x&lt;I&gt; is never used on the right-hand-side, since none of the Li&lt;J&gt; can equal i. (The compiler, unfortunately, cannot know this, and I do not know if there are pragmas that can be used to pass that hint to the compiler.)&amp;nbsp;&lt;/J&gt;&lt;/I&gt;&lt;/P&gt;&lt;P&gt;Therefore, it should suffice to do something such as&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;double S  = x&lt;I&gt;;
for (int j=Lp&lt;I&gt;; j&amp;lt;Lp[i+1]; ++j) S -= Lx&lt;J&gt;*x[Li&lt;J&gt;];
x&lt;I&gt; = S;
&lt;/I&gt;&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;or&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;for (double S = 0, int j = Lp&lt;I&gt;; j &amp;lt; Lp[i+1]; ++j) S -= Lx&lt;J&gt;*x[Li&lt;J&gt;];
x&lt;I&gt; -= S;&lt;/I&gt;&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;and the compiler would see that S can be kept in a register (xmm0, for example) while the innermost loop is executing.&lt;/P&gt;
&lt;P&gt;If you cannot get your compiler to generate faster code with this change, you can generate assembly output from your current C code, and simply move the vmovsd instruction on Line-18&amp;nbsp;of the assembly listing to the line following .L6.&lt;/P&gt;
&lt;P&gt;The changes that I have suggested do not need much effort to implement. Try them and time the modified code.&lt;/P&gt;
&lt;P&gt;By the way, your matrix has some rather special features that are being exploited: lower triangular, sparse CSC storage, and unit diagonal. I would not expect a library such as MKL to provide an efficient routine for the special case rather than for a general sparse triangular matrix, unless there is stated demand for it from a number of users.&lt;/P&gt;</description>
      <pubDate>Thu, 20 Feb 2020 09:10:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125498#M25219</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2020-02-20T09:10:00Z</dc:date>
    </item>
  </channel>
</rss>

