Solved: Optimizing the backward solve for a sparse lower triangular linear system

watch_the_crab · ‎02-16-2020

I have a routine for the back solve as follows:

void backsolve(const int*__restrict__ Lp,
               const int*__restrict__ Li,
               const double*__restrict__ Lx,
               const int n,
               double*__restrict__ x) {
  for (int i=n-1; i>=0; --i) {
      for (int j=Lp[i]; j<Lp[i+1]; ++j) {
          x[i] -= Lx[j] * x[Li[j]];
      }
  }
}

compiling with gcc-8.3 -mfma -mavx -mavx512 results in

backsolve(int const*, int const*, double const*, int, double*):
        lea     eax, [rcx-1]
        movsx   r11, eax
        lea     r9, [r8+r11*8]
        test    eax, eax
        js      .L9
.L5:
        movsx   rax, DWORD PTR [rdi+r11*4]
        mov     r10d, DWORD PTR [rdi+4+r11*4]
        cmp     eax, r10d
        jge     .L6
        vmovsd  xmm0, QWORD PTR [r9]
.L7:
        movsx   rcx, DWORD PTR [rsi+rax*4]
        vmovsd  xmm1, QWORD PTR [rdx+rax*8]
        add     rax, 1
        vfnmadd231sd    xmm0, xmm1, QWORD PTR [r8+rcx*8]
        vmovsd  QWORD PTR [r9], xmm0
        cmp     r10d, eax
        jg      .L7
.L6:
        sub     r11, 1
        sub     r9, 8
        test    r11d, r11d
        jns     .L5
        ret
.L9:
        ret

Vtune says the line

vmovsd QWORD PTR [r9], xmm0

is taking the bulk of the time here. I asked on stackoverflow (https://stackoverflow.com/questions/60232977/optimizing-the-backward-solve-for-a-sparse-lower-triangular-linear-system), and the answers I got seem to suggest that there is not much I can do to speed up the function. Using MKL for the solve was also much slower.

System: Xeon Skylake

Appreciate any insight or suggestions!

AlHill · ‎02-16-2020

Wrong forum. You want to be [somewhere] here: https://software.intel.com/en-us/forum

Doc

View solution in original post

AlHill · ‎02-16-2020

Wrong forum. You want to be [somewhere] here: https://software.intel.com/en-us/forum

Doc