Optimizing the backward solve for a sparse lower triangular linear system

watch_the_crab · ‎02-15-2020

I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in

(A + I)' * x = b

This is the routine I have for computing this:

void backsolve(const int*__restrict__ Lp,
               const int*__restrict__ Li,
               const double*__restrict__ Lx,
               const int n,
               double*__restrict__ x) {
  for (int i=n-1; i>=0; --i) {
      for (int j=Lp; j<Lp[i+1]; ++j) {
          x -= Lx * x[Li];
      }
  }
}

Thus, b is passed in via the argument x, and is overwritten by the solution. Lp, Li, Lx are respectively the row, indices, and data pointers in the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line

x -= Lx * x[Li];

being the bulk of the time spent. Compiling with

gcc-8.3 -O3 -mfma -mavx -mavx512f

gives

backsolve(int const*, int const*, double const*, int, double*):
        lea     eax, [rcx-1]
        movsx   r11, eax
        lea     r9, [r8+r11*8]
        test    eax, eax
        js      .L9
.L5:
        movsx   rax, DWORD PTR [rdi+r11*4]
        mov     r10d, DWORD PTR [rdi+4+r11*4]
        cmp     eax, r10d
        jge     .L6
        vmovsd  xmm0, QWORD PTR [r9]
.L7:
        movsx   rcx, DWORD PTR [rsi+rax*4]
        vmovsd  xmm1, QWORD PTR [rdx+rax*8]
        add     rax, 1
        vfnmadd231sd    xmm0, xmm1, QWORD PTR [r8+rcx*8]
        vmovsd  QWORD PTR [r9], xmm0
        cmp     r10d, eax
        jg      .L7
.L6:
        sub     r11, 1
        sub     r9, 8
        test    r11d, r11d
        jns     .L5
        ret
.L9:
        ret

Vtune says that

vmovsd  QWORD PTR [r9], xmm0

is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked. I've also tried using mkl_sparse_d_trsv to do this, but this is significantly slower too.

Processor: Xeon Skylake

OS: CentOS 7.5

Question: what can I do to speed up this hotspot?

Thank you!

Kirill_V_Intel · ‎02-19-2020

Hello!

As you noticed, Intel MKL provides mkl_sparse_d_trsv to perform this kind of operation. The purpose of the library is to provide optimized functionality so that the end user will not need to write code to perform a particular operation but rather use an appropriate primitive from MKL, as a building block.

Can you elaborate about where exactly mkl_sparse_d_trsv is slower?

We'd appreciate if you share with us your workload (matrix and righthand side), more details about the setup (is it sequential?) and how you link MKL. We'd like to check if our routine is really slower than your code and if so, provide a better solution.

Thanks,
Kirill

mecej4 · ‎02-20-2020

From a mere visual inspection of the assembly code generated, it strikes me that there is an unnecessary repeated memory write that is responsible for the slowdown.

Look at Line-8 of the C code and the corresponding Line-18 of the assembly code. While the innermost loop is executing, x is never used on the right-hand-side, since none of the Li can equal i. (The compiler, unfortunately, cannot know this, and I do not know if there are pragmas that can be used to pass that hint to the compiler.)

Therefore, it should suffice to do something such as

double S  = x;
for (int j=Lp; j<Lp[i+1]; ++j) S -= Lx*x[Li];
x = S;

or

for (double S = 0, int j = Lp; j < Lp[i+1]; ++j) S -= Lx*x[Li];
x -= S;

and the compiler would see that S can be kept in a register (xmm0, for example) while the innermost loop is executing.

If you cannot get your compiler to generate faster code with this change, you can generate assembly output from your current C code, and simply move the vmovsd instruction on Line-18 of the assembly listing to the line following .L6.

The changes that I have suggested do not need much effort to implement. Try them and time the modified code.

By the way, your matrix has some rather special features that are being exploited: lower triangular, sparse CSC storage, and unit diagonal. I would not expect a library such as MKL to provide an efficient routine for the special case rather than for a general sparse triangular matrix, unless there is stated demand for it from a number of users.