Processors
Intel® Processors, Tools, and Utilities
14503 Discussions

## Optimizing the backward solve for a sparse lower triangular linear system

Beginner
696 Views

I have a routine for the back solve as follows:

```void backsolve(const int*__restrict__ Lp, const int*__restrict__ Li, const double*__restrict__ Lx, const int n, double*__restrict__ x) { for (int i=n-1; i>=0; --i) { for (int j=Lp[i]; j<Lp[i+1]; ++j) { x[i] -= Lx[j] * x[Li[j]]; } } }```

compiling with gcc-8.3 -mfma -mavx -mavx512 results in

```backsolve(int const*, int const*, double const*, int, double*): lea eax, [rcx-1] movsx r11, eax lea r9, [r8+r11*8] test eax, eax js .L9 .L5: movsx rax, DWORD PTR [rdi+r11*4] mov r10d, DWORD PTR [rdi+4+r11*4] cmp eax, r10d jge .L6 vmovsd xmm0, QWORD PTR [r9] .L7: movsx rcx, DWORD PTR [rsi+rax*4] vmovsd xmm1, QWORD PTR [rdx+rax*8] add rax, 1 vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8] vmovsd QWORD PTR [r9], xmm0 cmp r10d, eax jg .L7 .L6: sub r11, 1 sub r9, 8 test r11d, r11d jns .L5 ret .L9: ret```

Vtune says the line

`vmovsd QWORD PTR [r9], xmm0`

is taking the bulk of the time here. I asked on stackoverflow (https://stackoverflow.com/questions/60232977/optimizing-the-backward-solve-for-a-sparse-lower-triangular-linear-system), and the answers I got seem to suggest that there is not much I can do to speed up the function. Using MKL for the solve was also much slower.

System: Xeon Skylake

Appreciate any insight or suggestions!

1 Solution
Super User
575 Views

Wrong forum. You want to be [somewhere] here: https://software.intel.com/en-us/forum

Doc