Processors
Processors (Intel® Core™, Intel® Xeon®, etc); processor utilities and programs (Intel® Processor Identification Utility, Intel® Extreme Tuning Utility, Intel® Easy Streaming Wizard, etc.)

## Optimizing the backward solve for a sparse lower triangular linear system


I have a routine for the back solve as follows:

```c
void backsolve(const int*__restrict__ Lp,
               const int*__restrict__ Li,
               const double*__restrict__ Lx,
               const int n,
               double*__restrict__ x) {
  for (int i=n-1; i>=0; --i) {
    for (int j=Lp[i]; j<Lp[i+1]; ++j) {
      x[i] -= Lx[j] * x[Li[j]];
    }
  }
}
```

Compiling with gcc-8.3 -mfma -mavx -mavx512 results in:

```
backsolve(int const*, int const*, double const*, int, double*):
        lea     eax, [rcx-1]
        movsx   r11, eax
        lea     r9, [r8+r11*8]
        test    eax, eax
        js      .L9
.L5:
        movsx   rax, DWORD PTR [rdi+r11*4]
        mov     r10d, DWORD PTR [rdi+4+r11*4]
        cmp     eax, r10d
        jge     .L6
        vmovsd  xmm0, QWORD PTR [r9]
.L7:
        movsx   rcx, DWORD PTR [rsi+rax*4]
        vmovsd  xmm1, QWORD PTR [rdx+rax*8]
        add     rax, 1
        vfnmadd231sd    xmm0, xmm1, QWORD PTR [r8+rcx*8]
        vmovsd  QWORD PTR [r9], xmm0
        cmp     r10d, eax
        jg      .L7
.L6:
        sub     r11, 1
        sub     r9, 8
        test    r11d, r11d
        jns     .L5
        ret
.L9:
        ret
```

VTune says the line

`vmovsd QWORD PTR [r9], xmm0`

is taking the bulk of the time here. I asked on stackoverflow (https://stackoverflow.com/questions/60232977/optimizing-the-backward-solve-for-a-sparse-lower-triang...), and the answers I got seem to suggest that there is not much I can do to speed up the function. Using MKL for the solve was also much slower.

System: Xeon Skylake

Appreciate any insight or suggestions!

1 Solution

Wrong forum. You want to be [somewhere] here: https://software.intel.com/en-us/forum

Doc