- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have a routine for the back solve as follows:
void backsolve(const int*__restrict__ Lp,
const int*__restrict__ Li,
const double*__restrict__ Lx,
const int n,
double*__restrict__ x) {
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i]; j<Lp[i+1]; ++j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
}
compiling with gcc-8.3 -mfma -mavx -mavx512 results in
backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret
Vtune says the line
vmovsd QWORD PTR [r9], xmm0
is taking the bulk of the time here. I asked on stackoverflow (https://stackoverflow.com/questions/60232977/optimizing-the-backward-solve-for-a-sparse-lower-triangular-linear-system), and the answers I got seem to suggest that there is not much I can do to speed up the function. Using MKL for the solve was also much slower.
System: Xeon Skylake
Appreciate any insight or suggestions!
1 Solution
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Link Copied
1 Reply
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page