topic Optimizing the backward solve for a sparse lower triangular linear system in IntelĀ® oneAPI Math Kernel Library
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125496#M25217
<P>I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in</P>
<PRE class="brush:plain; class-name:dark;">(A + I)' * x = b</PRE>
<P>This is the routine I have for computing this:</P>
<PRE class="brush:cpp; class-name:dark;">void backsolve(const int*__restrict__ Lp,
const int*__restrict__ Li,
const double*__restrict__ Lx,
const int n,
double*__restrict__ x) {
for (int i=n-1; i>=0; --i) {
for (int j=Lp<I>; j<Lp[i+1]; ++j) {
x<I> -= Lx<J> * x[Li<J>];
}
}
}</J></J></I></I></PRE>
<P>Thus, b is passed in via the argument x, and is overwritten by the solution. Lp, Li, Lx are respectively the row, indices, and data pointers in the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line</P>
<PRE class="brush:cpp; class-name:dark;">x<I> -= Lx<J> * x[Li<J>];</J></J></I></PRE>
<P>being the bulk of the time spent. Compiling with </P>
<PRE class="brush:plain; class-name:dark;">gcc-8.3 -O3 -mfma -mavx -mavx512f </PRE>
<P>gives</P>
<PRE class="brush:plain; class-name:dark;">backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret</PRE>
<P>Vtune says that</P>
<PRE class="brush:plain; class-name:dark;">vmovsd QWORD PTR [r9], xmm0</PRE>
<P>is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked. I've also tried using mkl_sparse_d_trsv to do this, but this is significantly slower too.</P>
<P>Processor: Xeon Skylake</P>
<P>OS: CentOS 7.5</P>
<P><STRONG>Question:</STRONG> what can I do to speed up this hotspot?</P>
<P>Thank you!</P>Sun, 16 Feb 2020 05:23:03 GMTwatch_the_crab2020-02-16T05:23:03ZOptimizing the backward solve for a sparse lower triangular linear system
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125496#M25217
<P>I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in</P>
<PRE class="brush:plain; class-name:dark;">(A + I)' * x = b</PRE>
<P>This is the routine I have for computing this:</P>
<PRE class="brush:cpp; class-name:dark;">void backsolve(const int*__restrict__ Lp,
const int*__restrict__ Li,
const double*__restrict__ Lx,
const int n,
double*__restrict__ x) {
for (int i=n-1; i>=0; --i) {
for (int j=Lp<I>; j<Lp[i+1]; ++j) {
x<I> -= Lx<J> * x[Li<J>];
}
}
}</J></J></I></I></PRE>
<P>Thus, b is passed in via the argument x, and is overwritten by the solution. Lp, Li, Lx are respectively the row, indices, and data pointers in the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line</P>
<PRE class="brush:cpp; class-name:dark;">x<I> -= Lx<J> * x[Li<J>];</J></J></I></PRE>
<P>being the bulk of the time spent. Compiling with </P>
<PRE class="brush:plain; class-name:dark;">gcc-8.3 -O3 -mfma -mavx -mavx512f </PRE>
<P>gives</P>
<PRE class="brush:plain; class-name:dark;">backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret</PRE>
<P>Vtune says that</P>
<PRE class="brush:plain; class-name:dark;">vmovsd QWORD PTR [r9], xmm0</PRE>
<P>is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked. I've also tried using mkl_sparse_d_trsv to do this, but this is significantly slower too.</P>
<P>Processor: Xeon Skylake</P>
<P>OS: CentOS 7.5</P>
<P><STRONG>Question:</STRONG> what can I do to speed up this hotspot?</P>
<P>Thank you!</P>Sun, 16 Feb 2020 05:23:03 GMThttps://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125496#M25217watch_the_crab2020-02-16T05:23:03ZHello!
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125497#M25218
<P>Hello!</P><P>As you noticed, Intel MKL provides mkl_sparse_d_trsv to perform this kind of operation. The purpose of the library is to provide optimized functionality so that the end user will not need to write code to perform a particular operation but rather use an appropriate primitive from MKL, as a building block.</P><P>Can you elaborate about where exactly mkl_sparse_d_trsv is slower? </P><P>We'd appreciate if you share with us your workload (matrix and righthand side), more details about the setup (is it sequential?) and how you link MKL. We'd like to check if our routine is really slower than your code and if so, provide a better solution.</P><P>Thanks,<BR />Kirill</P>Wed, 19 Feb 2020 22:33:58 GMThttps://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125497#M25218Kirill_V_Intel2020-02-19T22:33:58ZFrom a mere visual inspection
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125498#M25219
<P>From a mere visual inspection of the assembly code generated, it strikes me that there is an unnecessary repeated memory write that is responsible for the slowdown.</P><P>Look at Line-8 of the C code and the corresponding Line-18 of the assembly code. While the innermost loop is executing, x<I> is never used on the right-hand-side, since none of the Li<J> can equal i. (The compiler, unfortunately, cannot know this, and I do not know if there are pragmas that can be used to pass that hint to the compiler.) </J></I></P><P>Therefore, it should suffice to do something such as</P>
<PRE class="brush:cpp; class-name:dark;">double S = x<I>;
for (int j=Lp<I>; j<Lp[i+1]; ++j) S -= Lx<J>*x[Li<J>];
x<I> = S;
</I></J></J></I></I></PRE>
<P>or</P>
<PRE class="brush:cpp; class-name:dark;">for (double S = 0, int j = Lp<I>; j < Lp[i+1]; ++j) S -= Lx<J>*x[Li<J>];
x<I> -= S;</I></J></J></I></PRE>
<P>and the compiler would see that S can be kept in a register (xmm0, for example) while the innermost loop is executing.</P>
<P>If you cannot get your compiler to generate faster code with this change, you can generate assembly output from your current C code, and simply move the vmovsd instruction on Line-18 of the assembly listing to the line following .L6.</P>
<P>The changes that I have suggested do not need much effort to implement. Try them and time the modified code.</P>
<P>By the way, your matrix has some rather special features that are being exploited: lower triangular, sparse CSC storage, and unit diagonal. I would not expect a library such as MKL to provide an efficient routine for the special case rather than for a general sparse triangular matrix, unless there is stated demand for it from a number of users.</P>Thu, 20 Feb 2020 09:10:00 GMThttps://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Optimizing-the-backward-solve-for-a-sparse-lower-triangular/m-p/1125498#M25219mecej42020-02-20T09:10:00Z