- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in

(A + I)' * x = b

This is the routine I have for computing this:

void backsolve(const int*__restrict__ Lp, const int*__restrict__ Li, const double*__restrict__ Lx, const int n, double*__restrict__ x) { for (int i=n-1; i>=0; --i) { for (int j=Lp; j<Lp[i+1]; ++j) { x-= Lx* x[Li ]; } } }

Thus, b is passed in via the argument x, and is overwritten by the solution. Lp, Li, Lx are respectively the row, indices, and data pointers in the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line

x-= Lx* x[Li ];

being the bulk of the time spent. Compiling with

gcc-8.3 -O3 -mfma -mavx -mavx512f

gives

backsolve(int const*, int const*, double const*, int, double*): lea eax, [rcx-1] movsx r11, eax lea r9, [r8+r11*8] test eax, eax js .L9 .L5: movsx rax, DWORD PTR [rdi+r11*4] mov r10d, DWORD PTR [rdi+4+r11*4] cmp eax, r10d jge .L6 vmovsd xmm0, QWORD PTR [r9] .L7: movsx rcx, DWORD PTR [rsi+rax*4] vmovsd xmm1, QWORD PTR [rdx+rax*8] add rax, 1 vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8] vmovsd QWORD PTR [r9], xmm0 cmp r10d, eax jg .L7 .L6: sub r11, 1 sub r9, 8 test r11d, r11d jns .L5 ret .L9: ret

Vtune says that

vmovsd QWORD PTR [r9], xmm0

is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked. I've also tried using mkl_sparse_d_trsv to do this, but this is significantly slower too.

Processor: Xeon Skylake

OS: CentOS 7.5

**Question:** what can I do to speed up this hotspot?

Thank you!

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hello!

As you noticed, Intel MKL provides mkl_sparse_d_trsv to perform this kind of operation. The purpose of the library is to provide optimized functionality so that the end user will not need to write code to perform a particular operation but rather use an appropriate primitive from MKL, as a building block.

Can you elaborate about where exactly mkl_sparse_d_trsv is slower?

We'd appreciate if you share with us your workload (matrix and righthand side), more details about the setup (is it sequential?) and how you link MKL. We'd like to check if our routine is really slower than your code and if so, provide a better solution.

Thanks,

Kirill

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

From a mere visual inspection of the assembly code generated, it strikes me that there is an unnecessary repeated memory write that is responsible for the slowdown.

Look at Line-8 of the C code and the corresponding Line-18 of the assembly code. While the innermost loop is executing, x* is never used on the right-hand-side, since none of the Li can equal i. (The compiler, unfortunately, cannot know this, and I do not know if there are pragmas that can be used to pass that hint to the compiler.) *

Therefore, it should suffice to do something such as

double S = x; for (int j=Lp; j<Lp[i+1]; ++j) S -= Lx*x[Li ]; x = S;

or

for (double S = 0, int j = Lp; j < Lp[i+1]; ++j) S -= Lx*x[Li ]; x -= S;

and the compiler would see that S can be kept in a register (xmm0, for example) while the innermost loop is executing.

If you cannot get your compiler to generate faster code with this change, you can generate assembly output from your current C code, and simply move the vmovsd instruction on Line-18 of the assembly listing to the line following .L6.

The changes that I have suggested do not need much effort to implement. Try them and time the modified code.

By the way, your matrix has some rather special features that are being exploited: lower triangular, sparse CSC storage, and unit diagonal. I would not expect a library such as MKL to provide an efficient routine for the special case rather than for a general sparse triangular matrix, unless there is stated demand for it from a number of users.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page