Optimize Many Small SPARSE Matrix-Matrix Multiplications

Majid1 · ‎03-05-2020

I have many small sparse matrices and I want to multiply them together. Let's assume I can have them as small as desired. My main question is how I can do it efficiently with MKL.

I am aware of "Improved Small Matrix Performance Using Just-in-Time (JIT)", but it is for GEMM, so for dense matrices.

I am also aware of the Intel's open-source library LIBXSMM, but it only works when at least one of the matrices is dense. In my case, it would not be efficient to convert my sparse matrices to dense and then use one of these two methods/libraries.

I have attached my current version at the end. The input matrices are in my customized CSC format, which I believe the class member names are self-explanatory. I removed some lines to have it short and I don't have error-checking for the purpose of performance.

These are my questions and I would appreciate it if anyone could answer any of them:

1- The row and col_scan arrays in my code are in unsigned int, so I am casting them to int. I am aware of "narrowing the range" issue, but is there any problem with the way of passing my data to mkl_sparse_d_create_csc? Or is there a way to keep the range of unsigned int?

2- Is there a way to allocate memory for C and reuse it for all the calls to this function?

3- Is there a better way to rewrite this code? Or use another function?

4- Any suggestion for using another library suited for this case?

void MyClass::MKLSPMM(CSCMat &A, CSCMat &B, std::vector<cooEntry> &C, MPI_Comm comm){

        sparse_matrix_t Amkl = nullptr;
        mkl_sparse_d_create_csc(&Amkl, SPARSE_INDEX_BASE_ZERO, A.row_sz, A.col_sz, (int*)A.col_scan, (int*)(A.col_scan+1), (int*)A.r, A.v);

        sparse_matrix_t Bmkl = nullptr;
        mkl_sparse_d_create_csc(&Bmkl, SPARSE_INDEX_BASE_ZERO, B.row_sz, B.col_sz, (int*)B.col_scan, (int*)(B.col_scan+1), (int*)B.r, B.v);

        // Compute C = A * B
        sparse_matrix_t Cmkl = nullptr;
        mkl_sparse_spmm( SPARSE_OPERATION_NON_TRANSPOSE, Amkl, Bmkl, &Cmk );

    // the alternative version with sp2m
//        struct matrix_descr descr;
//        descr.type = SPARSE_MATRIX_TYPE_GENERAL;
//        mkl_sparse_sp2m(SPARSE_OPERATION_NON_TRANSPOSE, descr, Amkl,
//                        SPARSE_OPERATION_NON_TRANSPOSE, descr, Bmkl,
//                        SPARSE_STAGE_FULL_MULT, &Cmkl);

        mkl_sparse_d_export_csc( Cmkl, &indexing, &rows, &cols, &pointerB_C, &pointerE_C, &rows_C, &values_C );

        MKL_INT i, j, ii = 0;
        for (j = 0; j < B.col_sz; ++j) {
            for (i = pointerB_C; i < pointerE_C; ++i) {
                C.emplace_back(rows_C[ii], j, values_C[ii]);
                ++ii;
            }
        }

        mkl_sparse_destroy(Cmkl);
        mkl_sparse_destroy(Bmkl);
        mkl_sparse_destroy(Amkl);
}

Kirill_V_Intel · ‎03-05-2020

Hello,

1) Is there a particular reason you want your matrix indices to be unsigned int? IE SpBLAS supports both lp and ilp64 interfaces (64-bit indices).

2) Currently there is no such way in IE SpBLAS but we're considering whether we should extend the functionality. Let us know if it is critical for performance in your use case. Potentially, you could use old SpBLAS with mkl_dcscmultcsc which would allow allocating once an estimated storage where each output matrix fits and re-use it. But this routine is deprecated and we plan to replace it's functional coverage with IE SpBLAS completely.

3) I think the code looks fine.

My question to you: you said you need to multiply a lot of matrices.

a) How many matrices, which sizes (what is small in your case)? How sparse they are?

b) Do you want to multiply all matrices to form a single product A1*A2*...*An (chain product) or do you want something else? If you want a chain product, are you doing it with a tree-based approach or do you multiply them consequetively? Is there anything in common for the set of matrices?

Potentially, there are ways to make a chain product run faster than just multiply them with separate calls each time but not under the current API.

Best,
Kirill

Majid1 · ‎03-08-2020

Hello Kirin,

Thank you very much for your response.

1) Indices and column_scan cannot be negative, and using long would be much more expensive (especially communication-wise). So we have chosen unsigned int to have double the range of int. Can I ask the logic behind having signed datatypes to store indices or nonzero scan arrays?

2) I found the help page for mkl_?csrmultcsr, but I couldn't find anything for mkl_?cscmultcsc. Could you please point me to a guide page for it or any example? Even though it is deprecated, I want to check that.

3)

a) I am writing code for a general case, so all that depends on the input matrices to be multiplied, which can be arbitrary. So, we may have any number of matrix-multiplications, with any sparsity. To answer your question about how small they are, we can assume I can convert them to any size that we can have a fast sparse matrix-multiplication for.

b) It is actually not a chain matrix-product, and not based on a tree. I know it would help to have more information and more structure for my application. I would say they are independent matrix-products like this:

A1 * B1

some computations

A2 * B2

some computations

A3 * B3

and so on. And this can happen thousands of times so some memory management would help a lot.

I should add that we can guarantee that the sizes of the matrices are less than any arbitrary threshold. If the input matrix sizes are higher than that, we convert them to smaller matrices that satisfy the threshold.

Thank you again,

Majid.

Kirill_V_Intel · ‎03-09-2020

Hello Majid,

1) There are multiple discussions about using int or unsigned int, most of them agree that you should not use unsigned int at all. As an example:
"You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this." (https://google.github.io/styleguide/cppguide.html from Google's Style Guide)

2) I'm sorry, there is actually no explicit routine mkl_?cscmultcsc. However, you can use mkl_?csrmultcsr:

(A*B)_csc = (A*B)^T_csr = (B^T * A^T)_csr = (B_csc^T * A^T_csc)_csr = (B_csc^T * A^T_csc)_csr = (B_csr * A_csr)_csr.

So you can revert the order of arguments and pretend your arguments are in csr format, then mkl_?csrmultcsr will give you what you want (back in CSC format).

3) I see. We're investigating what is the best way to provide memory management control to the user in setups like yours, for adding to the IE SpBLAS.

I hope some of this helps.

Best,
Kirill

Hans_P_Intel · ‎03-10-2020

Regarding unsigned indexes, Fortran does not have unsigned integers for the reasons listed by Kirill. If compilers could like signed integral types more than unsigned integers, they would like signed integral loop induction variables to potentially better vectorize (simply spoken an unsigned index prevents a meaningful upper bound assumption for the loop induction variable aka overflow). Also, MKL routines shall serve Fortran, which is another reason. On top of the style-guide, I can recommend using a type-definition that introduces something like mysize_t or myuint_t but relying on an integer (signed). This way, one feels better calling the entity "a size" or an "unsigned" (positive) quantity.

Regarding your application, I would try minimizing/avoiding data-conversions or potential copy-in/out routines. For instance, it is unclear why you convert the result to COO as your input is in "customized CSC format" (std::vector::emplace_back can also cause reallocation or copy-to-new-position). The memory allocation issue can be perhaps avoided using a wisely designed interface that works with buffer that are allocated upfront. Every conversion will eat into your memory bandwidth constraint even more (making everything slower). You may also consider using sparse_matrix_t in the first place, which avoids calling mkl_sparse_d_create_csc inside of the core-routine. Conversion-wise, I believe there were MKL routines to convert between formats (perhaps even COO). Another advantage (beside of potential reuse of the ready-to-use data), the sparse_matrix_t allows to try-out other formats (beside of CSC). There are tons of sparse formats that all claim some advantage, and if I remember correctly MKL supports quite a few of them.

Have you checked if your application is block-sparse rather than using small sparse matices as the main primitives? I mean, having sparse inputs (both matrices) produces an (unpredictable) sparse result-pattern, which only gives you so much performance (you will be always constrained by irregular and bandwidth consuming operations). It might help to rely on zero fill-in in a smart fashion such that you get (partially) back to dense operations.

Majid1 · ‎03-11-2020

Hello Kirill,

Voronin, Kirill (Intel) wrote:
1) There are multiple discussions about using int or unsigned int, most of them agree that you should not use unsigned int at all. As an example:
"You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this." (https://google.github.io/styleguide/cppguide.html from Google's Style Guide)
2) I'm sorry, there is actually no explicit routine mkl_?cscmultcsc. However, you can use mkl_?csrmultcsr:
(A*B)_csc = (A*B)^T_csr = (B^T * A^T)_csr = (B_csc^T * A^T_csc)_csr = (B_csc^T * A^T_csc)_csr = (B_csr * A_csr)_csr.
So you can revert the order of arguments and pretend your arguments are in csr format, then mkl_?csrmultcsr will give you what you want (back in CSC format).
3) I see. We're investigating what is the best way to provide memory management control to the user in setups like yours, for adding to the IE SpBLAS.

1- Even though I am using uint because of the extra bit, I think I should reconsider using it if it is not very much accepted in the community.

2- That's a good point. I will try that.

3- Good to know that.

Thank you for your help.

Majid1 · ‎03-11-2020

Hi Hans,

Hans P. (Intel) wrote:
Regarding unsigned indexes, Fortran does not have unsigned integers for the reasons listed by Kirill. If compilers could like signed integral types more than unsigned integers, they would like signed integral loop induction variables to potentially better vectorize (simply spoken an unsigned index prevents a meaningful upper bound assumption for the loop induction variable aka overflow). Also, MKL routines shall serve Fortran, which is another reason. On top of the style-guide, I can recommend using a type-definition that introduces something like mysize_t or myuint_t but relying on an integer (signed). This way, one feels better calling the entity "a size" or an "unsigned" (positive) quantity.

That was a very good point about compilers and vectorization. I think I should switch to int then, at least for the auto-vectorization reason.

Hans P. (Intel) wrote:
Regarding your application, I would try minimizing/avoiding data-conversions or potential copy-in/out routines. For instance, it is unclear why you convert the result to COO as your input is in "customized CSC format" (std::vector::emplace_back can also cause reallocation or copy-to-new-position). The memory allocation issue can be perhaps avoided using a wisely designed interface that works with buffer that are allocated upfront. Every conversion will eat into your memory bandwidth constraint even more (making everything slower). You may also consider using sparse_matrix_t in the first place, which avoids calling mkl_sparse_d_create_csc inside of the core-routine. Conversion-wise, I believe there were MKL routines to convert between formats (perhaps even COO). Another advantage (beside of potential reuse of the ready-to-use data), the sparse_matrix_t allows to try-out other formats (beside of CSC). There are tons of sparse formats that all claim some advantage, and if I remember correctly MKL supports quite a few of them.

Sorry, I did not explain the COO part. I am storing the result matrix as COO, because the result is being processed for some other purposes and it is not being reused in MKL. If I wanted to reuse the result again in the same process, as you mentioned I would store them in the same format that was being passed to MKL to avoid the conversion costs. About allocating the buffer upfront, at first I wrote the matrix-matrix product myself and I allocated the memory (for the maximum size) once at first and reused it for all the products. But, since MKL is probably the fastest library to perform such kinds of operations, we decided to use MKL for this part. But now when using MKL, because of reallocating memory for each multiplication I may not gain as much as I expected. I will try Kiril's suggestion about mkl_?csrmultcsr to see if I can optimize that part. I will also look into using sparse_matrix_t as the matrix format.

Hans P. (Intel) wrote:
Have you checked if your application is block-sparse rather than using small sparse matices as the main primitives? I mean, having sparse inputs (both matrices) produces an (unpredictable) sparse result-pattern, which only gives you so much performance (you will be always constrained by irregular and bandwidth consuming operations). It might help to rely on zero fill-in in a smart fashion such that you get (partially) back to dense operations.

That's a very good point. I actually checked DBCSR and how they used your library, LIBXSMM, using the block-sparse row structure. I will consider that.

Thank you very much, Hans, for explaining very good points to me.