Hi there,

I have an "old" Fortran code (written about 10 years ago during my PhD study) that was developed to solve a large linear algebraic equation system. The coefficient matrix of the system is sparse and symmetric positive definite (SPD), so I used MKL's conjugate gradient (CG) iterative sparse solver together with the Sparse BLAS level-2 routine mkl_dcsrsymv for the matrix-vector product required when RCI_request = 1. At the time the code was initially developed, that was the right function to use in the CG routine. Recently, while working on improving the solver's performance by adding a preconditioner, I noticed a change: mkl_dcsrsymv and other similar routines are now deprecated, replaced by the new Sparse BLAS routine mkl_sparse_d_mv. The difference in how the two functions are used is not that significant, so I decided to update the code; the update itself was easy and straightforward. However, when I compared the computation time of the two versions on the same large equation system, the new CG routine took more than twice as long as the old one, even though the only difference is that mkl_sparse_d_mv replaces mkl_dcsrsymv. The total numbers of iterations are very close.
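For reference, the relevant part of the old CG routine looks roughly like this (a minimal sketch, not my exact code; the array names `a`, `ia`, `ja` for the upper-triangle CSR storage and the `tmp` workspace follow the MKL RCI CG convention):

```fortran
! Sketch of the MKL RCI CG loop using the deprecated kernel.
! n, b, x and the CSR arrays a/ia/ja are assumed to be set up already.
integer          :: rci_request, itercount, ipar(128)
double precision :: dpar(128)
double precision :: tmp(n,4)

call dcg_init(n, x, b, rci_request, ipar, dpar, tmp)
call dcg_check(n, x, b, rci_request, ipar, dpar, tmp)
do
   call dcg(n, x, b, rci_request, ipar, dpar, tmp)
   if (rci_request == 0) exit           ! converged
   if (rci_request == 1) then
      ! y = A*x for a symmetric CSR matrix (upper triangle stored)
      call mkl_dcsrsymv('U', n, a, ia, ja, tmp(1,1), tmp(1,2))
   end if
end do
call dcg_get(n, x, b, rci_request, ipar, dpar, tmp, itercount)
```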

I'm not sure whether the doubling of the computation time for the same equation system is really due to that single function change, or whether I'm not using the new function in the way required for optimal performance. I installed the oneAPI Base and HPC Toolkits for Linux back in 2020. Any insights would be much appreciated.

Hello @HCH298,

Thanks for reaching out to us about this. I would like to ask you some follow up questions:

1. Are you calling mkl_sparse_set_mv_hint() and mkl_sparse_optimize() before calling mkl_sparse_?_mv()?

2. What CPU hardware are you running your program on?

3. Are you using OpenMP threading or TBB threading?

4. You mentioned "The total number of iterations are very close" but also mentioned that "the only difference is the use of mkl_sparse_d_mv function replacing mkl_dcsrsymv function". I would expect the total number of iterations to be *exactly* the same if the *only* change is the MV operation, unless there is randomness in your code somewhere. Is there randomness in your code? Of course, performance varies run to run, but especially so if there are variations in the number of iterations (and therefore number of MV operations) as well.

Since you are moving from the old sparse BLAS APIs to the newer ones in oneMKL, the following information may be helpful if you do not already know it:

The new "Inspector-Executor Sparse BLAS" (IE SpBLAS) routines for MV and other operations now follow a two-stage process. The first stage is to *inspect* the matrix and the operation you are about to perform, and to optimize the matrix accordingly. The second stage is to *execute* the operation. For an MV operation, you would do the following:

1. **Analysis stage:** Call mkl_sparse_set_mv_hint() to give oneMKL hints about the operation you are about to perform (the expected number of MV calls and the transpose/non-transpose/conjugate-transpose case). Then call mkl_sparse_optimize(), wherein oneMKL performs matrix optimizations for that operation.

2. **Execution stage:** Call mkl_sparse_?_mv() to perform MV.
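A minimal Fortran sketch of the two stages, assuming a symmetric CSR matrix with the upper triangle stored in 1-based arrays `ia`, `ja`, `values` (variable names are illustrative, not a complete program):

```fortran
use mkl_spblas
type(SPARSE_MATRIX_T) :: A
type(MATRIX_DESCR)    :: descr
integer :: stat, expected_calls

! Create a CSR handle that wraps the existing arrays.
stat = mkl_sparse_d_create_csr(A, SPARSE_INDEX_BASE_ONE, n, n, &
                               ia(1:n), ia(2:n+1), ja, values)

descr%type = SPARSE_MATRIX_TYPE_SYMMETRIC
descr%mode = SPARSE_FILL_MODE_UPPER
descr%diag = SPARSE_DIAG_NON_UNIT

! Analysis stage: hint that many MV calls are coming, then optimize.
expected_calls = 1000   ! rough estimate of CG iterations
stat = mkl_sparse_set_mv_hint(A, SPARSE_OPERATION_NON_TRANSPOSE, &
                              descr, expected_calls)
stat = mkl_sparse_optimize(A)

! Execution stage (inside the CG loop): y = 1.0*A*x + 0.0*y
stat = mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0d0, A, &
                       descr, x, 0.0d0, y)

! After the solver finishes:
stat = mkl_sparse_destroy(A)
```

The key point is that the handle creation, hint, and optimize calls happen once before the iteration loop, so their cost is amortized over all MV calls.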

A code example is available in the oneMKL Fortran examples directory: (${MKLROOT}/examples/f/sparse_blas/source/sparse_csr.f90), in case it helps.

Hope that helps!

Gajanan Choudhary (developer on the oneMKL team)

@Gajanan_Choudhary Many thanks for your help. I followed your suggestion and added the two-stage process for the sparse MV operation. The computation time is now reduced by almost half compared with the old mkl_dcsrsymv version for the specific problem I ran. It solved my problem.

Regarding point no. 4 in your reply: no, there is no randomness in my code. The same number of iterations and the same total computation time are reproduced whenever the same sparse MV function is used. The number of iterations with mkl_sparse_d_mv is slightly larger than with mkl_dcsrsymv, but the total computation time is still reduced by almost half for this specific problem.

Hi,

Thanks for accepting our solution. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.

Best Regards,

Shanmukh.SS
