About the non-zeros distribution used by the mkl_sparse_?_mv function

Forum: Intel® oneAPI Math Kernel Library

black__edgar — Wed, 15 Apr 2020 13:31:55 GMT

Hi all,

I am using the sparse matrix-vector multiplication operation in the MKL library.

I started with a CSR representation (the classical three arrays of the CSR format) and used the mkl_sparse_d_create_csr() function to create a "sparse_matrix_t" handle. Then I ran the mkl_sparse_optimize() function on the handle, and finally the mkl_sparse_d_mv() function for the desired operation.

It works. So far so good. The answers I am getting are correct.
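For anyone who wants to check results the same way, here is a minimal plain-C reference for the operation y = alpha*A*x + beta*y on a general, non-transposed matrix stored in the three classical CSR arrays. It is only a sketch for comparing against the library output, not MKL code: the function name csr_mv_reference is hypothetical, and zero-based indexing is assumed.

```c
#include <assert.h>

/* Reference y = alpha*A*x + beta*y for a CSR matrix (zero-based indexing).
   row_ptr has nrows+1 entries; col_idx and val have row_ptr[nrows] entries.
   Hypothetical helper for validation only; this is not an MKL routine. */
static void csr_mv_reference(int nrows, const int *row_ptr, const int *col_idx,
                             const double *val, double alpha, const double *x,
                             double beta, double *y)
{
    assert(nrows >= 0);
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        /* Non-zeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = alpha * sum + beta * y[i];
    }
}
```

The serial loop is deliberately simple so that it is easy to trust; its result can be compared element by element (within a floating-point tolerance) against the vector produced by the library call.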

I am able to control the number of threads used in the computation by setting the environment variable "OMP_NUM_THREADS". This also works as expected.

My question is:

How is the sparse matrix distributed among the threads?

Is the distribution based on a similar number of rows per thread?

Is it based on a similar number of non-zeros per thread?

Or something else?

One more question: Can the user manipulate the distribution?

Thanks

Gennady_F_Intel — Thu, 16 Apr 2020 04:51:28 GMT

MKL Sparse BLAS chooses among many strategies for handling the CSR format, and it applies all of them at different times.

What do you mean by "Can the user manipulate the distribution?"

black__edgar — Thu, 16 Apr 2020 13:43:09 GMT

Gennady F. (Blackbelt) wrote:
MKL Sparse BLAS chooses among many strategies for handling the CSR format, and it applies all of them at different times.
What do you mean by "Can the user manipulate the distribution?"

Thank you for your answer.

Basically, what I would like to know is:

How are the non-zeros distributed among the threads?

Is each thread assigned a similar number of non-zeros? Or

is each thread assigned a similar number of rows? Or

is something else done instead of the two strategies mentioned above?

Can the user indicate how the non-zeros should be distributed among the threads?

Thanks

Kirill_V_Intel — Sun, 19 Apr 2020 01:19:00 GMT

Hello Edgar,

IE SpBLAS (the Inspector-Executor Sparse BLAS) is intentionally designed so that the user does not need to know about the details you've mentioned. There are several reasons for this.

In short, the ideas of having opaque matrix handles and not exposing optimized data make it unreasonable to give the user low-level control over things like threads and work balancing. As a possibly non-obvious consequence, if you see that MKL routines (say, mkl_sparse_?_mv) don't show optimal performance for your application, and this affects the overall application performance, you can always tell us about your case and we'll investigate whether we can improve MKL to do better.

A more detailed explanation:
IE SpBLAS in MKL stores internal optimized data inside opaque matrix handles and can use optimized, non-exposed storage formats internally. Since the optimized data are not exposed, MKL takes care of parallelization and work balancing.

Different strategies are used for different formats (answering your questions about how the nnz are distributed among threads). As you might imagine, there is complicated dispatching inside IE SpBLAS, and the details (like the distribution of the work between threads) can change every time new optimizations are implemented. Of course, distributing the work by giving each thread a similar number of rows would be a bad idea, e.g., when CSR is used for the computations and the distribution of nnz over the rows is skewed. So, in such a case mkl_sparse_optimize will try to figure out the best internal format and the best work distribution for it.
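To make the point about skewed rows concrete, here is a small plain-C sketch that compares an equal-rows split with a split balanced by non-zeros, both computed from the CSR row pointer. It is purely illustrative and does not describe what MKL does internally (those details are not exposed); the function names split_by_rows, split_by_nnz and max_nnz_per_part are hypothetical, and zero-based indexing is assumed.

```c
#include <assert.h>

/* Split rows into nparts contiguous blocks with (nearly) equal row counts.
   bounds gets nparts+1 entries; part p owns rows [bounds[p], bounds[p+1]). */
static void split_by_rows(int nrows, int nparts, int *bounds)
{
    assert(nparts > 0);
    for (int p = 0; p <= nparts; ++p)
        bounds[p] = (int)((long long)nrows * p / nparts);
}

/* Split rows into nparts contiguous blocks with roughly equal non-zero
   counts, using the CSR row pointer (nrows+1 entries) as a prefix sum.
   Rows are never split, so one very heavy row still limits the balance. */
static void split_by_nnz(int nrows, const int *row_ptr, int nparts, int *bounds)
{
    assert(nparts > 0);
    int nnz = row_ptr[nrows];
    int row = 0;
    bounds[0] = 0;
    for (int p = 1; p < nparts; ++p) {
        long long target = (long long)nnz * p / nparts;
        /* First row boundary whose prefix sum reaches the target. */
        while (row < nrows && row_ptr[row] < target)
            ++row;
        /* Step back one row if that boundary is closer to the target. */
        int cut = row;
        if (cut > bounds[p - 1] &&
            target - row_ptr[cut - 1] < row_ptr[cut] - target)
            cut = cut - 1;
        bounds[p] = cut;
    }
    bounds[nparts] = nrows;
}

/* Largest number of non-zeros owned by any single part. */
static int max_nnz_per_part(const int *row_ptr, int nparts, const int *bounds)
{
    int worst = 0;
    for (int p = 0; p < nparts; ++p) {
        int n = row_ptr[bounds[p + 1]] - row_ptr[bounds[p]];
        if (n > worst)
            worst = n;
    }
    return worst;
}
```

For example, with 8 rows, the row pointer {0, 9, 10, 11, 12, 13, 14, 15, 16} (one row with 9 non-zeros followed by seven rows with 1 each) and two parts, the equal-rows split assigns 12 and 4 non-zeros, while the split by non-zeros assigns 9 and 7.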

The user can choose a suitable threading (e.g., OpenMP or TBB, the most commonly used options) and control the parallel runtime from the application (e.g., set the number of OpenMP threads and enable/disable nested parallelism) by means of standard routines and MKL service functions.

Best,
Kirill