If you attempt to develop your own code for SpMV using different sparse storage formats, you will find that SpMV is straightforward to parallelize with OpenMP for CSR and BSR. For COO and CSC, however, race conditions make the parallelization difficult. After several attempts, which included (a) atomic operations, (b) OpenMP's built-in reduction on the output vector, and (c) my own implementation of a vector reduction with OpenMP, I have found it impossible to speed up SpMV through multithreading when using the COO and CSC formats.
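For reference, here is a minimal sketch of what I mean (my own code, simplified; the array names are just placeholders). With CSR each thread owns whole rows of y, while with COO two threads can hit the same entry of y, so the update has to be atomic:

#include <omp.h>

/* CSR: y = A*x. Each y[i] is written by exactly one thread, so the outer
   loop parallelizes cleanly with OpenMP. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* COO: y = A*x (y must be zeroed beforehand). Different nonzeros can target
   the same y[row[k]], so without the atomic the threads race. */
void spmv_coo(int nnz, const int *row, const int *col,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < nnz; k++) {
        #pragma omp atomic
        y[row[k]] += val[k] * x[col[k]];
    }
}

A CSC kernel has the same scattered writes into y as COO (one column at a time), which is why the same problem shows up there.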
I compared my implementations with the state-of-the-art mkl_sparse_d_mv from MKL. For all the matrices I tested, there is no speedup whatsoever of SpMV with multithreading when using COO. With the CSC format there is a slight speedup, but nothing comparable to the speedup obtained with CSR.
My questions are the following:
1. Is SpMV (i.e., mkl_sparse_d_mv with operation=SPARSE_OPERATION_NON_TRANSPOSE) not parallel when using the COO format?
2. Is SpMV (i.e., mkl_sparse_d_mv with operation=SPARSE_OPERATION_NON_TRANSPOSE) not parallel when using the CSC format?
My impression is that the answer to 1. is yes, it is not parallelized, and that the answer to 2. is no, it is parallelized, but the scaling is not nearly as good as with CSR. I would like definite answers to these questions. If the answer to 2. is indeed no, is it possible to know how SpMV is parallelized in that case? Is it with OpenMP, or maybe TBB?
Hi Nicolas,
Thanks for posting in Intel Communities.
We would like to request that you share the MKL version, OS details, a sample reproducer, and steps to reproduce (if any). This helps us recreate the issue at our end and help you accordingly.
Besides this, could you please share the performance characteristics you observed?
Best Regards,
shanmukh.SS
Hello,
I am using Intel oneAPI 2022 on Linux.
For example, using the ecology1 matrix from the SuiteSparse Matrix Collection, the average runtime of mkl_sparse_d_mv over 100 calls is as follows:
- CSR: 8.1 ms (1 thread), 4.2 ms (2 threads), 1.8 ms (4 threads), 1.2 ms (8 threads), 0.6 ms (16 threads).
- COO: 8.1 ms (1 thread), 8.1 ms (2 threads), 9.1 ms (4 threads), 8.1 ms (8 threads), 9.3 ms (16 threads).
- CSC: 9.3 ms (1 thread), 6.2 ms (2 threads), 4.1 ms (4 threads), 13.6 ms (8 threads), 27.0 ms (16 threads).
I compiled my code as follows: icc -o main -DMKL_ILP64 -I${MKLROOT}/include -fopenmp main.c -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
where the icc version is 2021.6.
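To be precise about the measurement, each number above is obtained roughly like this (a simplified sketch, not the exact code of my benchmark; A is an MKL handle created beforehand in the format under test, and the mkl_sparse_optimize call is optional):

#include <mkl.h>
#include <omp.h>

/* Average time of one mkl_sparse_d_mv call, for a given thread count. */
double average_mv_time(sparse_matrix_t A, const double *x, double *y,
                       int n_calls, int n_threads)
{
    struct matrix_descr descr;
    descr.type = SPARSE_MATRIX_TYPE_GENERAL;

    mkl_set_num_threads(n_threads);   /* 1, 2, 4, 8 or 16 in the runs above */
    mkl_sparse_optimize(A);           /* optional analysis/optimize step    */

    double t0 = omp_get_wtime();
    for (int i = 0; i < n_calls; i++) /* n_calls = 100 in the runs above    */
        mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, A, descr,
                        x, 0.0, y);
    double t1 = omp_get_wtime();

    return (t1 - t0) / n_calls;       /* average seconds per call           */
}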
Hi Nicolas,
Thanks for sharing the details. Could you please share the sample reproducer as well? It helps us recreate the issue in our environment and help you accordingly. Based on this, we will share the performance details with the concerned team and raise a request to provide support for OpenMP threading for the CSC and COO data formats.
Best Regards,
Shanmukh.SS
Hello,
My question is about the general behavior of mkl_sparse_d_mv; I am not looking for support with a specific matrix, so I am not sure why you need a sample code. But here it is: I have attached the file mkl_test.c. You will have to download the .mtx file for the ecology1 matrix from the SuiteSparse Matrix Collection website.
The code is compiled as follows using icc:
icc -o main -DMKL_ILP64 -I${MKLROOT}/include -fopenmp mkl_test.c -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
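The gist of the attached file is roughly the following (a simplified sketch, not the exact attachment; the .mtx parsing is omitted, and n, nnz, rows, cols, vals stand for whatever the reader fills in — note that with -DMKL_ILP64 the index arrays must be MKL_INT):

#include <mkl.h>

/* Build a COO handle from the triplets read out of ecology1.mtx and let MKL
   convert it to CSR; a CSC handle would be created analogously with
   mkl_sparse_d_create_csc from column-sorted arrays. */
void make_handles(MKL_INT n, MKL_INT nnz,
                  MKL_INT *rows, MKL_INT *cols, double *vals,
                  sparse_matrix_t *A_coo, sparse_matrix_t *A_csr)
{
    mkl_sparse_d_create_coo(A_coo, SPARSE_INDEX_BASE_ZERO,
                            n, n, nnz, rows, cols, vals);

    mkl_sparse_convert_csr(*A_coo, SPARSE_OPERATION_NON_TRANSPOSE, A_csr);
}

The mkl_sparse_d_mv calls themselves are then timed exactly as in the snippet in my previous reply.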
Hi Nicolas,
Thanks for sharing the details. Regarding your initial questions: yes, neither COO nor CSC is threaded.
Thanks for helping us improve our products! We have submitted the feature request to the development team; they will consider it based on multiple factors, including but not limited to the priority and criticality of the feature. Once it is included in an upcoming release, it will be documented in the release notes.
Best Regards,
Shanmukh.SS
Thank you! That's all I wanted to know :-).
