Best regards,
Jaewon
Example:
http://www.cise.ufl.edu/research/sparse/matrices/Hamm/memplus.html
- 17,758-by-17,758 with 126,150 nonzeros
- fill-in ratio: 5,121,784 / 126,150 = 40.6
On an Intel Xeon E5410 @ 2.33 GHz (8 cores) running 64-bit Vista, I need 5.42 seconds to perform a sparse-sparse level 3 multiplication using mkl_dcsrmultcsr, while I only need 0.12 seconds with a simple three-for-loop implementation (not threaded).
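For reference, here is a minimal sketch (not Jaewon's actual code) of the kind of simple, unthreaded implementation described above: a Gustavson-style CSR-times-CSR multiply that uses a dense accumulator per row. Zero-based indexing and the type and field names are assumptions for illustration; the output rows come out unsorted.

/* Minimal, unthreaded CSR * CSR multiply (C = A * B). A is m-by-k, B is k-by-n.
 * Illustrative sketch only, zero-based CSR arrays assumed. */
#include <stdlib.h>

typedef struct {
    int m, n;        /* dimensions */
    int *rowptr;     /* size m + 1 */
    int *colind;     /* size nnz, unsorted within each row */
    double *val;     /* size nnz */
} csr_t;

static csr_t csr_multiply(const csr_t *A, const csr_t *B)
{
    int m = A->m, n = B->n;
    int *mark = malloc((size_t)n * sizeof(int));      /* last row that touched column j */
    double *acc = calloc((size_t)n, sizeof(double));  /* dense accumulator for one row of C */
    int *rowptr = calloc((size_t)m + 1, sizeof(int));
    for (int j = 0; j < n; j++) mark[j] = -1;

    /* Pass 1: count the nonzeros in each row of C = A * B. */
    for (int i = 0; i < m; i++) {
        int count = 0;
        for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; p++)
            for (int q = B->rowptr[A->colind[p]]; q < B->rowptr[A->colind[p] + 1]; q++)
                if (mark[B->colind[q]] != i) { mark[B->colind[q]] = i; count++; }
        rowptr[i + 1] = rowptr[i] + count;
    }

    /* Pass 2: fill the column indices and values of C. */
    csr_t C = { m, n, rowptr,
                malloc((size_t)rowptr[m] * sizeof(int)),
                malloc((size_t)rowptr[m] * sizeof(double)) };
    for (int j = 0; j < n; j++) mark[j] = -1;
    for (int i = 0; i < m; i++) {
        int head = rowptr[i];
        for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; p++) {
            int ak = A->colind[p];
            double av = A->val[p];
            for (int q = B->rowptr[ak]; q < B->rowptr[ak + 1]; q++) {
                int j = B->colind[q];
                if (mark[j] != i) {                 /* first contribution to column j */
                    mark[j] = i;
                    C.colind[head++] = j;
                    acc[j] = av * B->val[q];
                } else {
                    acc[j] += av * B->val[q];
                }
            }
        }
        for (int p = rowptr[i]; p < head; p++)      /* gather accumulator into C */
            C.val[p] = acc[C.colind[p]];
    }
    free(mark);
    free(acc);
    return C;
}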
Jaewon,
Thanks for the report. We will look further into the performance problem.
Thanks,
Chao
Hello Jaewon,
Thank you very much for your report.
Judging by the timing results, I assume the routine is called with trans='n'. Is that right?
The routine was designed as a supporting routine for PARDISO, and as you know, PARDISO requires that the column indices in each row be sorted in increasing order. So the routine you tested also sorts the elements of the result matrix, so that PARDISO can be called immediately after this matrix-matrix multiply without any additional computational work. Unfortunately, this data preparation for PARDISO, together with some checking of the output matrix, can take a significant amount of time.
Of course we will look at the performance issue you reported, and we will do our best to further improve the performance of this routine.
The other option for improving performance is to introduce a switch that turns off sorting of the output matrix. Please let us know if you need this kind of functionality (that is, unsorted output column index and value arrays) for this routine.
All the best
Sergey
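For illustration, here is a minimal sketch of the per-row sorting step described above. This is not MKL's internal code; the function name, array names, and zero-based indexing are assumptions. It sorts the column indices (and matching values) within each row of a CSR matrix into increasing order, which is the form PARDISO/DSS expects, using a per-row insertion sort (cheap when rows are nearly sorted, quadratic in the row length otherwise).

/* Sort column indices and values within each row of a CSR matrix
 * into increasing column order. Illustrative sketch only. */
static void csr_sort_rows(int m, const int *rowptr, int *colind, double *val)
{
    for (int i = 0; i < m; i++) {
        for (int p = rowptr[i] + 1; p < rowptr[i + 1]; p++) {
            int    cj = colind[p];
            double cv = val[p];
            int q = p - 1;
            while (q >= rowptr[i] && colind[q] > cj) {
                colind[q + 1] = colind[q];   /* shift larger indices right */
                val[q + 1]    = val[q];
                q--;
            }
            colind[q + 1] = cj;
            val[q + 1]    = cv;
        }
    }
}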
Thanks Sergey.
Yes, I called mkl_dcsrmultcsr with trans = 'N'.
I think that's a great idea. I need mkl_dcsrmultcsr to perform a sparse matrix triple product, and with the result matrix I only perform sparse-dense level 2 operations, so the output matrix does not need to be sorted.
Jaewon
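As an aside, this is why unsorted output is harmless for a level 2 operation: a CSR matrix-vector product simply accumulates over each row, so the order of the column indices within a row does not affect the result. A minimal sketch (not Jaewon's code), with zero-based indexing and array names assumed:

/* y = A * x for a CSR matrix A (m rows). Column order within a row is irrelevant. */
static void csr_matvec(int m, const int *rowptr, const int *colind,
                       const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
            sum += val[p] * x[colind[p]];   /* accumulation is order-independent */
        y[i] = sum;
    }
}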
I thought I would mention that I would also be interested in an option to provide unsorted output.
Another curiosity on the performance front: I am calling mkl_dcsrmultcsr with both trans='N' and trans='T'. While I was expecting a bit of a performance difference, what I got back was a bit of a shock.
Performing my own matrix transpose and then calling mkl_dcsrmultcsr completed in 0.07 s (including both functions),
but calling mkl_dcsrmultcsr with trans='T' completed in 3.562 s.
If I do the multiply without the transpose, it takes 0.015 s.
Is that the difference I should be expecting? Has anyone else seen similar results?
Thanks,
Ian
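For context, a sketch of the workaround Ian describes (not his actual code): build A^T explicitly with an O(nnz) counting-sort style pass, then call the multiply with trans='N'. Array names and zero-based indexing are assumptions. A is m-by-n; the result is n-by-m, and its rows come out with sorted column indices as a side effect.

/* Explicit CSR transpose: (t_rowptr, t_colind, t_val) receive A^T.
 * t_rowptr must have n + 1 entries; t_colind and t_val must have nnz entries. */
#include <stdlib.h>

static void csr_transpose(int m, int n,
                          const int *rowptr, const int *colind, const double *val,
                          int *t_rowptr, int *t_colind, double *t_val)
{
    int nnz = rowptr[m];
    for (int j = 0; j <= n; j++) t_rowptr[j] = 0;

    /* Count entries per column of A (= per row of A^T), then prefix-sum. */
    for (int p = 0; p < nnz; p++) t_rowptr[colind[p] + 1]++;
    for (int j = 0; j < n; j++)  t_rowptr[j + 1] += t_rowptr[j];

    /* Scatter entries into A^T, advancing a cursor per output row. */
    int *next = malloc((size_t)n * sizeof(int));
    for (int j = 0; j < n; j++) next[j] = t_rowptr[j];
    for (int i = 0; i < m; i++) {
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++) {
            int j = colind[p];
            t_colind[next[j]] = i;
            t_val[next[j]]    = val[p];
            next[j]++;
        }
    }
    free(next);
}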
Dear Ian,
As I mentioned before, the routine does mandatory sorting of the column indices in the output matrix, since it was designed as a supporting routine for DSS/PARDISO, and PARDISO requires that the column indices in each row be sorted in increasing order. Since A^T * B and A * B are different sparse matrices, the cost of sorting is different, and the difference in time can be explained by that difference in sorting cost. The computational complexity of the sort varies: it can be O(nnz), where nnz is the number of non-zeros in the output matrix, in the good case, or O(nnz*nnz) in the worst case.
A version with additional switches for turning off sorting of the output matrix is under development.
All the best
Sergey
Thanks again, Sergey. That does make sense.
Though it's somewhat off topic for this particular thread, I am curious whether a zero-based indexing option is in the works for ?csrmultcsr. I use zero-based indexing exclusively for my matrices, except for this particular function, which is less than ideal because of the extra passes over the index arrays to convert them to one-based.
Regards,
Ian
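For illustration, the extra index-shifting passes Ian refers to might look like the following. This is a hypothetical helper, not part of the MKL API: shift zero-based CSR index arrays to one-based before calling a Fortran-convention routine and back afterwards. Trivial code, but on large matrices these O(nnz) sweeps over the inputs and outputs are pure overhead.

/* Shift CSR index arrays by delta (+1 to go one-based, -1 to go back). */
static void csr_shift_indexing(int m, int *rowptr, int *colind, int delta)
{
    int nnz = rowptr[m] - rowptr[0];          /* works for either base */
    for (int i = 0; i <= m; i++) rowptr[i] += delta;
    for (int p = 0; p < nnz; p++) colind[p] += delta;
}

/* Usage: csr_shift_indexing(m, ia, ja, +1) before the call,
 *        csr_shift_indexing(m, ia, ja, -1) afterwards (and on the output arrays). */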
Dear Ian,
Please file a feature request through premier.intel.com for a zero-based indexing option for ?csrmultcsr.
Thanks in advance.
All the best
Sergey
