Intel® oneAPI Math Kernel Library

Performance of mkl_?csrmultcsr

jaewonj
Novice
Though it is admittedly difficult to implement a multi-threaded sparse-sparse level 3 function that scales well for all sparsity patterns, in most cases I see a 3-4x speedup when I run mkl_?csrmultcsr on my 8-core machine. However, I have found some matrices for which mkl_dcsrmultcsr shows extremely poor performance; it can be 50 times slower than a simple 3-for-loop implementation. Will this issue be addressed in MKL 10.2 update 2?

Best regards,

Jaewon



Example:

http://www.cise.ufl.edu/research/sparse/matrices/Hamm/memplus.html

- 17,758-by-17,758 with 126,150 nonzeros
- fill-in ratio: 5,121,784 / 126,150 = 40.6

On an Intel Xeon E5410 @ 2.33 GHz (8 cores) running 64-bit Vista, the sparse-sparse level 3 multiplication takes 5.42 seconds with mkl_dcsrmultcsr, while a simple 3-for-loop implementation (not threaded) needs only 0.12 seconds.
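For reference, the "simple 3-for-loop implementation" I compare against is essentially the classic row-by-row scatter/gather product, along the lines of the sketch below (zero-based CSR arrays, illustrative names; it assumes the output arrays are already allocated large enough):

```c
#include <stdlib.h>

/* Illustrative sketch only (zero-based CSR arrays, names are mine):
 * C = A*B where A is m-by-n (ia/ja/a) and B is n-by-k (ib/jb/b).
 * ic must have m+1 entries; jc and c must already be allocated large
 * enough for the result.  Column indices of each output row are left
 * in whatever order they are produced, i.e. unsorted. */
static void csr_mult_csr(int m, int n, int k,
                         const int *ia, const int *ja, const double *a,
                         const int *ib, const int *jb, const double *b,
                         int *ic, int *jc, double *c)
{
    double *acc  = calloc((size_t)k, sizeof *acc);   /* dense accumulator per output row      */
    int    *mark = malloc((size_t)k * sizeof *mark); /* last output slot that used column j   */
    int nnz = 0;
    (void)n;                                         /* inner dimension, implicit in ia/ib    */
    for (int j = 0; j < k; ++j) mark[j] = -1;

    for (int i = 0; i < m; ++i) {                    /* loop 1: rows of A                     */
        int row_start = nnz;
        ic[i] = nnz;
        for (int p = ia[i]; p < ia[i + 1]; ++p) {    /* loop 2: nonzeros of A(i,:)            */
            int    col = ja[p];
            double aij = a[p];
            for (int q = ib[col]; q < ib[col + 1]; ++q) {  /* loop 3: nonzeros of B(col,:)    */
                int j = jb[q];
                if (mark[j] < row_start) {           /* column j not yet seen in row i        */
                    mark[j] = nnz;
                    jc[nnz] = j;
                    acc[j]  = aij * b[q];
                    ++nnz;
                } else {
                    acc[j] += aij * b[q];
                }
            }
        }
        for (int p = row_start; p < nnz; ++p)        /* gather accumulated values             */
            c[p] = acc[jc[p]];
    }
    ic[m] = nnz;
    free(mark);
    free(acc);
}
```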


Chao_Y_Intel
Moderator

Jaewon,

Thanks for the report. We will look further into the performance problem.

Thanks,
Chao

Sergey_K_Intel1
Employee

Hello Jaewon,

Thank you very much for your report.

Judging by the timing results, I assume the routine is called with trans='n'. Is that right?

The routine was designed as a supporting routine for PARDISO, and as you know PARDISO requires that the column indices in each row be sorted in increasing order. So the routine you tested also sorts the elements of the result matrix, so that PARDISO can be called immediately after this matrix-matrix multiply without any additional computational work. Unfortunately, this data preparation for PARDISO, together with some checking of the output matrix, can take a significant amount of time.
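To give a rough idea of what that data preparation involves (this is only an illustrative sketch, not the code inside MKL): the column indices and the corresponding values inside every row of the output matrix have to be put into increasing column order, for example with a per-row insertion sort like the one below (zero-based indices for simplicity):

```c
/* Illustrative sketch (not MKL's implementation): put the column indices
 * of every row of a CSR matrix into increasing order, permuting the
 * values with them.  A per-row insertion sort; cheap when rows are
 * already nearly ordered, but roughly quadratic in the row length in
 * the worst case. */
static void csr_sort_rows(int m, const int *ia, int *ja, double *a)
{
    for (int i = 0; i < m; ++i) {
        for (int p = ia[i] + 1; p < ia[i + 1]; ++p) {
            int    col = ja[p];
            double val = a[p];
            int    q   = p - 1;
            while (q >= ia[i] && ja[q] > col) {   /* shift larger indices right */
                ja[q + 1] = ja[q];
                a[q + 1]  = a[q];
                --q;
            }
            ja[q + 1] = col;
            a[q + 1]  = val;
        }
    }
}
```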

Of course we will look at the performance issue you reported and do our best to further improve the performance of this routine.

Another way to improve performance would be to introduce an option that turns off sorting of the output matrix. Please let us know whether you need this kind of functionality (that is, unsorted output column index and value arrays) for this routine.

All the best

Sergey
jaewonj
Novice


Thanks Sergey.

Yes, I called mkl_dcsrmultcsr with trans = 'N'.

I think that's a brilliant idea. I need mkl_dcsrmultcsr to perform a sparse matrix triple product, and with the result matrix I only perform a sparse-dense level 2 operation, so the output matrix need not be sorted.
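For what it's worth, the reason sorted output doesn't matter for me: the subsequent level 2 operation only accumulates along each row, so a routine along these lines (illustrative sketch, zero-based indices) gives the same result whether or not the column indices within each row are sorted:

```c
/* Illustrative sketch: y = A*x for an m-row CSR matrix with zero-based
 * indices.  The inner sum does not depend on the order of the column
 * indices within a row (up to rounding), so unsorted output from the
 * preceding multiply is perfectly usable here. */
static void csr_matvec(int m, const int *ia, const int *ja,
                       const double *a, const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int p = ia[i]; p < ia[i + 1]; ++p)
            sum += a[p] * x[ja[p]];
        y[i] = sum;
    }
}
```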

Jaewon




Ian_Fraser
Beginner

I thought I would mention that I would also be interested in an option to provide unsorted output.

Another curiosity on the performance front: I am calling mkl_dcsrmultcsr with both trans='N' and trans='T'. While I was expecting a bit of a performance difference, what I got back was a bit of a shock.

Performing my own matrix transpose and then calling mkl_dcsrmultcsr completed in 0.07 s (including both functions).

But calling mkl_dcsrmultcsr with trans='T' completed in 3.562 s.


If I do the multiply without the transpose, it takes 0.015 s.

Is that the difference I should be expecting? Has anyone else seen similar results?
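
For reference, my own transpose is just the usual counting pass over the CSR arrays, roughly like the following (simplified sketch, zero-based indices, illustrative names):

```c
#include <stdlib.h>
#include <string.h>

/* Simplified sketch of an explicit CSR transpose with zero-based indices:
 * build B = A^T (A is m-by-n) in O(nnz + n) time by counting the entries
 * per column of A and then scattering.  ib needs n+1 entries; jb and b
 * need nnz entries.  As a side effect, the column indices of B come out
 * sorted within each row. */
static void csr_transpose(int m, int n,
                          const int *ia, const int *ja, const double *a,
                          int *ib, int *jb, double *b)
{
    int nnz = ia[m];
    int *next = malloc((size_t)n * sizeof *next);  /* next free slot in each row of B */

    memset(ib, 0, (size_t)(n + 1) * sizeof *ib);
    for (int p = 0; p < nnz; ++p)                  /* count entries per column of A   */
        ++ib[ja[p] + 1];
    for (int j = 0; j < n; ++j)                    /* prefix sum -> row pointers of B */
        ib[j + 1] += ib[j];
    memcpy(next, ib, (size_t)n * sizeof *next);

    for (int i = 0; i < m; ++i)                    /* scatter each entry of A into B  */
        for (int p = ia[i]; p < ia[i + 1]; ++p) {
            int q = next[ja[p]]++;
            jb[q] = i;
            b[q]  = a[p];
        }
    free(next);
}
```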

Thanks,
Ian





Sergey_K_Intel1
Employee

Dear Ian,

As I mentioned before, the routine does mandatory sorting of the column indices in the output matrix, since it was designed as a supporting routine for DSS/PARDISO, and PARDISO requires that the column indices in each row be sorted in increasing order. Since A^T * B and A * B are different sparse matrices, the cost of sorting is different, and the difference in time can be explained by the different cost of sorting. The computational complexity of the sort varies: it can be O(nnz), where nnz is the number of non-zeros in the output matrix, in the good case, or O(nnz^2) in the worst case.

A version with additional switches for turning off sorting of the output matrix is under development.

All the best
Sergey
Ian_Fraser
Beginner


Thanks again Sergey. That does make sense.

Though it's kind of off topic for this particular thread, I am curious whether a zero-based indexing option is in the works for ?csrmultcsr. I use zero-based indexing exclusively for my matrices, except for this particular function, which is less than ideal due to the extra passes over the index arrays to change them to one-based.
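
The extra passes I mean are just shifts of the two index arrays before the call and back afterwards, along these lines (illustrative sketch; the names and the zero-based convention are mine):

```c
/* Illustrative sketch of the extra passes a zero-based caller needs
 * around a one-based routine: shift both CSR index arrays by +1 before
 * the call and by -1 afterwards.  ia has m+1 entries. */
static void csr_shift_indices(int m, int *ia, int *ja, int offset)
{
    int nnz = (offset > 0) ? ia[m] : ia[m] - 1;    /* nnz under the current base */
    for (int i = 0; i <= m; ++i) ia[i] += offset;
    for (int p = 0; p < nnz;  ++p) ja[p] += offset;
}

/* usage: csr_shift_indices(m, ia, ja, +1);   call the one-based routine;
 *        csr_shift_indices(m, ia, ja, -1);   back to zero-based arrays   */
```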

Regards,
Ian
Sergey_K_Intel1
Employee

Dear Ian,

Please file a feature request through premier.intel.com for a zero-based indexing option for ?csrmultcsr.

Thanks in advance
All the best
Sergey