30-35% of speedup - thanks

Vineet_Y_ · ‎12-09-2014

Hi

I have used mkl_dcsrmultcsr in my research. However, it is performing double pass to compute sparse*sparse matrix product. For small size problems, this is not a problem, however for large size problems (e.g. matrices of size ½ billion by ½ billion) this is time consuming and it would be better if MKL can do this multiplication in a single pass in a parallel setup.

Most (if not all) of the sparse*sparse CSR matrix multiplications algorithm use Gustavson Algorithm (ACM 1978) and there is no reason why this algorithm cannot be parallelized and do calculations in a single pass. I understand that the performance of a single pass parallelization would depend on pre-allocating the space non-zero values, which I think can be reasonably given in most situations and even if this does not work the algorithm should be able to adjust the buffer size (if required).

Similarly, it would be useful to only compute lower/upper triangular portion of the output matrix (of course the output matrix have to be symmetric).

Application domain: Statistics, PDE’s, Inverse Problems, Weather Prediction.

Thanks

Vineet

Gennady_F_Intel · ‎12-12-2014

Vineet,

the current API doesn't allow to do that but we are thinking about such option into the next API we are working on right now.

--Gennady

Vineet_Y_ · ‎12-17-2014

Hi

Gennady

After spending considerable amount of time last week I was able to come up with my own single pass parallel solution which is ~ 30 to 35% faster than mkl_dcsrmultcsr. It can also work in a hybrid setup (MPI+Openmp) and gives the output matrix in a sorted form. Any help (if possible ; no compulsions) in optimizing the attached code would be greatly appreciated as for large matrices it can save days worth of work.

Vineet

Gennady_F_Intel · ‎12-17-2014

30-35% of speedup - thanks for sharing this. Are these any specific input matrixes? what is the typical size, nnz and type of this matrices? what is the CPU you are running this code?

Vineet_Y_ · ‎12-18-2014

Hi

Gennady

I have tested the code with matrices 1070*1000000 (nzmax = 85534075) multiplied by a matrix 1000000*1070 (nzmax = 85534075) on (I do not know generation info for them):

2.8 GHz Intel Core i7, 16GB RAM
3.4 GHz Intel Core i7, 16GB RAM
2.80 Ghz Intel (RR) Xeon, 96 GB RAM
2.4 Ghz Intel i7, 8GB RAM.

You can test the code for your own matrices. I have to refactor the code .

Vineet

Gennady_F_Intel · ‎12-19-2014

thanks Vineet. Your suggestions looks reasonable to implement. I will bring this request to the MKL release board for consideration and may be for the implementation into one of the future versions.

regards, Gennady

Sparse-Sparse Matrix Multiplication