Csrmultcsr utilizes simple parallelization strategy (row-wise for the first matrix).
Scaling is heavily depend on the matrix structures and sizes, and normally, the scalability is bound by the memory subsystem bandwidth.
We would expect to see better performance on Xeon Phi vs. Xeon, but scalability would normally be quite modest as memory bandwidth is exhausted pretty quickly.
I suppose that a chunk of rows of the first matrix is assigned to a thread. Is it possible to set chunk size and OpenMP's scheduling policy? Are there any other user-supplied parameters to reduce run time of mkl_?csrmultcsr on MIC? (Parameters other than OMP_NUM_THREADS and KMP_AFFINITY)
mkl_?csrmultcsr has the following parallelization strategy for non-transposed multiplication of two CSR matrices:first matrix is divided on chunks with more or less equal number of rows, and every chunk is assigned to a thread. Since the number of chunks is equal to number of threads, the chunk size can’t be set by the user outside MKL.
Could you please provide us with the use case of this function? Matrix sizes, sparsity pattern, input parameters, etc?
We can take a look at the testcase and then suggest possible steps in further tuning.