mkl_dcsrmv slower than OpenMP implementation

AJ_F_ · 03-04-2014

Hi,

I'm trying to find the fastest way to do a multithreaded sparse matrix-vector multiply. I've written some benchmarking code that builds a large random sparse matrix in CSR format and then times three implementations of y = y + A*x: a serial implementation, an OpenMP implementation, and mkl_dcsrmv. I'm computing the average and minimum time over a number of runs, say, 10.
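For readers without the attachment, a kernel of the kind described above might look roughly like the following. This is only a minimal sketch, not the attached rand_mat.c: the function name csr_spmv_add, the zero-based indexing, and the static schedule are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical helper (not from rand_mat.c): computes y = y + A*x for an
 * m-row CSR matrix with zero-based indexing.
 * row_ptr has m+1 entries; col_idx and val each have row_ptr[m] entries. */
void csr_spmv_add(int m, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
{
    int i;

    assert(row_ptr && col_idx && val && x && y);

    /* Each row writes only y[i], so rows can be split across threads.
     * If OpenMP is not enabled at compile time, the pragma is ignored
     * and the loop simply runs serially. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] += sum;
    }
}
```

Built without OpenMP, the same function serves as the serial baseline, so both timings exercise identical inner-loop code.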

Strangely, though, the OpenMP implementation always beats MKL. For the matrix sizes in the code, OpenMP has a minimum time of 0.199272 seconds over 10 runs, while MKL has a minimum time of 0.249399 seconds. This is for a matrix with about 256 million nonzeros.

I'm running this on a machine with 32 cores. I've adjusted the number of threads and played with the KMP_AFFINITY environment variable. The OpenMP code does better in every case.

Any idea why I'm getting these results? Perhaps I'm using MKL sub-optimally? Any help would be greatly appreciated.

I've attached the code I'm running. I compile with "icc -mkl -openmp rand_mat.c"

Thanks,

AJ

TimP · 03-04-2014

MKL introduced an optimized dcsrmv only a year ago, so your conclusion would be expected for earlier versions.

If you call MKL inside a parallel region, by default it will not use additional threads.

AJ_F_ · 03-04-2014

I'm using Intel Composer XE 2013 SP1, which would come with MKL 11.1, so the version shouldn't be an issue, yes?

I'm not calling MKL in a parallel region. I'm using OpenMP only in my other SpMV implementation. Also, I can see MKL speed up as I add threads, so multiple threads are definitely being used.

What else might be causing it? Are there any tricks I'm missing to maximize the mkl speedup?

AJ

Chao_Y_Intel · 03-04-2014

AJ,

Thanks for the test code. We will look into it further. By the way, which processor did you see this issue on?

regards,
Chao

AJ_F_ · 03-05-2014

I'm running on an Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz

I have 32 cores over 4 sockets.

Chao_Y_Intel · 06-14-2015

Hi AJ,

The engineer who owns this component has investigated the performance problem.
The matrix uses unsorted column indexes, which leads to ineffective cache utilization during the row-by-vector multiplication.
This also explains why increasing the number of parallel jobs yields more GFlop/s: jobs waiting on cache misses give way to other jobs whose data is already in the cache.
So the suggestion is to sort the column indexes within each row before calling MKL CSRMV. You should then get the full performance benefit of MKL.
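
One way to do that sort is sketched below. It is only an illustration under stated assumptions: the arrays are taken to be zero-based CSR, and the names csr_sort_columns and csr_entry are hypothetical helpers, not part of MKL. Each row's (column, value) pairs are copied into a scratch buffer, sorted by column with the standard qsort, and copied back.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical scratch type: one nonzero, so that the column index and
 * its value stay together while the row is being sorted. */
typedef struct { int col; double val; } csr_entry;

static int cmp_entry(const void *a, const void *b)
{
    int ca = ((const csr_entry *)a)->col;
    int cb = ((const csr_entry *)b)->col;
    return (ca > cb) - (ca < cb);
}

/* Sorts the column indexes of every row in ascending order, in place.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int csr_sort_columns(int m, const int *row_ptr, int *col_idx, double *val)
{
    int max_len = 0;

    /* Find the longest row so a single scratch buffer can be reused. */
    for (int i = 0; i < m; i++) {
        int len = row_ptr[i + 1] - row_ptr[i];
        assert(len >= 0);
        if (len > max_len)
            max_len = len;
    }
    if (max_len < 2)
        return 0; /* every row is trivially sorted */

    csr_entry *buf = malloc((size_t)max_len * sizeof *buf);
    if (buf == NULL)
        return -1;

    for (int i = 0; i < m; i++) {
        int start = row_ptr[i];
        int len = row_ptr[i + 1] - start;
        if (len < 2)
            continue;
        for (int k = 0; k < len; k++) {
            buf[k].col = col_idx[start + k];
            buf[k].val = val[start + k];
        }
        qsort(buf, (size_t)len, sizeof *buf, cmp_entry);
        for (int k = 0; k < len; k++) {
            col_idx[start + k] = buf[k].col;
            val[start + k] = buf[k].val;
        }
    }
    free(buf);
    return 0;
}
```

The sort is a one-time preprocessing step, so it should be done once after the matrix is built and kept outside the timed loop.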

thanks,
Chao