MKL "only" twice as fast on an 8-core machine

dar_io · 04-30-2008

Hello!

I wrote a Conjugate Gradient solver to solve big sparse systems and I parallelized it with SSE and OpenMP.

I'm working on an 8-core Mac Pro, and I reached a speedup of 9.0x in some cases (with respect to my non-parallel implementation).
I wanted to compare my results with MKL, so I used the dcg_init, dcg_check, dcg and dcg_get routines to solve my system.
For the crucial part of the Conjugate Gradient (the sparse matrix-vector multiplication) I used the mkl_dcsrmv routine.
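
To make clear which kernel I mean, here is a minimal sketch of a CSR matrix-vector product in plain C. It is only an illustration, assuming zero-based indexing and a single row-pointer array; the name csr_spmv and its argument layout are made up for this post, and it is neither my SSE code nor what mkl_dcsrmv does internally.

```c
/* y = A * x for an n-row sparse matrix A in CSR format (zero-based indices).
   row_ptr has n + 1 entries: row i owns entries row_ptr[i] .. row_ptr[i+1]-1
   of col_idx and val. Hypothetical helper, not an MKL routine. */
void csr_spmv(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    int i;
    /* Rows are independent, so they can be divided among threads.
       The pragma is simply ignored when OpenMP is not enabled. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; ++i) {
        double sum = 0.0;
        int k;
        for (k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];  /* indirect access into x */
        y[i] = sum;
    }
}
```

Each non-zero is read once and used for one multiply and one add, and x is accessed through col_idx, so the loop does very little arithmetic per byte loaded.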

I set the environment variable OMP_NUM_THREADS to 8 in my shell and I checked with a profiler that all 8 cores were running at 100%. Unfortunately, with MKL I get a speedup of "only" 2.0x (with respect to MKL running serially).

The matrix of my problem is square and sparse with ~2400000 non-zero elements and ~35000 rows.

Am I doing something wrong, or am I forgetting something? Or is the size of my problem too small to see a big performance improvement with MKL?

Thanks in advance!
Best!

dario

TimP · 04-30-2008

Sparse matrix multiplication typically is memory bandwidth limited, with a high cache miss rate. In such cases, you may find that performance saturates at 2 or 4 threads. You should test the KMP_AFFINITY environment variable settings, KMP_AFFINITY=compact (or scatter). compact and scatter are likely to be the same, except at 4 threads.
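
A rough way to see the bandwidth limit is to count the bytes that one product has to stream from memory. The sketch below assumes CSR storage with 8-byte values and 4-byte indices (the index width is an assumption, it is not stated in this thread) and ignores the traffic for the x and y vectors, so it gives a lower bound. The function names are made up for illustration, and the bandwidth argument is whatever you measure on your own machine.

```c
/* Lower bound on bytes streamed per CSR matrix-vector product,
   assuming 8-byte values and 4-byte column/row-pointer indices. */
double csr_stream_bytes(long nnz, long rows)
{
    return (double)nnz * (8.0 + 4.0) + (double)(rows + 1) * 4.0;
}

/* Upper bound on the flop rate (flops per second) when the product is
   limited by memory bandwidth: 2 flops per non-zero, all data re-read
   from memory on every product. */
double spmv_flops_bound(long nnz, long rows, double bandwidth_bytes_per_s)
{
    double flops = 2.0 * (double)nnz;
    return flops * bandwidth_bytes_per_s / csr_stream_bytes(nnz, rows);
}
```

For about 2400000 non-zeros and 35000 rows, csr_stream_bytes gives roughly 29 MB per product, or about 6 bytes per flop. Once a few threads together consume the available memory bandwidth, adding more threads does not raise this bound.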
It may be that disabling second/alternate sector prefetch could improve performance. I don't have access to Mac-specific information on this. A few platforms may have a BIOS setup option. Without that, on Linux, it requires root privilege and an application to alter the MSR (model-specific register) settings.

dar_io · 05-01-2008

Thank you very much for your answer!
I will check and redo the tests following your advice!
Googling a bit, I discovered that "...thread affinity can have a dramatic effect on the application speed" (from intel.com).

Best,

dario