Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

No scaling for mkl_dcsrsymv

admin4
Beginner
534 Views
Hello,
I am wondering whether the function mkl_dcsrsymv actually benefits from threading. Here is a snippet of code that multiplies a sparse symmetric matrix by a dense vector:
#include &lt;stdio.h&gt;
#include &lt;omp.h&gt;
#include "mkl.h"

/* Multiply a sparse symmetric matrix (upper triangle in CSR) by a dense
 * vector, timing 1000 calls with one and with two OpenMP threads. */
void spblas_multSymm(const int n, const int *const ptr, const int *const ind,
                     const double *const val, const double *const x,
                     double *const y)
{
    const char U = 'U';
    double t0, t1, t_single, t_dual;
    int iii;

    omp_set_num_threads(1);
    t0 = omp_get_wtime();
    for (iii = 0; iii < 1000; iii++)
        mkl_dcsrsymv(&U, &n, val, ptr, ind, x, y);
    t1 = omp_get_wtime();
    t_single = t1 - t0;

    omp_set_num_threads(2);
    t0 = omp_get_wtime();
    for (iii = 0; iii < 1000; iii++)
        mkl_dcsrsymv(&U, &n, val, ptr, ind, x, y);
    t1 = omp_get_wtime();
    t_dual = t1 - t0;

    printf("Time for 1 thread:  %f\n", t_single);
    printf("Time for 2 threads: %f  ratio %f\n", t_dual, t_dual / t_single);
}
I use the 1000-iteration loops so that most of the total time is spent in the actual calls to mkl_dcsrsymv.
As input data I used a linear system of dimension 120 (i.e., a 120x120 matrix with only about 5% nonzeros). I obtained no speed-up at all; the ratio was close to 1. Moreover, the Task Manager showed only about 50% CPU usage.
In a similar setting I tested the dgemm routine, which gave nearly ideal speed-ups at 100% CPU usage. I also obtained good scaling with the PARDISO solver. Finally, I put a call to cblas_dgemm right in front of the calls to mkl_dcsrsymv and again got nearly the optimal speed-up for dgemm. This leads me to suspect that the problem really lies in mkl_dcsrsymv itself.
Do you have any idea why mkl_dcsrsymv does not scale?
My environment is:
MKL 8.1
MS Visual Studio 2003 with Intel Compiler 9
WindowsXP Pro SP2
Athlon64x2 4400+

Thank you very much for your comments,
Bernhard
4 Replies
AndrewC
New Contributor III
Have you tried your experiment with much larger matrices?

120 x 120 is really very small. I would be very surprised to see any speed-up at that size and with so few sparse elements. For example, I am using the level 3 sparse routines with matrices of size 300,000 x 300,000, and I believe I see a speed-up of more than 50% on my dual-CPU machine.
admin4
Beginner

Thanks for your answer,

You're right - this problem size is rather small. Actually, I mixed things up a bit: I have a grid of 40x40 nodes with 3 DOFs each, hence 40 * 40 * 3 = 4800 unknowns, which makes the system matrix 4800x4800. This is still much smaller than the problem you mentioned, but I would expect some speed-up at that size. Maybe the level 2 Sparse BLAS routines aren't threaded at all?

AndrewC
New Contributor III
I suppose what is critical is how many non-zero elements are present. The routine may not have enough NNZ to benefit from threading. Can you boost the problem size up a lot?

I am using this very same routine in eigenvalue extraction (with ARPACK) for an acoustic solid-element vibration problem.

Have a look at CSparse (http://www.cise.ufl.edu/research/sparse/CSparse/), a package in C by Tim Davis that I thoroughly recommend as a useful toolkit for sparse matrices - it works well with MKL, apart from having 0-based indexing - and see how it does a sparse matrix / dense vector multiply.
admin4
Beginner

Hello,

I just learned from Intel Premier Support that the routine mkl_dcsrsymv has not been parallelized yet, so there is no need to look further.

Bernhard
