Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

mkl_dcsrgemv doesn't yield any significant speedup

agnonchik
Beginner
Hi,

Is it expected that mkl_dcsrgemv() doesn't outperform a hand-coded routine compiled with VS8?
On my matrix, MKL takes 45.011 ms, while the hand-coded routine takes 46.04 ms.

Another observation is that two threads are only about 10% faster than a single thread.

The matrix size is 2Mx2M, nnz=15M.

Thanks,
Agnonchik.
Sergey_K_Intel1
Employee
Quoting - agnonchik

Hi,

The routine you mentioned is a Level 2 routine. The performance behaviour you describe is typical for dense Level 2 BLAS and Sparse BLAS Level 2 routines when the size of the data exceeds the size of the cache. For large data sizes, the performance of dense Level 2 BLAS, as well as Sparse BLAS Level 2, depends mainly on memory bandwidth, because these routines perform only a small number of arithmetic operations per memory access. Restructuring your algorithm to use Level 3 routines can therefore improve performance, since Level 3 routines can reuse data in cache.

Unlike dense BLAS, performance of Sparse BLAS depends on matrix structure.

All the best
Sergey
agnonchik
Beginner
Thanks for your explanation!
Agnonchik.