matrix-vector multiplication

meinyahanhoongmail_c · 05-30-2009

hey,
I have been working on matrix-vector multiplication using MKL (Sparse BLAS routines), specifically the mkl_cspblas_dcsrgemv function. However, the code doesn't use both processors of my dual-core machine. I tried the test function cblas_dgemm to see whether it uses both processors, and found that when I set MKL_NUM_THREADS=2 in bash it uses both, while with MKL_NUM_THREADS=1 it uses only one. This doesn't work with mkl_cspblas_dcsrgemv, though.
Also, the user guide says that MKL is threaded in the Level 3 routines and in the Sparse BLAS matrix-vector and matrix-matrix multiply routines. What exactly does that mean, and how do I use threading with mkl_cspblas_dcsrgemv?

Thanks,
Regards.
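
For reference, here is a minimal sketch of the operation in question, a matrix-vector product y = A*x with a square matrix stored in zero-based CSR format. It is plain C and not MKL's implementation; the function name csr_matvec is made up for illustration, and the array names a, ia, ja follow the usual CSR naming. It shows why threading this operation is possible in principle: each row writes only its own y[i], so rows can be handed to different threads.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for an n-by-n sparse matrix A in zero-based CSR format:
   a[]  holds the nonzero values, row by row,
   ia[] holds n + 1 row offsets into a[] and ja[],
   ja[] holds the column index of each nonzero.
   Illustrative only; this is not how MKL implements the routine. */
void csr_matvec(int n, const double *a, const int *ia, const int *ja,
                const double *x, double *y)
{
    assert(n >= 0);
    /* Each row writes only y[i], so the rows are independent of one another. */
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = ia[i]; k < ia[i + 1]; ++k) {
            sum += a[k] * x[ja[k]];
        }
        y[i] = sum;
    }
}
```

Whether a threaded version actually runs faster is a separate question: the loop does only one multiply-add per nonzero loaded, so there is little arithmetic to share between cores.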

TimP · 05-30-2009

According to MKL docs, OpenMP threading was added recently in level 2 matrix-vector multiply (?gemv). The number of threads set in MKL_NUM_THREADS or OMP_NUM_THREADS is a maximum; the MKL function will use fewer threads if the size and shape of the arguments don't exceed thresholds set in the functions, so as not to lose performance by using too many threads. I haven't seen multiple threads in my own examples of ?gemv usage.
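
To make the "maximum, not a guarantee" point concrete, here is a rough sketch of the kind of size-based cap described above. The rule and the numbers are hypothetical and are not taken from MKL; choose_thread_count and min_work_per_thread are made-up names. It only illustrates how a library could end up using one thread on a small problem even when MKL_NUM_THREADS=2 is set.

```c
#include <assert.h>

/* Hypothetical heuristic, NOT MKL's actual rule: give each thread at least
   min_work_per_thread units of work and never exceed max_threads. */
int choose_thread_count(int max_threads, long work, long min_work_per_thread)
{
    assert(max_threads >= 1 && min_work_per_thread >= 1);
    long wanted = work / min_work_per_thread;
    if (wanted < 1) {
        wanted = 1;            /* always run with at least one thread */
    }
    if (wanted > max_threads) {
        wanted = max_threads;  /* the user's setting is an upper bound */
    }
    return (int)wanted;
}
```

With an assumed threshold of 10000 work units per thread, choose_thread_count(2, 625, 10000) gives 1 and choose_thread_count(2, 1000000, 10000) gives 2.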

meinyahanhoongmail_c · 05-31-2009

But in my case, where I've used mkl_cspblas_dcsrgemv, the order of the matrix is very large and the nonzeros make up around 1% of the entries or even less, yet I never see the second processor being used at any stage. If I can't use multi-threading here, then I wonder where and how threading is useful for speeding things up. That's why I am confused.
Also, with the test function cblas_dgemm, the performance is better with a single thread than with 2 threads. Why? And how can the performance be improved by using the multi-threading option in a better way?

TimP · 05-31-2009

Quoting - meinyahanhoongmail.com

how can the performance be improved by using the multi-threading option in a better way?

I haven't studied the organization of csrgemv, but it's certainly more difficult to benefit from threading than it would be for ?gemm.
I've noticed that MKL chooses 1 thread for dgemm when multiplying two 25x25 matrices, but gets a significant gain from 2 threads when multiplying a 25x25 matrix by a 25x100 one. Significantly more advantage may be obtained from multiple threads on problems in that size range by writing your own in-line matrix multiply, transposing one of the matrices so as to make the inner loops stride 1, and forcing unroll and jam. I've seen 14 Gflops on Core i7. It's not necessarily advantageous for a whole program; it evicts everything else from data cache on all cores.
For smaller problems, of course, ifort -O3 MATMUL can out-perform MKL, even though there is no OpenMP threading in current implementations of MATMUL.
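
The "transpose one of the matrices so the inner loop is stride 1" idea can be sketched as follows. This is a plain C illustration with row-major storage, so it is B that gets transposed (in Fortran's column-major layout it would be the other operand). The function name matmul_transposed is made up, and the sketch leaves out the threading and the unroll-and-jam step, which would be applied to the outer loops or left to the compiler.

```c
#include <assert.h>
#include <stdlib.h>

/* C = A * B with row-major A (m x k), B (k x n), C (m x n).
   Returns 0 on success, -1 if the scratch buffer cannot be allocated.
   Illustrative sketch only: no threading, no unroll and jam. */
int matmul_transposed(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    assert(m > 0 && n > 0 && k > 0);
    double *Bt = malloc((size_t)n * (size_t)k * sizeof *Bt);
    if (Bt == NULL) {
        return -1;
    }

    /* Bt is n x k: row j of Bt is column j of B. */
    for (int p = 0; p < k; ++p) {
        for (int j = 0; j < n; ++j) {
            Bt[(size_t)j * k + p] = B[(size_t)p * n + j];
        }
    }

    for (int i = 0; i < m; ++i) {
        const double *a_row = A + (size_t)i * k;
        for (int j = 0; j < n; ++j) {
            const double *b_row = Bt + (size_t)j * k;
            double sum = 0.0;
            /* Both operands are read contiguously (stride 1) here. */
            for (int p = 0; p < k; ++p) {
                sum += a_row[p] * b_row[p];
            }
            C[(size_t)i * n + j] = sum;
        }
    }

    free(Bt);
    return 0;
}
```

The iterations of the outer i loop are independent, so that is the natural place to split the work across threads. The extra transpose costs one pass over B, which matters less as m grows.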