About parallelism in BLAS level-1 routines and VML

Kim_L_ · ‎12-04-2014

Hi all,

I am running MKL BLAS routines with the Intel compiler (icpc). Following the example shipped with the compiler, I set the number of threads from 1 to 10 while running the dgemm routine for matrix-matrix multiplication, and I saw a speedup as the number of threads increased. However, for level-1 routines (e.g. cblas_zcopy, cblas_zaxpby), I didn't see any speedup from multithreading. Is there a multithreaded version of the level-1 routines? What about the VML routines? I also tried those (e.g. vzExp, vzMul) but saw no speedup at all in a multithreaded environment.

Vamsi_S_Intel · ‎12-05-2014

Hi Kim,

Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multi-threading for the BLAS level-1 zcopy and zaxpby functions.

Could you please provide the following info:

1. Vector dimensions used in invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

In MKL, even though the above level-1 functions are threaded, MKL may not always use multiple threads because the problem size may be too small to benefit from multi-threading.

Kim_L_ · ‎12-05-2014

Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.
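
To put these sizes in perspective, here is a rough footprint calculation, assuming the usual 16-byte layout of a double-precision complex element (two 8-byte doubles). The helper name `zvector_bytes` is hypothetical and is not part of MKL; the sketch says nothing about where MKL's internal threading threshold actually lies.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical helper: bytes occupied by n double-precision complex elements.
constexpr std::size_t zvector_bytes(std::size_t n) {
    return n * sizeof(std::complex<double>);
}

// Assumes sizeof(std::complex<double>) == 16, which holds on common platforms.
static_assert(sizeof(std::complex<double>) == 16, "unexpected complex layout");
static_assert(zvector_bytes(8192) == 128 * 1024, "8192 elements -> 128 KiB");
static_assert(zvector_bytes(12288) == 192 * 1024, "12288 elements -> 192 KiB");
```

Each vector is therefore only 128 KiB to 192 KiB, so a single pass over it is short, and the overhead of starting and synchronizing threads can plausibly outweigh any gain at these sizes.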

Kim_L_ · ‎12-05-2014

I am looking for a way to get the multithreaded version working so that I can speed up the code efficiently. My code contains many complicated calculations of the form

alpha*x*conj(y)

or

exp(a*x + b*y)*z

where alpha, a, and b are constants and x, y, and z are vectors. I am using vzExp and vzMul to implement the first operation, and cblas_zaxpby, vzExp, and vzMul for the second. Is there a better way to do this? Thanks.
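
One possible alternative, sketched here in plain C++ rather than with MKL calls, is to fuse each expression into a single pass over the data so that no intermediate vectors are written and re-read. The function names `scaled_mul_conj` and `exp_axpby_mul` are hypothetical, and the OpenMP pragma is an assumption that only takes effect when OpenMP is enabled at compile time. Whether this beats a chain of VML and BLAS calls depends on the vector length and the machine, so it should be measured. (VML also provides vzMulByConj, which may replace a separate conjugate-and-multiply step.)

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical helper: out[i] = alpha * x[i] * conj(y[i]), computed in one pass.
std::vector<cplx> scaled_mul_conj(cplx alpha,
                                  const std::vector<cplx>& x,
                                  const std::vector<cplx>& y) {
    assert(x.size() == y.size());
    std::vector<cplx> out(x.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        out[i] = alpha * x[i] * std::conj(y[i]);
    }
    return out;
}

// Hypothetical helper: out[i] = exp(a * x[i] + b * y[i]) * z[i], computed in one pass.
std::vector<cplx> exp_axpby_mul(cplx a, const std::vector<cplx>& x,
                                cplx b, const std::vector<cplx>& y,
                                const std::vector<cplx>& z) {
    assert(x.size() == y.size() && y.size() == z.size());
    std::vector<cplx> out(x.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        out[i] = std::exp(a * x[i] + b * y[i]) * z[i];
    }
    return out;
}
```

Both functions return a new vector and leave their inputs unchanged. Calling them once per expression over the full vectors, rather than once per intermediate step, is what keeps each element's data in registers between the arithmetic steps.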

Gennady_F_Intel · ‎12-06-2014

For these vector sizes, I would recommend looking at https://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material, slide #7, "Performance metric".