about parallelism on BLAS level-1 routines and VML

Kim_L_ — Fri, 05 Dec 2014 04:19:27 GMT

Hi all,

I am running MKL BLAS routines with the Intel compiler (icpc). Following the example shipped with the compiler, I set the number of threads from 1 to 10 while running the dgemm routine for matrix-matrix multiplication, and I saw a speedup as the number of threads increased. However, for level-1 routines (e.g. cblas_zcopy, cblas_zaxpby), I didn't see any speedup with multiple threads. Are the level-1 routines multithreaded at all? What about the VML routines? I also tried those (e.g. vzExp, vzMul) but saw no speedup at all in a multithreaded environment.

Vamsi_S_Intel — Fri, 05 Dec 2014 19:26:14 GMT

Hi Kim,

Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multithreading for the BLAS level-1 zcopy and zaxpby functions.

Could you please provide the following information:

1. Vector dimensions used in invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

In MKL, even though the above level-1 functions are threaded, MKL may not always use multiple threads because the problem size may be too small to benefit from multi-threading.

Kim_L_ — Fri, 05 Dec 2014 19:31:36 GMT

Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.
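To put those sizes in perspective, here is a rough sketch in plain C (not MKL code) of how much memory one double-precision complex vector occupies. The helper name `zvector_bytes` is hypothetical, and the figures assume `double complex` is 16 bytes, as it is on typical platforms.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: bytes occupied by a vector of n double-precision
 * complex elements (assumes sizeof(double complex) == 16, which holds on
 * typical platforms but is not guaranteed by the C standard). */
size_t zvector_bytes(size_t n)
{
    return n * sizeof(double complex);
}
```

Under that assumption, 8192 elements come to 131072 bytes (128 KiB) and 12288 elements to 196608 bytes (192 KiB) per vector, which is at or below the "tens of thousands" of elements mentioned above.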

Kim_L_ — Fri, 05 Dec 2014 19:37:02 GMT

I am looking for a multithreaded version that works, so that I can speed up the code efficiently. My code contains many complicated calculations of the form

alpha*x*conj(y)

exp(a*x + b*y)*z

where alpha, a, b are constants and x, y, z are vectors. I am using vzExp and vzMul to implement the first operation, and cblas_zaxpby, vzExp, and vzMul for the second one. Is there a better way to do this? Thanks.

Gennady_F_Intel — Sat, 06 Dec 2014 08:34:21 GMT

Here I would recommend looking at https://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material (foil #7, "Performance metric").