Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

about parallelism on BLAS level-1 routines and VML

Kim_L_
Beginner
1,198 Views

Hi all,

I am running BLAS routines in MKL with the Intel compiler (icpc). Following the example provided with the compiler, I tried setting the number of threads from 1 to 10 while running the dgemm routine for matrix-matrix multiplication, and I saw a speedup as I increased the number of threads. However, for level-1 routines (e.g. cblas_zcopy, cblas_zaxpby), I didn't see any speedup from the multithreaded version. I wonder whether there are multithreaded versions of the level-1 routines at all. What about the VML routines? I also tried those routines (e.g. vzExp, vzMul), but saw no speedup at all in a multithreaded environment.

4 Replies
Vamsi_S_Intel
Employee

Hi Kim,

Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multi-threading for the BLAS level-1 zcopy and zaxpby functions.

Could you please provide the following info:

1. Vector dimensions used when invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

Even though these level-1 functions are threaded in MKL, MKL may not always use multiple threads, because the problem size may be too small to benefit from multi-threading.

Kim_L_
Beginner

Vamsi Sripathi (Intel) wrote:

Hi Kim,

Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multi-threading for the BLAS level-1 zcopy and zaxpby functions.

Could you please provide the following info:

1. Vector dimensions used when invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

Even though these level-1 functions are threaded in MKL, MKL may not always use multiple threads, because the problem size may be too small to benefit from multi-threading.

Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.

Kim_L_
Beginner

Vamsi Sripathi (Intel) wrote:

Hi Kim,

Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multi-threading for the BLAS level-1 zcopy and zaxpby functions.

Could you please provide the following info:

1. Vector dimensions used when invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

Even though these level-1 functions are threaded in MKL, MKL may not always use multiple threads, because the problem size may be too small to benefit from multi-threading.

I am looking for the multithreaded versions to work so that I can speed up the code efficiently. In my calculation, I have many complicated operations of the form

alpha*x*conj(y)

or

exp(a*x + b*y)*z

where alpha, a, and b are constants and x, y, and z are vectors. I am using vzExp and vzMul to implement the first operation, and cblas_zaxpby, vzExp, and vzMul for the second one. Any better ideas on how to do this? Thanks.

Gennady_F_Intel
Moderator

Kim L. wrote:

Vamsi Sripathi (Intel) wrote:

Hi Kim,

Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multi-threading for the BLAS level-1 zcopy and zaxpby functions.

Could you please provide the following info:

1. Vector dimensions used when invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

Even though these level-1 functions are threaded in MKL, MKL may not always use multiple threads, because the problem size may be too small to benefit from multi-threading.

Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.

Here I would recommend looking at https://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material, foil #7, "Performance metric".
