
Hi all,

I am running BLAS routines from MKL with the Intel compiler (icpc). Following the example provided with the compiler, I tried setting the number of threads from 1 to 10 while running the dgemm routine for matrix-matrix multiplication, and I saw a speedup as the number of threads increased. However, for level-1 routines (e.g., cblas_zcopy, cblas_zaxpby), I didn't see any speedup from multithreading. Is there a multi-threaded version of the level-1 routines? What about the VML routines? I also tried those (e.g., vzExp, vzMul), but saw no speedup at all in a multithreaded environment.



Hi Kim,

Typically, one needs to use large vectors (on the order of tens of thousands of elements) to see a benefit from multi-threading for the BLAS level-1 zcopy and zaxpby functions.

Could you please provide the following information:

1. Vector dimensions used in invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

In MKL, even though the above level-1 functions are threaded, MKL may not always use multiple threads, because the problem size may be too small to benefit from multi-threading.


Vamsi Sripathi (Intel) wrote:

Could you please provide the following information:

1. Vector dimensions used in invoking cblas_zcopy and cblas_zaxpby

2. CPU architecture

Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.


Vamsi Sripathi (Intel) wrote:

In MKL, even though the above level-1 functions are threaded, MKL may not always use multiple threads, because the problem size may be too small to benefit from multi-threading.

I am looking for the multithreaded version to speed up the code efficiently. My calculation involves many complicated operations of the form

alpha*x*conj(y)

or

exp(a*x + b*y)*z

where alpha, a, and b are constants and x, y, and z are vectors. I am using vzExp and vzMul to implement the first operation, and cblas_zaxpby, vzExp, and vzMul for the second one. Is there a better way to do this? Thanks.


Kim L. wrote:

Thanks for your reply. I am running vectors of 8192 to 12288 elements on a computer equipped with an Intel® Xeon® Processor E5-2620.

Here I would recommend looking at https://software.intel.com/en-us/articles/intel-mkl-vmlvsl-training-material - foil #7, Performance metric.
