topic Re:processing time sequential vs threaded+mkl_set_num_threads_local(1) in Intel® oneAPI Math Kernel Library

processing time sequential vs threaded+mkl_set_num_threads_local(1)

may_ka — Tue, 07 Dec 2021 00:40:12 GMT

Hi,

I have an OpenMP-threaded application where MKL is called from several threads in parallel, for example:

#pragma omp parallel for num_threads(4)
for (int i = 0; i < 4; ++i) {
    int save = mkl_set_num_threads_local(1);
    dgemm(...);
    mkl_set_num_threads_local(save);
}

With the threaded MKL version, the number of local threads is deliberately set to one because the array sizes are very small.
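The save/restore pattern above can also be written as a small scope guard, so the previous setting is restored even if the code between the two calls exits early. The following is only a sketch: `ScopedThreadSetting` and `fake_set_num_threads_local` are hypothetical names introduced here, and the stand-in setter exists solely so the example builds without MKL. In a real build, `mkl_set_num_threads_local` from `mkl.h` would be passed instead.

```cpp
#include <cassert>
#include <utility>

// Generic scope guard: calls set(value) on construction, remembers the
// value it returned, and calls set(previous) again on destruction.
// Assumption: `set` behaves like mkl_set_num_threads_local, i.e. it
// returns the previous thread-local setting.
template <typename Setter>
class ScopedThreadSetting {
public:
    ScopedThreadSetting(Setter set, int value)
        : set_(std::move(set)), previous_(set_(value)) {}
    ~ScopedThreadSetting() { set_(previous_); }

    ScopedThreadSetting(const ScopedThreadSetting&) = delete;
    ScopedThreadSetting& operator=(const ScopedThreadSetting&) = delete;

    int previous() const { return previous_; }

private:
    Setter set_;    // declared first so it is initialized before previous_
    int previous_;
};

// Hypothetical stand-in for mkl_set_num_threads_local (not part of MKL).
// It keeps one value per thread and returns the old value on each call.
inline int fake_set_num_threads_local(int n) {
    thread_local int current = 0;  // 0 is taken to mean "use the global setting"
    int old = current;
    current = n;
    return old;
}
```

Inside the parallel loop body one would then write `ScopedThreadSetting<int (*)(int)> guard(fake_set_num_threads_local, 1);` before the BLAS call. The guard's destructor restores the saved value at the end of each iteration.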

I have noticed a substantial speed difference depending on whether the sequential version of mkl is linked (libmkl_sequential.a) or the threaded (libmkl_intel_thread.a). The program needs approximately 1.5 times more time when using threaded compared to using sequential.

I wonder whether anything can be done to have both versions run at the same speed.

Thanks.

Re: processing time sequential vs threaded+mkl_set_num_threads_local(1)

VidyalathaB_Intel — Tue, 07 Dec 2021 12:38:18 GMT

Hi,

Thanks for reaching out to us.

Could you please provide us with the following details so that we can work on it from our end?

MKL Version

Compiler used

OS Details & type of CPU

It would also help if you shared a complete sample reproducer (and the steps to reproduce the issue, if any) and described how you measure the time for both versions (sequential and threaded), so that we can get more insight into the issue.

Regards,

Vidya.

Re: processing time sequential vs threaded+mkl_set_num_threads_local(1)

may_ka — Tue, 07 Dec 2021 22:56:53 GMT

Hi,

thanks for your response.

mkl version was oneapi 2021.2.0
compiler was intel oneapi 2021.2.0 clang++
os: linux
cpu: i9-9980HK

I'll try to compile a stand-alone example.

Re: processing time sequential vs threaded+mkl_set_num_threads_local(1)

VidyalathaB_Intel — Wed, 08 Dec 2021 05:03:51 GMT

Hi Karl,

Thanks for providing the details.

We are working on your issue internally and will get back to you soon.

>>I'll try to compile a stand-alone example.

In the meantime, please share your example code when it is ready, as it would give us better insight into the issue.

Regards,

Vidya.

Re: processing time sequential vs threaded+mkl_set_num_threads_local(1)

Gennady_F_Intel — Mon, 13 Dec 2021 03:51:44 GMT

Karl,

Are there any reproducers here? Checking the problem on my end, I see approximately the same performance for moderate input problem sizes. The only difference we could see is in the case where the input problem size is < 100. In such cases, if we run the gemm many times and measure the minimum execution time, the performance is the same as well.
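The "run it many times and take the minimum" measurement can be sketched as follows. This is only an illustration under stated assumptions: `min_seconds` and `naive_gemm` are hypothetical helpers introduced here, and the naive multiply is a placeholder so the example builds without MKL. In the real comparison the timed callable would wrap the `cblas_dgemm` call.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>
#include <vector>

// Runs `kernel` `repetitions` times and returns the fastest single run in
// seconds. Taking the minimum filters out warm-up and scheduling noise,
// which dominates when the matrices are very small.
template <typename Kernel>
double min_seconds(Kernel&& kernel, int repetitions) {
    assert(repetitions > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repetitions; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        kernel();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Placeholder kernel (not MKL): column-major C = A * B for n x n matrices.
inline void naive_gemm(int n, const std::vector<double>& a,
                       const std::vector<double>& b, std::vector<double>& c) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += a[i + k * n] * b[k + j * n];
            }
            c[i + j * n] = sum;
        }
    }
}
```

A call such as `min_seconds([&] { naive_gemm(n, a, b, c); }, 1000)` returns the best of 1000 runs. Comparing that number between the sequential and the threaded link gives a measurement that is less sensitive to one-off startup costs than a single timed run.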

Re: processing time sequential vs threaded+mkl_set_num_threads_local(1)

may_ka — Mon, 13 Dec 2021 11:00:57 GMT

Hi,

thanks for looking into this.

Unfortunately, I cannot reproduce the problem with the program below, and the program where it turned up is not a small reproducer, so I will leave it at that for the time being.

Best

#include <iostream>
#include <random>
#include <sstream>
#include <string>
#include <vector>
#include "mkl.h"

int main(int argc, char** argv) {
    try {
        std::stringstream ss;
        std::string x, msg;
        if (argc != 4) {
            msg = "error. require 3 command line arguments: row dimension of matrix 1, column dimension of matrix 2, number of iterations.";
            throw msg;
        }

        // Parse the three command line arguments.
        long long i0nrow1 = 0, i0ncol2 = 0, niter = 0;
        x = argv[1]; ss << x; ss >> i0nrow1; ss.clear();
        x = argv[2]; ss << x; ss >> i0ncol2; ss.clear();
        x = argv[3]; ss << x; ss >> niter; ss.clear();
        if (i0nrow1 < 1 || i0ncol2 < 1 || niter < 1) {
            msg = "error. invalid dimensions";
            throw msg;
        }

        std::random_device rd;
        std::default_random_engine eng(rd());
        std::uniform_real_distribution<double> distr(0, 1);

        // One set of matrices per thread: a is symmetric (nrow1 x nrow1),
        // b and c are nrow1 x ncol2.
        std::vector<std::vector<double>> a, b, c;
        a.resize(8); b.resize(8); c.resize(8);
        for (int i = 0; i < 8; ++i) {
            a[i].resize(i0nrow1 * i0nrow1);
            b[i].resize(i0nrow1 * i0ncol2);
            c[i].resize(i0nrow1 * i0ncol2);
            // Fill by reference; iterating by value would leave the matrices zero-filled.
            for (auto& v : a[i]) { v = distr(eng); }
            for (auto& v : b[i]) { v = distr(eng); }
            for (auto& v : c[i]) { v = 0.0; }
        }

        // Each thread repeatedly calls dsymm with MKL restricted to one local thread.
        #pragma omp parallel for num_threads(8)
        for (int j = 0; j < static_cast<int>(a.size()); ++j) {
            for (long long i = 0; i < niter; ++i) {
                int save = mkl_set_num_threads_local(1);
                cblas_dsymm(CblasColMajor, CblasLeft, CblasUpper,
                            i0nrow1, i0ncol2,
                            1.0, a[j].data(), i0nrow1,
                            b[j].data(), i0nrow1,
                            0.0, c[j].data(), i0nrow1);
                mkl_set_num_threads_local(save);
            }
        }
    } catch (const std::string& msg) {
        std::cout << "an error has occurred: " + msg << std::endl;
        return 1;
    }
    return 0;
}

Re: processing time sequential vs threaded+mkl_set_num_threads_local(1)

Gennady_F_Intel — Fri, 17 Dec 2021 03:46:14 GMT

This thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.