Hi,
in an OMP-threaded application where MKL is called from several threads in parallel, for example:
#pragma omp parallel for num_threads(4)
for(int i = 0; i < 4; ++i){
    int save = mkl_set_num_threads_local(1);
    dgemm(...);
    mkl_set_num_threads_local(save);
}
When calling the threaded MKL version, the number of local threads is deliberately set to one because the array sizes are very small.
I have noticed a substantial speed difference depending on whether the sequential MKL library (libmkl_sequential.a) or the threaded one (libmkl_intel_thread.a) is linked: the program needs approximately 1.5 times as long with the threaded version as with the sequential one.
I wonder whether anything can be done to have both versions run at the same speed.
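For reference, the two configurations being compared usually differ in only one library on the link line. This is a sketch assuming a standard oneAPI layout on Linux with MKLROOT set by the environment scripts; the exact line for a given setup is best taken from the MKL Link Line Advisor:

```shell
# Sequential MKL:
clang++ -fopenmp app.o -L${MKLROOT}/lib/intel64 \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl

# Threaded MKL (Intel OpenMP runtime):
clang++ -fopenmp app.o -L${MKLROOT}/lib/intel64 \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
```

Only `-lmkl_sequential` vs `-lmkl_intel_thread -liomp5` differ between the two builds.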
Thanks.
Hi,
Thanks for reaching out to us.
Could you please provide us with the following details so that we can work on it from our end?
MKL Version
Compiler used
OS Details & type of CPU
It would also help if you could share a complete sample reproducer (and steps to reproduce the issue, if any), and describe how you are measuring the time for both versions (sequential & threaded), so that we can get more insight into the issue.
Regards,
Vidya.
Hi,
thanks for your response.
- MKL version: oneAPI 2021.2.0
- Compiler: Intel oneAPI 2021.2.0 clang++
- OS: Linux
- CPU: i9-9980HK
I'll try to compile a stand-alone example.
Hi Karl,
Thanks for providing the details.
We are working on your issue internally and will get back to you soon.
>>I'll try to compile a stand-alone example.
Meanwhile, you can share your example code so that it would help us to get better insights regarding the issue.
Regards,
Vidya.
Karl,
Are there any reproducers here? Checking the problem on my end, I see roughly the same performance for moderate and large input problem sizes. The only difference we could see was for input problems smaller than 100. In such cases, if we run the gemm many times and measure the minimum execution time, the performance is the same as well.
Hi,
thanks for looking into this.
Unfortunately, with the program below I cannot reproduce the problem, and the program where it originally showed up is not a small reproducer. So I will leave it at that for the time being.
Best
#include <string>
#include <iostream>
#include <sstream>
#include "mkl.h"
#include <vector>
#include <random>
int main(int argc, char** argv){
    try{
        std::stringstream ss; std::string x, msg;
        if(argc != 4){
            msg = "error. require 3 command line arguments: row dimension of matrix 1, column dimension of matrix 2, number of iterations.";
            throw msg;
        }
        long long i0nrow1 = 0, i0ncol2 = 0, niter = 0;
        x = argv[1]; ss << x; ss >> i0nrow1; ss.clear();
        x = argv[2]; ss << x; ss >> i0ncol2; ss.clear();
        x = argv[3]; ss << x; ss >> niter; ss.clear();
        if(i0nrow1 < 1 || i0ncol2 < 1 || niter < 1){
            msg = "error. invalid dimensions";
            throw msg;
        }
        std::random_device rd;
        std::default_random_engine eng(rd());
        std::uniform_real_distribution<double> distr(0, 1);
        // one set of matrices per OpenMP thread:
        // a is i0nrow1 x i0nrow1 (symmetric input), b and c are i0nrow1 x i0ncol2
        std::vector<std::vector<double>> a, b, c;
        a.resize(8); b.resize(8); c.resize(8);
        for(int i = 0; i < 8; ++i){
            a[i].resize(i0nrow1 * i0nrow1);
            b[i].resize(i0nrow1 * i0ncol2);
            c[i].resize(i0nrow1 * i0ncol2);
            // note: iterate by reference, otherwise the vectors are never filled
            for(auto& x : a[i]){ x = distr(eng); }
            for(auto& x : b[i]){ x = distr(eng); }
            for(auto& x : c[i]){ x = 0.0; }
        }
        #pragma omp parallel for num_threads(8)
        for(int j = 0; j < (int)a.size(); ++j){
            for(long long i = 0; i < niter; ++i){
                // restrict MKL to one thread while inside the parallel region
                int save = mkl_set_num_threads_local(1);
                cblas_dsymm(CblasColMajor, CblasLeft, CblasUpper,
                            i0nrow1, i0ncol2,
                            1.0, a[j].data(), i0nrow1,
                            b[j].data(), i0nrow1,
                            0.0, c[j].data(), i0nrow1);
                mkl_set_num_threads_local(save);
            }
        }
    }catch(std::string msg){
        std::cout << "an error has occurred: " + msg << std::endl;
        return 1;
    }
    return 0;
}
This thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.