Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Creating 8 VSLStreamStatePtr streams degrades MKL "dtrsm" performance (test code included, issue still open)

jian_l_1
Beginner
572 Views

I am refiling this issue because nobody followed up on my earlier post.

 

dtrsm becomes much slower after 8 streams are created with vslNewStream.

Here is the test code and the result on my machine (times are in seconds, as returned by difftime):

result:

Before new 8 VSLStreamStatePtr time: 1

After new 8 VSLStreamStatePtr time: 12

Code (c++):

 

#include <cstring>   // memcpy
#include <ctime>     // time, difftime
#include <iostream>  // std::cout

#include "mkl_vsl_functions.h"
#include "mkl_vsl_defines.h"
#include "mkl_blas.h"
#include "mkl_service.h"

 

 

int MmatrixARows=26;

int NmatrixBColumns=3;

double alpha=1;

int ldm=29;

double matrixA[87]={0.00311007,-1.12899e-05,-0.000141499,-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06};

double matrixB_ori[87]={-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06,-1.87786e-13,-2.49161e-13,-0.00079269};

double matrixB[87];

 

int main() {
    int sweepCount = 1000000;
    time_t time1, time2, time3, time4;

    // Baseline: time dtrsm before any VSL stream exists.
    time(&time1);
    for (int count = 0; count < sweepCount; ++count) {
        // dtrsm overwrites matrixB with the solution, so restore it every iteration.
        memcpy(matrixB, matrixB_ori, sizeof(double) * 87);
        dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
    }
    time(&time2);
    std::cout << " Before new 8 VSLStreamStatePtr time: " << difftime(time2, time1) << std::endl;

    // Create 8 streams, one per MT2203 generator.
    VSLStreamStatePtr ptr_[8];
    for (int i = 0; i < 8; ++i) {
        // Pass the address of the i-th element, not of the whole array.
        vslNewStream(&ptr_[i], VSL_BRNG_MT2203 + i, 1);
    }

    // Same loop again, now with the 8 streams alive.
    time(&time3);
    for (int count = 0; count < sweepCount; ++count) {
        memcpy(matrixB, matrixB_ori, sizeof(double) * 87);
        dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
    }
    time(&time4);
    std::cout << "After new 8 VSLStreamStatePtr time: " << difftime(time4, time3) << std::endl;

    return 0;
}

 

Thanks

0 Kudos
5 Replies
Ying_H_Intel
Employee

Hi Jian,

We can reproduce the problem and will investigate it later.

If possible, please also provide your OS and CPU information. If that information is private, please submit your question through software.intel.com/en-us/support/online-service-center instead.

Best Regards,

Ying 
 

 

 

0 Kudos
jian_l_1
Beginner

Hi, Ying

Thanks for your response.

My machine OS is:

CentOS release 6.8 (Final)

Kernel \r on an \m

 

CPU info:

processor         : 1

vendor_id         : GenuineIntel

cpu family         : 6

model                : 44

model name     : Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

stepping  : 2

microcode        : 19

cpu MHz           : 2925.814

cache size         : 12288 KB

physical id         : 0

siblings     : 12

core id               : 0

cpu cores : 6

apicid                 : 0

initial apicid      : 0

fpu             : yes

fpu_exception : yes

cpuid level        : 11

wp             : yes

flags                   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat dtherm tpr_shadow vnmi flexpriority ept vpid

bogomips          : 5851.80

clflush size        : 64

cache_alignment     : 64

address sizes   : 40 bits physical, 48 bits virtual

power management:

 

processor         : 23

vendor_id         : GenuineIntel

cpu family         : 6

model                : 44

model name     : Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

stepping  : 2

microcode        : 19

cpu MHz           : 2925.814

cache size         : 12288 KB

physical id         : 0

siblings     : 12

core id               : 10

cpu cores : 6

apicid                 : 21

initial apicid      : 21

fpu             : yes

fpu_exception : yes

cpuid level        : 11

wp             : yes

flags                   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat dtherm tpr_shadow vnmi flexpriority ept vpid

bogomips          : 5851.80

clflush size        : 64

cache_alignment     : 64

address sizes   : 40 bits physical, 48 bits virtual

power management:

 

0 Kudos
Gennady_F_Intel
Moderator

I believe the problem is in how you measure the execution time. Could you remove memcpy from the measured loop?

Otherwise, you are measuring not only trsm but memcpy performance too.
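One way to follow this suggestion without changing what the loop does is to keep the memcpy but start the clock only around the call being measured. The sketch below is a minimal illustration, not code from this thread: `time_work_only` is a hypothetical helper name, and it uses `std::chrono::steady_clock` rather than `time()`, which only has one-second resolution.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of MKL): runs `setup` and then `work` for
// `iterations` rounds and returns the seconds spent inside `work` only.
template <typename Setup, typename Work>
double time_work_only(std::size_t iterations, Setup setup, Work work) {
    using clock = std::chrono::steady_clock;
    assert(iterations > 0);
    clock::duration total = clock::duration::zero();
    for (std::size_t i = 0; i < iterations; ++i) {
        setup();                          // e.g. the memcpy that restores matrixB; not timed
        const auto start = clock::now();
        work();                           // e.g. the dtrsm call; timed
        total += clock::now() - start;
    }
    return std::chrono::duration<double>(total).count();
}
```

In the test program, `setup` would be a lambda wrapping the memcpy and `work` a lambda wrapping the dtrsm call. For a call as small as a 26x3 solve, the two clock reads per iteration may cost a noticeable fraction of the call itself, so the result is best read as a comparison between the "before" and "after" runs rather than as an absolute dtrsm time.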

0 Kudos
jian_l_1
Beginner

Hi, Gennady F

dtrsm overwrites matrixB with the result, so I use memcpy to re-initialize matrixB before each dtrsm call. I don't think memcpy causes the performance issue.
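For readers unfamiliar with the call, here is a small reference sketch of what the "Right", "Upper", "No transpose", non-unit-diagonal case computes and why B has to be restored each iteration. It is an illustration only and assumes column-major storage as in BLAS; `trsm_right_upper_ref` is a hypothetical name, and this is not how MKL implements dtrsm.

```cpp
#include <cassert>

// Illustrative reference only (not MKL's implementation).
// Solves X * A = alpha * B for X, where A is an n-by-n upper-triangular
// matrix with a non-unit diagonal and B is m-by-n, both column-major.
// B is overwritten in place with X, as dtrsm does.
void trsm_right_upper_ref(int m, int n, double alpha,
                          const double* a, int lda,
                          double* b, int ldb) {
    assert(lda >= n && ldb >= m);
    for (int j = 0; j < n; ++j) {
        // Start from alpha * B(:, j).
        for (int i = 0; i < m; ++i) b[i + j * ldb] *= alpha;
        // Subtract the contributions of the already solved columns k < j.
        for (int k = 0; k < j; ++k) {
            const double akj = a[k + j * lda];
            for (int i = 0; i < m; ++i) b[i + j * ldb] -= akj * b[i + k * ldb];
        }
        // Divide by the diagonal entry A(j, j).
        const double ajj = a[j + j * lda];
        for (int i = 0; i < m; ++i) b[i + j * ldb] /= ajj;
    }
}
```

With the values from the test program this corresponds to m = 26, n = 3 and lda = ldb = 29. Because the input B is destroyed by the solve, repeating the call without the memcpy would feed each result back in as the next right-hand side.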

Thanks

0 Kudos
Zhen_Z_Intel
Employee

Hi Jian,

We have already escalated this problem and started fixing the internal memory allocation issue. Because you submitted two tickets for the same issue, I will close this thread and post updates in the other one. Please watch ticket no. #748385. Thank you.

Best regards,
Fiona

0 Kudos