Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Creating 8 VSLStreamStatePtr streams degrades MKL "dtrsm" performance (test code included, issue still open)

jian_l_1
Beginner
828 Views

I want to generate random numbers in multiple threads, so I create one stream per thread with the following code:

#define nstreams 8
VSLStreamStatePtr stream[nstreams];

int k;
for ( k=0; k< nstreams; k++ )
{
    // One stream per array element; each uses a different member of the MT2203 family.
    vslNewStream( &stream[k], VSL_BRNG_MT2203+k, seed );
}

However, I found that once I create 8 VSLStreamStatePtr streams, the performance of other MKL functions is affected (about 5 times slower than normal). The affected call is:

dtrsm("Right", "Upper", "No transpose", "Nunit", ...);
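
For reference, a dtrsm call with side "Right", uplo "Upper" and no transpose solves X*A = alpha*B for X, where A is an n-by-n upper triangular matrix and B is m-by-n, and overwrites B with X. The following is a minimal, unoptimized sketch of that operation in plain C++, useful as a mental model or as a correctness check for small sizes. The function name trsm_right_upper_ref is hypothetical and not part of MKL; the sketch assumes column-major storage and a non-unit diagonal (BLAS reads "Nunit" by its first character, 'N').

```cpp
#include <cassert>
#include <cstddef>

// Reference version of dtrsm("Right", "Upper", "No transpose", "Non-unit", ...).
// Solves X * A = alpha * B for X and overwrites B with X.
// A: n-by-n upper triangular, column-major, leading dimension lda.
// B: m-by-n, column-major, leading dimension ldb.
void trsm_right_upper_ref(int m, int n, double alpha,
                          const double* A, int lda,
                          double* B, int ldb)
{
    assert(m >= 0 && n >= 0);
    assert(lda >= n && ldb >= m);

    for (int j = 0; j < n; ++j) {
        double* bj = B + static_cast<std::size_t>(j) * ldb;

        // Scale column j of B by alpha.
        for (int i = 0; i < m; ++i) bj[i] *= alpha;

        // Subtract the contributions of the already solved columns k < j.
        for (int k = 0; k < j; ++k) {
            const double akj = A[k + static_cast<std::size_t>(j) * lda];
            const double* bk = B + static_cast<std::size_t>(k) * ldb;
            for (int i = 0; i < m; ++i) bj[i] -= akj * bk[i];
        }

        // Divide by the diagonal entry (non-unit diagonal).
        const double ajj = A[j + static_cast<std::size_t>(j) * lda];
        for (int i = 0; i < m; ++i) bj[i] /= ajj;
    }
}
```

With the sizes used later in this thread (m = 26, n = 3, leading dimension 29) only the leading 3-by-3 upper triangle of A is read, so each call does very little arithmetic and fixed per-call overhead can dominate the timing.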

1 Solution
Gennady_F_Intel
Moderator
828 Views

Root cause analysis points to a problem in the internal mkl_serv_allocate() routine. The issue has been escalated. We will keep you updated on its status.

8 Replies
Zhen_Z_Intel
Employee
828 Views

Hi jian,

I have a few questions about your issue:

1. How did you test the performance? Did you enable MKL_VERBOSE, or did you write a program that measures wall-clock time?
2. What are your problem sizes for trsv, gemv and trsm, and what seed do you use for random data generation? Could you please provide a reproducer (just a sample case) that we can investigate?

Thanks.

Best regards,
Fiona

jian_l_1
Beginner
828 Views

Hi, Fiona

Thanks for your response.

I need to correct my report: ONLY dtrsm is affected by creating the 8 streams with vslNewStream.

Here are the test code and the result on my machine (times in seconds):

Before new 8 VSLStreamStatePtr time: 1
After new 8 VSLStreamStatePtr time: 12

Code (C++):

#include <cstring>   // memcpy
#include <ctime>     // time, difftime
#include <iostream>

#include "mkl_vsl_functions.h"
#include "mkl_vsl_defines.h"
#include "mkl_blas.h"
#include "mkl_service.h"

 

 

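int MmatrixARows=26;      // passed as dtrsm's m: number of rows of B
int NmatrixBColumns=3;    // passed as dtrsm's n: number of columns of B; A is n-by-n because side is "Right"
double alpha=1;           // scaling factor applied to B
int ldm=29;               // leading dimension used for both A and B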

double matrixA[87]={0.00311007,-1.12899e-05,-0.000141499,-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06};

double matrixB_ori[87]={-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06,-1.87786e-13,-2.49161e-13,-0.00079269};

double matrixB[87];

 

int main()
{
    const int sweepCount = 1000000;  // number of timed dtrsm calls per phase
    time_t time1, time2, time3, time4;

    // Phase 1: time dtrsm before any VSL stream exists.
    time(&time1);
    for(int count = 0;  count < sweepCount; ++count) {
        memcpy(matrixB, matrixB_ori, sizeof(double)*87);
        dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
    }
    time(&time2);
    std::cout<<" Before new 8 VSLStreamStatePtr time: "<<difftime(time2, time1)<<std::endl;

    // Create 8 MT2203 streams, one per array element.
    VSLStreamStatePtr                   ptr_[8];
    for(int i = 0; i < 8; ++i) {
        vslNewStream(&ptr_[i], VSL_BRNG_MT2203 + i, 1);
    }

    // Phase 2: time the same dtrsm loop after the streams were created.
    time(&time3);
    for(int count = 0;  count < sweepCount; ++count) {
        memcpy(matrixB, matrixB_ori, sizeof(double)*87);
        dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
    }
    time(&time4);
    std::cout<<"After new 8 VSLStreamStatePtr time: "<<difftime(time4, time3)<<std::endl;

    return 0;
}
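
The test above measures elapsed time with time() and difftime(), which typically resolve whole seconds only, so a short phase can read as 0 or 1. A sketch of a finer-grained timer built on std::chrono::steady_clock is shown below. The helper name time_loop_ms is hypothetical; the callable passed to it would wrap the memcpy plus dtrsm body of each loop.

```cpp
#include <cassert>
#include <chrono>

// Runs `body` `iterations` times and returns the elapsed wall-clock time
// in milliseconds, measured with a monotonic clock.
template <typename F>
double time_loop_ms(int iterations, F&& body)
{
    assert(iterations >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each timed phase could then be written as time_loop_ms(sweepCount, [&] { /* memcpy and dtrsm */ }), which reports fractions of a millisecond instead of whole seconds.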

 

Thanks

jian_l_1
Beginner
828 Views

Hi, Fiona

Can you reproduce the issue? Or do you need more details, such as the library version and CPU information?

 

Thanks. 

Zhen_Z_Intel
Employee
828 Views

Hi Jian,

I can reproduce your problem. We are investigating, and I will respond soon.

Best regards,
Fiona

Gennady_F_Intel
Moderator
829 Views

Root cause analysis points to a problem in the internal mkl_serv_allocate() routine. The issue has been escalated. We will keep you updated on its status.

jian_l_1
Beginner
828 Views

Hi, Gennady

I am glad to hear that. Thanks for your help.

Gennady_F_Intel
Moderator
828 Views

To mitigate the problem, we recommend setting MKL_DISABLE_FAST_MM=1 to disable MKL's memory buffering. For more details, please refer to the MKL User's Guide, section "Managing Performance and Memory".
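
The variable is normally exported in the shell before the program starts. If that is inconvenient, one possible sketch is to set it from inside the process at the very top of main(), before any MKL routine runs. This rests on the assumption that MKL reads the variable no earlier than its first call, which should be verified against the User's Guide. The helper names below are hypothetical, and setenv is POSIX (on Windows, _putenv_s would be the counterpart). MKL also provides mkl_disable_fast_mm() in mkl_service.h for the same purpose; it is not used here so that the sketch builds without MKL.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <stdlib.h>  // setenv (POSIX)

// Sets an environment variable for the current process and checks that it
// reads back with the requested value. Returns true on success.
bool set_process_env(const char* name, const char* value)
{
    assert(name != nullptr && value != nullptr);
    if (setenv(name, value, /*overwrite=*/1) != 0) {
        return false;
    }
    const char* readback = std::getenv(name);
    return readback != nullptr && std::strcmp(readback, value) == 0;
}

// Intended to be called first thing in main(), before any MKL call
// (assumption: MKL has not yet inspected its environment at that point).
bool disable_mkl_fast_mm_via_env()
{
    return set_process_env("MKL_DISABLE_FAST_MM", "1");
}
```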

jian_l_1
Beginner
828 Views

Hi, Gennady

Thanks for your help.

I tried setting MKL_DISABLE_FAST_MM=1, but it makes the dtrsm calls made before creating the 8 VSLStreamStatePtr streams as slow as the calls made after creating them.

My code is linked with Google's tcmalloc, which can be found in gperftools-gperftools-2.5.

Adding an unused map before the code helps reproduce the issue; in the original code the map is a static map.

#include <cstdlib>   // free
#include <cstring>   // memcpy
#include <ctime>     // time, difftime
#include <iostream>
#include <map>

#include "mkl_vsl_functions.h"
#include "mkl_vsl_defines.h"
#include "mkl_blas.h"
#include "mkl_service.h"

// Value type of the map declared in main().
class aaaValue
{
public:
    ~aaaValue()                                          { aaa(); }

private:
    // Frees the string member if one is set.
    void aaa() {
            if (val_.sval)
                free(val_.sval);
            val_.sval = 0;
    }

private:
    union { int ival; double dval; char* sval; }    val_;
};

int main(int argc, const char* argv[])
{
    // Unused map; its presence helps reproduce the slowdown.
    std::map<int, aaaValue> Map;

    int MmatrixARows=26;
    int NmatrixBColumns=3;
    double alpha=1;
    int ldm=29;

    double matrixA[87]={0.00311007,-1.12899e-05,-0.000141499,-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06};

    double matrixB_ori[87]={-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06,-1.87786e-13,-2.49161e-13,-0.00079269};

    double matrixB[87];

 

    int sweepCount = 100000;  // number of timed dtrsm calls per phase
    time_t time1, time2, time3, time4;

    // Phase 1: time dtrsm before any VSL stream exists.
    time(&time1);
    for(int count = 0;  count < sweepCount; ++count) {
        memcpy(matrixB, matrixB_ori, sizeof(double)*87);
        dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
    }
    time(&time2);
    std::cout<<" Before new 8 VSLStreamStatePtr time: "<<difftime(time2, time1)<<std::endl;

    // Create 8 MT2203 streams, one per array element.
    VSLStreamStatePtr                   ptr_[8];
    for(int i = 0; i < 8; ++i) {
        vslNewStream(&ptr_[i], VSL_BRNG_MT2203 + i, 1);
    }

    // Phase 2: time the same dtrsm loop after the streams were created.
    time(&time3);
    for(int count = 0;  count < sweepCount; ++count) {
        memcpy(matrixB, matrixB_ori, sizeof(double)*87);
        dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
    }
    time(&time4);
    std::cout<<"After new 8 VSLStreamStatePtr time: "<<difftime(time4, time3)<<std::endl;

    return 0;
}
