132 Views

random malloc error in d_commit_trig_transform

Greetings,

I'm experiencing a random malloc error in d_commit_trig_transform. The error is reproducible, but it occurs at a different point in each run, since this routine is called repeatedly. I'm currently testing on a Mac using the latest available Intel C++ compiler and MKL versions. The routine that contains the d_commit_trig_transform call is given below. The error occurs regardless of the number of cores used, but usually only after a few thousand calls. Does anything look suspicious in the code below? Any advice would be appreciated.

void TwoDCylRZPotSolver::RHSVectorDST()
{
    /*
     This method performs the discrete sine transform of the first kind (DST-I) on
     rhsvector in preparation to solve the linear tridiagonal system.  The transform
     is performed in chunk sizes of nz.  Due to the manner in which the DST is
     calculated, an input array (a) of size nz+2 must be used with a[0]=a[nz+1]=0,
     and a[1 to nz]=data.  A normalization factor of sqrt(2/(nz+1)) must be applied
     when copying the transformed data back into rhsvector.
     */
    
    double normfac=sqrt(2/double(nz+1));
    
#pragma omp parallel for
    for(int i=0; i<nrad; i++)
    {
        int error, ipar[128],n=nz+1,tt_type=0;
        double dpar[5*(nz+2)/2+2];
        DFTI_DESCRIPTOR_HANDLE handle = 0; //data structures used in transform
        
        double datatemp[nz+2];
        datatemp[0]=0;
        datatemp[nz+1]=0;
       
        d_init_trig_transform(&n,&tt_type,ipar,dpar,&error);
        d_commit_trig_transform(datatemp,&handle,ipar,dpar,&error);
 
        //copy data from rhsvector
        for(int j=0; j<nz; j++)
            datatemp[j+1]=rhsvector[i*nz+j];
        
        
        //perform transformation
        d_backward_trig_transform(datatemp,&handle,ipar,dpar,&error);
        
        //copy transformed data back to rhsvector
        for(int j=0; j<nz; j++)
            rhsvector[i*nz+j]=normfac*datatemp[j+1];
        
        free_trig_transform(&handle,ipar,&error);

        if(error != 0)
            cout<<"Error = "<<error<<" in free_trig_transform in method RHSVectorDST."<<endl;
    }
}
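
As a cross-check on the normalization described in the comment above, here is a minimal sketch of a naive O(nz^2) DST-I in plain C++ with no MKL dependency. The function name `dst1_orthonormal` is hypothetical and not part of MKL. It assumes the sum X_k = sum over j=1..nz of x_j sin(pi j k/(nz+1)), scaled by sqrt(2/(nz+1)), which makes the transform its own inverse. Comparing its output against the MKL result for a small nz may help confirm that the buffer layout (a[0]=a[nz+1]=0) and the scaling behave as intended. It does not address the malloc error itself.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not an MKL function): naive orthonormal DST-I.
// Input holds the nz interior data points; the zero end points are implicit.
std::vector<double> dst1_orthonormal(const std::vector<double>& x)
{
    const std::size_t nz = x.size();
    const double pi = std::acos(-1.0);
    const double normfac = std::sqrt(2.0 / double(nz + 1));

    std::vector<double> out(nz, 0.0);
    for (std::size_t k = 1; k <= nz; ++k) {
        double sum = 0.0;
        for (std::size_t j = 1; j <= nz; ++j)
            sum += x[j - 1] * std::sin(pi * double(j) * double(k) / double(nz + 1));
        out[k - 1] = normfac * sum;  // same scaling as normfac in RHSVectorDST
    }
    return out;
}
```

With this scaling, applying the routine twice should return the original data to within rounding error, which gives a simple sanity test for one chunk of rhsvector.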

Gennady_F_Intel
Moderator

We haven't seen such an issue with this version of MKL. Could you provide a standalone example so that we can reproduce and investigate the problem on our side?

Hello Gennady,

Thanks for the reply. I constructed a sample case (attached here), but it doesn't seem to fail. However, in the code snippet I originally posted, almost all of the variables are created in the for loop scope, so it isn't very different. One additional test I tried in the original code was to comment out the free_trig_transform() call. With that change the error no longer occurs, but the memory usage grows continually, as expected. An example of the error output is:

WI_EIBT(57555,0x101d86dc0) malloc: Incorrect checksum for freed object 0x7fd8014064e8: probably modified after being freed.
Corrupt value: 0xbd9018dc3302c455
WI_EIBT(57555,0x101d86dc0) malloc: *** set a breakpoint in malloc_error_break to debug

Gennady_F_Intel
Moderator

Thanks for the example. Do you see the problem only on specific CPU and OS types?

Gennady_F_Intel
Moderator

I see no problem on my side: win64, statically linked with MKL 2020, OpenMP linking, on AVX2 systems.

Starting loop #1
Starting loop #2
Starting loop #3
Starting loop #4
....

Starting loop #1353966
Starting loop #1353967
Starting loop #1353968
Starting loop #1353969
Starting loop #1353970
 

The full code fails (after some time) on both platforms I've tested, Mac (MKL 2020) and Linux (MKL 2019), dynamically linked with OpenMP. A more interesting behavior I've just found is that if I declare ipar as int* ipar = new int[128] and DON'T delete it at the end of the loop, the code runs (with the expected memory growth) until I kill it. If I "delete[] ipar" at the end of the loop, then the code fails as before. ipar should not be touched by anything outside of the loop it's declared in.
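
One way to probe the ipar behavior further might be to pad the array with guard words and check them before it is released, to see whether anything writes just outside its bounds. The following is only a sketch: the `GuardedInts` type, the guard width, and the sentinel value are hypothetical choices and not part of MKL. It detects writes into the padding while the buffer is alive. It cannot detect writes made after the buffer has been freed, for which a memory checker such as Valgrind or AddressSanitizer (where available for the toolchain) would be more thorough.

```cpp
#include <cstddef>
#include <vector>

// Arbitrary values chosen for this sketch.
const std::size_t kGuard = 16;        // guard words on each side
const int kSentinel = 0x5A5A5A5A;     // marker written into the guard words

// Hypothetical debugging helper: an int array padded with sentinels.
struct GuardedInts {
    std::vector<int> storage;
    std::size_t count;

    explicit GuardedInts(std::size_t n)
        : storage(n + 2 * kGuard, kSentinel), count(n)
    {
        // Zero the usable region; the padding keeps the sentinel value.
        for (std::size_t i = 0; i < n; ++i) storage[kGuard + i] = 0;
    }

    // Pointer to the usable region, to be passed wherever ipar is expected.
    int* data() { return storage.data() + kGuard; }

    // True if no guard word on either side has been overwritten.
    bool intact() const {
        for (std::size_t i = 0; i < kGuard; ++i) {
            if (storage[i] != kSentinel) return false;
            if (storage[kGuard + count + i] != kSentinel) return false;
        }
        return true;
    }
};
```

In the loop posted earlier, this would mean constructing `GuardedInts ipar(128)`, passing `ipar.data()` to the transform routines, and checking `ipar.intact()` after free_trig_transform. This assumes the integer type the library expects is a plain int, as in the posted code.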

In case you have the chance to try the full code, I'm attaching it here.  WI_EIBT.cpp is the main file.  Once compiled you place it in the same directory as the other files included in the archive (*.config, *.dat).  Execute the code via "./WI_EIBT WI_EIBT_100k_SC_dt2_6480V.config".  It will create a directory named "WI_EIBT_100k_SC_dt2_6480V" and start writing multiple datafiles in that directory and will also print status updates to the terminal.  It errors shortly after starting.
