MKL Rectangular matrix Inplace transpose performance issue

Gupta__Shubham1 · ‎05-13-2019

I need to transpose a very large matrix in place, and I am using mkl_simatcopy for this. However, I am seeing a performance problem with the in-place transpose. I am currently running on an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz with 72 physical cores, under Red Hat.

My observation is that only a single core is used while the transpose runs; the other cores are not used. I have tried environment variables such as MKL_NUM_THREADS, MKL_DYNAMIC="FALSE", etc. My compilation command is as follows:

gcc -std=c99    -m64 -I $MKLROOT/include transpose.c ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lstdc++ -lpthread -lm -ldl -o transpose.out

Timings obtained are as follows:

S.No.   No. of Rows   No. of Cols   Time (sec)
1       16384         8192          16
2       16384         32768         68
3       32768         65536         233

The data type is float. Please let me know if there is an efficient way to transpose in place, how the operation can be spread across multiple cores, or how the execution time can otherwise be reduced.

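Below is a code snippet from transpose.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>

/* initalizeData() and cpuSecond() are the poster's own helpers, not shown in this snippet. */
int main(int argc,char *argv[])
{
        if(argc!=3)
        {
                printf("Usage : exe NoofScan and NoofPix \n");
                exit(0);
        }
        unsigned long noOfScan = strtoul(argv[1],NULL,10);
        unsigned long noOfPix = strtoul(argv[2],NULL,10);
        /* %lu matches unsigned long; %d here was undefined behavior */
        printf("----->>>> noOfScan = %lu and noOfPix =%lu \n",noOfScan,noOfPix);
        size_t nEle = noOfScan * noOfPix;

        float *data = (float *)calloc(nEle,sizeof(float));
        if(data==NULL)
        {
                printf("Allocation of %zu floats failed \n",nEle);
                exit(1);
        }
        initalizeData(data,noOfScan,noOfPix);
        int nt = mkl_get_max_threads();   /* returns int, so %d is correct below */
        printf("No Of threads are = %d \n",nt);
        mkl_set_num_threads_local(nt);
        //mkl_set_num_threads(nt);
        double time1 = cpuSecond();
        /* in-place transpose of a row-major noOfScan x noOfPix matrix */
        mkl_simatcopy('R','T',noOfScan,noOfPix,1,data,noOfPix,noOfScan);
        printf("Time elapsed is %lf \n",cpuSecond()-time1);
        memset(data,0,nEle*sizeof(float));
        free(data);
        return 0;
}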

mecej4 · ‎05-13-2019

The bulk of the work of forming the transpose is performed in a library subroutine. What matters for performance is whether, and how well, that library subroutine is parallelized. Changing compiler options, or trying to parallelize the code that calls the transpose routine, cannot have any effect on the performance of the library subroutine itself.

If the MKL library that you use contains a parallel version of mkl_simatcopy(), its run time can be affected by setting MKL_NUM_THREADS, etc. However, the timings that you reported indicate that the time taken by the routine is proportional to the number of elements in the matrix being transposed, which is exactly what one expects from a serial version of the routine.
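The proportionality claim can be checked with a little arithmetic on the three timings reported above. The sketch below is only an illustration; the helper names ns_per_element and print_scaling are made up for this example. It divides each run time by the element count, giving roughly 110 to 130 ns per element in all three cases even though the matrix size grows 16-fold, which is what near-linear, single-threaded scaling looks like.

```c
#include <stdio.h>

/* Hypothetical helper: nanoseconds spent per matrix element. */
double ns_per_element(double rows, double cols, double seconds)
{
    return seconds * 1e9 / (rows * cols);
}

/* Prints the per-element cost for the three timings reported in the thread. */
void print_scaling(void)
{
    const double rows[] = {16384, 16384, 32768};
    const double cols[] = {8192, 32768, 65536};
    const double secs[] = {16, 68, 233};

    for (int i = 0; i < 3; ++i)
        printf("%.0f x %.0f: %.1f ns per element\n",
               rows[i], cols[i], ns_per_element(rows[i], cols[i], secs[i]));
}
```

Calling print_scaling() from a main() prints one line per run, so the three per-element costs can be compared directly.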

Gupta__Shubham1 · ‎05-13-2019

Is there a parallel version of mkl_simatcopy?

Gennady_F_Intel · ‎05-14-2019

No. You may submit a feature request on this topic to the Intel Online Service Center.
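Since the library routine is reported to be serial, one possible workaround is to write the transpose by hand. The sketch below is not an MKL routine and it is not in place: it is a tiled, out-of-place transpose parallelized with OpenMP, so it assumes there is enough memory for a second buffer of the same size. The function name transpose_blocked and the tile size of 32 are assumptions chosen for illustration, and no timings are claimed for it.

```c
#include <stddef.h>

/* Assumed tile size; tune it for the cache of the target machine. */
#define TILE 32

/* Hypothetical helper, not part of MKL: out-of-place transpose of a
   row-major rows x cols float matrix src into the cols x rows matrix dst.
   src and dst must not overlap. */
void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols)
{
    /* Tiles are distributed across threads; every dst element belongs to
       exactly one tile, so no two threads write the same location. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ib = 0; ib < rows; ib += TILE) {
        for (size_t jb = 0; jb < cols; jb += TILE) {
            /* Clamp the tile at the matrix edges. */
            size_t imax = (ib + TILE < rows) ? ib + TILE : rows;
            size_t jmax = (jb + TILE < cols) ? jb + TILE : cols;

            for (size_t i = ib; i < imax; ++i)
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

With gcc the pragma takes effect only when the file is compiled with -fopenmp; without that flag the same code runs serially. A true in-place transpose of a rectangular matrix has to follow permutation cycles and is considerably harder to parallelize, which is why this sketch trades memory for simplicity.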