I want to transpose a very large matrix in place, and I am using mkl_simatcopy for this. However, I am seeing a performance problem with the in-place transpose. I am running on an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz system with 72 physical cores and Red Hat Linux.
My observation is that when I perform the transpose, only a single core is used rather than all of them. I have tried environment variables such as MKL_NUM_THREADS and MKL_DYNAMIC="FALSE". My compilation command is as follows:
gcc -std=c99 -m64 -I $MKLROOT/include transpose.c ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lstdc++ -lpthread -lm -ldl -o transpose.out
Timings obtained are as follows:
S.No.  Rows   Cols   Time (s)
1      16384  8192   16
2      16384  32768  68
3      32768  65536  233
The data type is float. Please let me know whether there is a more efficient way to transpose in place, how the operation can be spread across multiple cores, or how else the execution time can be reduced.
Below is the code snippet from transpose.c:
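#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>

/* initalizeData() and cpuSecond() are helpers defined elsewhere in transpose.c (not shown). */
int main(int argc, char *argv[])
{
    if (argc != 3)
    {
        printf("Usage : exe NoofScan and NoofPix \n");
        exit(1);
    }
    unsigned long noOfScan = strtoul(argv[1], NULL, 10);
    unsigned long noOfPix = strtoul(argv[2], NULL, 10);
    printf("----->>>> noOfScan = %lu and noOfPix =%lu \n", noOfScan, noOfPix);
    size_t nEle = noOfScan * noOfPix;
    float *data = (float *)calloc(nEle, sizeof(float));
    if (data == NULL) { perror("calloc"); exit(1); }
    initalizeData(data, noOfScan, noOfPix);
    int nt = mkl_get_max_threads();
    printf("No Of threads are = %d \n", nt);
    mkl_set_num_threads_local(nt);
    //mkl_set_num_threads(nt);
    double time1 = cpuSecond();
    /* In-place transpose of a row-major noOfScan x noOfPix matrix: lda = noOfPix, ldb = noOfScan. */
    mkl_simatcopy('R', 'T', noOfScan, noOfPix, 1.0f, data, noOfPix, noOfScan);
    printf("Time elapsed is %lf \n", cpuSecond() - time1);
    memset(data, 0, nEle * sizeof(float));
    free(data);
    return 0;
}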
The bulk of the work of forming the transpose is performed in a library subroutine. What matters for performance is whether, and how well, that subroutine is parallelized. Changing compiler options, or parallelizing the code that calls the transpose routine, cannot have any effect on the performance of the library subroutine itself.
If the MKL library that you use contains a parallel version of mkl_simatcopy(), its run time can be affected by setting MKL_NUM_THREADS, etc. However, the timings that you reported indicate that the time taken by the routine is proportional to the number of elements in the matrix being transposed, which is exactly what one expects from a serial version of the routine.
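The proportionality claim can be checked directly from the three timings in the question. The following is only a small sketch of that arithmetic: the names `ns_per_element` and `print_scaling` are made up for illustration, and the figures are copied from the posted table rather than measured again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the timing table from the original post. */
struct timing {
    size_t rows;
    size_t cols;
    double seconds;
};

/* Time spent per matrix element, in nanoseconds (hypothetical helper). */
double ns_per_element(size_t rows, size_t cols, double seconds)
{
    assert(rows > 0 && cols > 0);
    /* Convert to double before multiplying so the product cannot overflow. */
    return seconds * 1e9 / ((double)rows * (double)cols);
}

/* Print the per-element cost for each reported run (hypothetical helper). */
void print_scaling(void)
{
    /* Figures copied from the timing table in the question. */
    const struct timing runs[] = {
        {16384,  8192,  16.0},
        {16384, 32768,  68.0},
        {32768, 65536, 233.0},
    };
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; ++i) {
        printf("%zu x %zu: %.0f s, %.1f ns/element\n",
               runs[i].rows, runs[i].cols, runs[i].seconds,
               ns_per_element(runs[i].rows, runs[i].cols, runs[i].seconds));
    }
}
```

Calling `print_scaling()` from a `main` shows the per-element cost staying within a narrow band, between roughly 108 and 127 ns, while the matrix grows 16-fold. That is what the linear scaling described above looks like in numbers.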
Is there a parallel version of mkl_simatcopy?
No. You may submit a feature request on this topic to the Intel Online Service Center.
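Until such a feature exists, one possible workaround is to do the transpose outside MKL. The sketch below is not an MKL routine and not an in-place algorithm: it assumes a second buffer of the same size is affordable and that the code is built with OpenMP (for example `gcc -fopenmp`). The function name `transpose_blocked` and the tile size are illustrative assumptions, not tuned values.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* tile edge in elements; an assumed value, tune per machine */

/*
 * Out-of-place transpose of a row-major rows x cols matrix `src`
 * into the row-major cols x rows matrix `dst`.
 * Each tile writes a disjoint part of `dst`, so the tiles can be
 * distributed over OpenMP threads without synchronization.
 */
void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols)
{
    assert(src != dst);  /* this sketch does not support in-place operation */

    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ib = 0; ib < rows; ib += TILE) {
        for (size_t jb = 0; jb < cols; jb += TILE) {
            /* Clamp the tile at the matrix edges. */
            size_t imax = ib + TILE < rows ? ib + TILE : rows;
            size_t jmax = jb + TILE < cols ? jb + TILE : cols;
            for (size_t i = ib; i < imax; ++i)
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

The thread count is then controlled by OpenMP (for example through OMP_NUM_THREADS) rather than by the MKL settings. The cost is memory: for the largest case in the question, 32768 x 65536 floats, the extra buffer is 8 GiB. Whether this actually beats the serial in-place routine on a given machine would have to be measured.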
