I want to transpose a very large matrix in place. I am using mkl_simatcopy, but I am seeing a performance problem with the in-place transpose. I am currently using an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz machine with 72 physical cores, running Red Hat.
My observation is that when I perform the transpose, only a single core is used rather than all cores. I have tried environment variables such as MKL_NUM_THREADS and MKL_DYNAMIC="FALSE". My compilation command is as follows:
gcc -std=c99 -m64 -I $MKLROOT/include transpose.c ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lstdc++ -lpthread -lm -ldl -o transpose.out
The timings obtained are as follows:
S.No.  Rows   Cols   Time (s)
1      16384  8192   16
2      16384  32768  68
3      32768  65536  233
The data type is float. Please let me know whether there is a more efficient way to transpose in place, how the transpose can be spread across multiple cores, or how this execution time can otherwise be reduced.
Below is a code snippet from transpose.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
/* initalizeData() and cpuSecond() are helper functions not shown in this snippet. */
int main(int argc, char *argv[])
{
    if (argc != 3)
    {
        printf("Usage : exe NoofScan and NoofPix \n");
        exit(0);
    }
    unsigned long noOfScan = atol(argv[1]);
    unsigned long noOfPix = atol(argv[2]);
    printf("----->>>> noOfScan = %lu and noOfPix =%lu \n", noOfScan, noOfPix);
    size_t nEle = noOfScan * noOfPix;
    float *data = (float *)calloc(nEle, sizeof(float));
    initalizeData(data, noOfScan, noOfPix);
    int nt = mkl_get_max_threads();
    printf("No Of threads are = %d \n", nt);
    mkl_set_num_threads_local(nt);
    //mkl_set_num_threads(nt);
    double time1 = cpuSecond();
    /* In-place transpose of a row-major noOfScan x noOfPix matrix, scaled by alpha = 1 */
    mkl_simatcopy('R', 'T', noOfScan, noOfPix, 1, data, noOfPix, noOfScan);
    printf("Time elapsed is %lf \n", cpuSecond() - time1);
    memset(data, 0, nEle * sizeof(float));
    free(data);
    return 0;
}
The bulk of the work of forming the transpose is performed in a library subroutine. What matters for performance is whether, and how well, that library subroutine is parallelized. Changing compiler options, or trying to parallelize the code that calls the transpose routine, cannot have any effect on the performance of the library subroutine itself.
If the MKL library that you use contains a parallel version of mkl_simatcopy(), its run time can be affected by setting MKL_NUM_THREADS, etc. However, the timings that you reported indicate that the time taken by the routine is proportional to the number of elements in the matrix being transposed, which is exactly what one expects from a serial version of the routine.
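The proportionality mentioned above can be checked directly from the three timings in the question. The sketch below divides each reported time by the element count; the function names are made up for this illustration and are not part of MKL.

```c
#include <stdio.h>

/* Nanoseconds spent per matrix element for one timing run. */
double ns_per_element(unsigned long rows, unsigned long cols, double seconds)
{
    return seconds * 1e9 / ((double)rows * (double)cols);
}

/* Print the per-element cost for the three runs reported in the question. */
void print_scaling(void)
{
    const unsigned long rows[] = {16384, 16384, 32768};
    const unsigned long cols[] = {8192, 32768, 65536};
    const double secs[] = {16.0, 68.0, 233.0};

    for (int i = 0; i < 3; ++i)
        printf("%lu x %lu: %.1f ns per element\n",
               rows[i], cols[i], ns_per_element(rows[i], cols[i], secs[i]));
}
```

All three runs come out between roughly 108 and 127 ns per element, so the cost per element stays about constant while the matrix grows by a factor of 16.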
Is there a parallel version of mkl_simatcopy?
No. You may submit a feature request on this topic to the Intel Online Service Center.
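Until such a routine exists, one possible workaround is a hand-written transpose that is threaded at the application level. The sketch below is not an MKL routine and is not in-place: it is a plain out-of-place tiled transpose using OpenMP, so it needs a second buffer of the same size (about 8 GiB extra for the 32768 x 65536 float case). The function name and the tile size of 64 are assumptions made for this illustration, and nothing here has been timed on the hardware from the question.

```c
#include <stddef.h>

#define TILE 64  /* tile edge in elements; an assumed value, tune per machine */

/* Out-of-place transpose of a row-major rows x cols matrix.
 * dst must hold cols x rows floats and must not overlap src.
 * Result: dst[j * rows + i] = src[i * cols + j].
 * Each row band of tiles writes a disjoint part of dst, so the outer
 * loop can be divided among threads without synchronization. */
void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols)
{
    #pragma omp parallel for schedule(static)
    for (size_t ib = 0; ib < rows; ib += TILE) {
        size_t imax = (ib + TILE < rows) ? ib + TILE : rows;
        for (size_t jb = 0; jb < cols; jb += TILE) {
            size_t jmax = (jb + TILE < cols) ? jb + TILE : cols;
            /* Copy one tile; both src and dst accesses stay within a
             * small window, which is friendlier to the caches. */
            for (size_t i = ib; i < imax; ++i)
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

With gcc this would be built with -fopenmp; without that flag the pragma is ignored and the same code runs serially. Whether it actually beats mkl_simatcopy on a given machine would have to be measured.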
