
Miska · ‎04-11-2013

Hello,

After some effort, I have managed to port my simulation code to the Xeon Phi. It usually runs with MPI on a cluster of PCs, but I wanted to see what kind of performance I would get from the Phi. The code relies quite heavily on MKL routines (ScaLAPACK, the MKL implementation of FFTW, a few GSL routines and so on).

I first ran the code as single-threaded MPI (as I do on my cluster) with 10-60 MPI processes. I then tried to improve performance by adding threading through a compiler option (my thinking was that although my code itself isn't threaded, the MKL routines might benefit from running multiple threads inside an MPI process, and that the compiler could also add some parallelism). This brought some performance gain.

I am, however, still far from the speed I get on the cluster. I then ran VTune Amplifier to see if it could help me find out where the bottlenecks are. I would have thought that some of my functions would appear as the culprits, and that I could start improving from there. But no, the main bottlenecks are the MPI and threading libraries, vmlinux and mkl_core (see attached screen capture). I have tried to play with affinity settings, but that hasn't brought me much. I did optimize the ratio of threads to MPI processes, and that improved the performance somewhat.
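As a rough illustration of the thread-to-process ratio search described above, the sketch below lists the MPI rank counts in the 10-60 range that divide the hardware threads evenly. The figures of 60 cores and 4 hardware threads per core are assumptions for illustration, not taken from the post, and the function name `hybrid_layouts` is hypothetical; adjust them for the actual card.

```python
def hybrid_layouts(cores=60, threads_per_core=4, min_ranks=10, max_ranks=60):
    """Return (ranks, threads_per_rank) pairs that use every hardware thread once.

    cores and threads_per_core are assumed values; change them for your card.
    """
    total_threads = cores * threads_per_core
    layouts = []
    for ranks in range(min_ranks, max_ranks + 1):
        # Keep only rank counts that split the hardware threads evenly.
        if total_threads % ranks == 0:
            layouts.append((ranks, total_threads // ranks))
    return layouts


if __name__ == "__main__":
    for ranks, threads in hybrid_layouts():
        print(f"{ranks:3d} MPI ranks x {threads:3d} threads per rank")
```

Each pair is only a candidate to benchmark. Which one is fastest depends on how much of the run time is spent inside threaded library calls versus MPI communication.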

So what does this mean? Are the MPI calls not very efficient, so that I should try to replace MPI calls with threading or rethink my parallelization scheme? How do I figure out which MPI calls are the most time-consuming? Or should I just concentrate on the functions which appear below (like phase2psfcube_float_function), and will improving them also have an impact on the library calls above?
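One low-tech way to see which communication calls dominate is to time each call site by hand and accumulate totals per call name. The sketch below shows only the bookkeeping pattern in Python; `CallTimer` and the section names are hypothetical, and `time.sleep` stands in for real communication. In an actual MPI code the same idea would typically be implemented with `MPI_Wtime` around each call or through the standard PMPI profiling interface.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class CallTimer:
    """Accumulate wall-clock time and call counts per named section."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.total[name] += time.perf_counter() - start
            self.count[name] += 1

    def report(self):
        """Return (name, total_seconds, calls) tuples, slowest first."""
        rows = [(n, self.total[n], self.count[n]) for n in self.total]
        return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    timer = CallTimer()
    for _ in range(5):
        with timer.section("alltoall (stand-in)"):
            time.sleep(0.02)   # placeholder for a collective call
        with timer.section("compute (stand-in)"):
            time.sleep(0.005)  # placeholder for local work
    for name, seconds, calls in timer.report():
        print(f"{name:22s} {seconds:8.4f} s over {calls} calls")
```

Time measured this way includes waiting on other ranks, so a call that tops the list may point to load imbalance rather than to a slow library routine.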

Thanks in advance for your help and ideas!

Miska

TaylorIoTKidd · ‎08-26-2013

Hi Miska,

Though loopprofiler and VTune overlap, ITAC has capabilities (focused on MPI) that neither loopprofiler nor VTune has.

VTune provides all the information that loopprofiler does. The argument for using loopprofiler is that it comes with the compiler, is simpler to use, and is more focused on what it does than VTune is. VTune has over an order of magnitude greater capability than loopprofiler, but sometimes all that capability just gets in the way during your initial analysis.

For more information on ITAC, see http://software.intel.com/en-us/intel-trace-analyzer.

Regards
--
Taylor

Miska · ‎08-26-2013

Thanks for the clarification, Taylor!

Vtune performance analysis results on Phi