
Understanding bad performances for an offloaded hybrid MPI-OpenMP application

mguy44
Beginner

Hello,

I have ported my application to nodes with two Intel Xeon Phi cards, and I notice that performance is very disappointing.
As it is an MPI application, I have to give some more information about how it works (sorry for the long text).

 

MPI parallelization is done with a classical 3D domain decomposition using a Cartesian grid of subdomains (one process per subdomain). The subdomains have ghost cells (26 neighbours) which need to be refreshed several times per time iteration (explicit multi-step scheme in time).
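For illustration, here is a minimal sketch of how such a decomposition is typically set up with MPI (the routine and variable names are hypothetical, not taken from the actual application, and the domain is assumed non-periodic):

!     Sketch: build a 3D Cartesian communicator and find the face
!     neighbours of the local subdomain.
      SUBROUTINE setup_decomposition (comm_cart)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER, INTENT(OUT) :: comm_cart
      INTEGER :: ierr, nprocs, rank
      INTEGER :: dims(3), coords(3), left, right
      LOGICAL :: periods(3)

      CALL MPI_COMM_SIZE (MPI_COMM_WORLD, nprocs, ierr)
      dims    = 0                  ! let MPI choose the process grid
      periods = .FALSE.            ! non-periodic domain (assumption)
      CALL MPI_DIMS_CREATE (nprocs, 3, dims, ierr)
      CALL MPI_CART_CREATE (MPI_COMM_WORLD, 3, dims, periods,
     &                      .TRUE., comm_cart, ierr)
      CALL MPI_COMM_RANK  (comm_cart, rank, ierr)
      CALL MPI_CART_COORDS(comm_cart, rank, 3, coords, ierr)
!     Face neighbours along direction 0; directions 1 and 2 are
!     obtained the same way, and the edge/corner neighbours can be
!     found with MPI_CART_RANK from coordinate offsets, giving the
!     26 neighbours whose ghost cells must be refreshed.
      CALL MPI_CART_SHIFT (comm_cart, 0, 1, left, right, ierr)
      END SUBROUTINE setup_decomposition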

Next, hybridization is done with one large OpenMP parallel region, and nearly everything is done in parallel through collapsed OpenMP nested loops. On a cluster of multicore nodes, everything runs fine.


With the offload technique, things have to change. As MPI communications are done on the host server, data has to be transferred between the host and the MIC (the goal is to use several servers with MIC cards). To minimize this amount of data, I create buffer arrays that I fill with the ghost-cell values for the faces that have neighbours, and only these buffer arrays travel between each MIC and the host. All large data arrays are copied to the MIC at the beginning and stay there until the end. There is no longer one large OpenMP parallel region, but several offloaded OpenMP regions with only the MPI communications between them. Everything is computed in parallel in these regions.
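As an illustration of this scheme (a sketch only; buf_send and buf_recv are hypothetical names and the real clause lists are longer), the large arrays can be placed on the coprocessor once with offload_transfer and then referenced with nocopy, so that only the small ghost-cell buffers travel at each exchange:

!     One-time transfer at start-up: allocate the large arrays on the
!     coprocessor and keep them resident until the end of the run.
!dir$ offload_transfer target(mic:my_mic)
     &        in (Q, QP : alloc_if(.true.) free_if(.false.))

!     Per-exchange offload: only the small ghost-cell buffers move;
!     the resident arrays are referenced through nocopy.
!dir$ offload begin target(mic:my_mic)
     &        in     (buf_recv)
     &        out    (buf_send)
     &        nocopy (Q, QP :REUSE,RETAIN)
!     ... unpack buf_recv into the ghost cells, compute, then pack
!     ... the faces to be sent into buf_send ...
!dir$ end offload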

 

I check ifort's optimization report to verify that every loop is vectorized. I use options and directives so that data are correctly aligned.
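For example (a sketch; the exact flags and directives used in the application are not shown in the post, and 64-byte alignment matches the MIC vector registers), alignment can be requested at compile time and then asserted in front of the hot loops:

!     Hypothetical compile line requesting 64-byte alignment of arrays
!     and a vectorization report:
!       ifort -O3 -align array64byte -vec-report6 ...
!     In the source, the compiler can then be told that an array seen
!     through a dummy argument is aligned (made-up names):
      SUBROUTINE zero_slab (RQ, Imin, Imax)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: Imin, Imax
      REAL(KIND=8), INTENT(OUT) :: RQ(Imin:Imax)
      INTEGER :: i
!dir$ assume_aligned RQ: 64
!dir$ vector aligned
      DO i = Imin, Imax
         RQ(i) = 0.0d0
      END DO
      END SUBROUTINE zero_slab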


I chose a mesh (800x300x100 cells) that fills the memory of two MIC cards (13 GB) in order to compare performance between the three versions of the code. On a 20-core node (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz) with two MIC (5110P) cards, I get:
  • on node, MPI (20 processes): 1195 sec, CPI rate: 0.83
  • on node, hybrid MPI/OpenMP (4 processes, each with 5 threads): 1460 sec, CPI rate: 0.77
  • offload MPI/OpenMP (4 processes, each with 118 offloaded threads, i.e. 2x118 threads on each MIC): 2715 sec, CPI rate: 4.0

With the offloaded version, I tried several combinations (number of processes, number of threads per process), and this one seems to give the best results I can get.

I use VTune to profile the behaviour of the application, and I notice quite a large amount of time consumed by the system or external libraries (please see the attached snapshots). Moreover, the CPI rates of the time-consuming routines are worse on the MIC than on the host.
Could you please give me some advice on what I should check in my application?

Thanks in advance.

   Guy.

P.S.: I had taken the wrong data for the elapsed times; it is now correct.

TimP
Honored Contributor III

If I understand correctly, you aren't using offload mode, only plain MPI "symmetric" mode. You will likely need to optimize the number of ranks and threads on the Phi experimentally. It would likely run better at 3 threads per core, so you would need KMP_AFFINITY=balanced to spread threads evenly across cores and pin threads to cores.

Your micsmc GUI views showing the core loading would be interesting. 

If you cut back on threads per core, your overall views of system activity in VTune will look worse, so you will need to select the threads which are performance critical.
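If the OpenMP threads are launched from offloaded regions rather than from native ranks, one way to express this is through the offload runtime's prefixed environment variables, for example (a sketch; 177 threads assumes 59 usable cores on a 5110P at 3 threads per core and a single rank per card, so the count has to be divided up when several ranks share a card):

export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY=balanced
export MIC_OMP_NUM_THREADS=177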

mguy44
Beginner

Hello Tim,

If I read the models described on Intel's web site:

The Intel® MPI Library supports the Intel® Xeon Phi™ coprocessor in 3 major ways:

  • The offload model where all MPI ranks are run on the main Xeon host, and the application utilizes offload directives to run on the Intel Xeon Phi coprocessor card,
  • The native model where all MPI ranks are run on the Intel Xeon Phi coprocessor card, and
  • The symmetric model where MPI ranks are run on both the Xeon host and the Xeon Phi coprocessor card.

I really use the first one, the offload model. There are no MPI processes on the Xeon Phi coprocessor card.

I tried what you proposed: using 177 threads per MPI process, keeping 4 MPI processes.

Here is the output of MICSMC with the core loading.

I will look at the VTune output ASAP, but all the OpenMP threads do the same work.

TimP
Honored Contributor III

Then you need to set up your mpirun to pin the threads of each rank to a distinct group of cores, e.g.

-genv MIC_ENV_PREFIX=PHI

Rank 0: -env PHI_KMP_PLACE_THREADS=0o,15c,3t

Rank 1: 15o,15c,3t

...

Colfax training did cite a case where MPI offload was said to outperform symmetric mode on KNC. Still, much experimentation is required, with attention to detail.

From what you said, I couldn't guess how you have been setting num_threads. Maybe omp parallel num_threads? Having multiple copies of OpenMP sharing all MIC cores no doubt looks "interesting" in VTune.

TimP
Honored Contributor III

I guess if just 2 MPI ranks are offloading to each MIC, it might be 0o,30c,3t and 30o,30c,3t for 90 threads per rank.

I don't know all the reasons, but offload performance often peaks at fewer threads per core.
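Put together, a launch line for one host with two cards and two ranks per card might look roughly like this (a sketch only; the executable name is a placeholder, and using OFFLOAD_DEVICES to bind each pair of ranks to one card is an assumption, not something from the discussion above):

mpirun -genv MIC_ENV_PREFIX PHI \
  -n 1 -env OFFLOAD_DEVICES 0 -env PHI_KMP_PLACE_THREADS 0o,30c,3t  ./app.exe : \
  -n 1 -env OFFLOAD_DEVICES 0 -env PHI_KMP_PLACE_THREADS 30o,30c,3t ./app.exe : \
  -n 1 -env OFFLOAD_DEVICES 1 -env PHI_KMP_PLACE_THREADS 0o,30c,3t  ./app.exe : \
  -n 1 -env OFFLOAD_DEVICES 1 -env PHI_KMP_PLACE_THREADS 30o,30c,3t ./app.exe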

TimP
Honored Contributor III

Micsmc bar graphs appear to show only 1 thread running per core.

mguy44
Beginner

Thank you for your advice. I set the running environment with the following variables:

export MIC_ENV_PREFIX=MIC
export MIC_OMP_NESTED=1
export MIC_OMP_NUM_THREADS=118
export MIC_OMP_MAX_ACTIVE_LEVELS=2
export MIC_KMP_AFFINITY=scatter

I'll look at the pinning strategy you suggest.

 

TimP
Honored Contributor III

As you said you used omp collapse, it is surprising that you now say you are using nested OpenMP. I suspect the KMP_PLACE_THREADS shortcut may not work then. It does appear that your current affinity assignment, with multiple thread pools on the same subset of physical thread contexts, is a large part of the problem.

mguy44
Beginner

Hello Tim,

maybe I didn't explain things correctly, and I apologize for that. Here is an example from my application: the beginning of an offloaded region, with the first nested loops parallelized using the OpenMP directive OMP DO with the COLLAPSE clause:

!dir$ offload begin target(mic:my_mic)
     &         in     (Imin, Imax, Jmin, Jmax, Kmin, Kmax)
     &         in     (iVisc, iTurb, iInvF, iENO)
     &         in     (Gm1, Da, Db, Dc, Sg, Re, Prt, Cs)
     &         nocopy (Q, QP :REUSE,RETAIN)
     &         nocopy (CXq, CYq, CZq :REUSE,RETAIN)
     &         nocopy (Xa, Yb, Zc :REUSE,RETAIN)
     &         nocopy (RQ :REUSE,RETAIN)
     &         nocopy (g_QPn, DELT, distwall, Csd:REUSE,RETAIN)
     &         nocopy (TV_i, sensor :REUSE,RETAIN)
!$OMP PARALLEL DEFAULT (NONE)
!$OMP&SHARED  (Q, TV_i, RQ, Xa, Yb, Zc, Gm1, Da, Db, Dc, Sg, Re, Prt)
!$OMP&SHARED  (QP, CXq, CYq, CZq, g_QPn, sensor)
!$OMP&SHARED  (Kmin, Kmax, Jmin, Jmax, Imin, Imax, iInvF, iENO)
!$OMP&SHARED  (ivisc, iTurb)
!$OMP&PRIVATE (i, j, k, L)
!$OMP DO COLLAPSE(3)
      DO k = Kmin, Kmax
         DO j = Jmin, Jmax
            DO L = 1, 5
               DO i = Imin, Imax
                  RQ(i,L,j,k) = zero
               END DO
            END DO
         END DO
      END DO
!$OMP END DO

...

...

!$OMP END PARALLEL
!dir$ end offload

 

mguy44
Beginner

Hello,

I recently installed Advisor XE and, thanks to it, I found some ways to improve the vectorization of a key routine. It now takes less than 2000 sec to do the job, which is very interesting but still far from the MPI version.

 
