HPCG and mpirun on 2S Xeon-EP

JJoha8 — Thu, 24 Nov 2016 19:29:36 GMT

Hi,

I'm having a problem measuring the socket performance of a Xeon E5 v3 chip with Cluster-on-Die (COD) active; the problem also persists when I run on both sockets of a 2S Xeon-EP node. I am using the latest benchmark package from the intel.com website (l_mklb_p_2017.1.013) and am following the advice from https://software.intel.com/en-us/node/599526.

I tried running the following two commands.

On a 2S E5-2697 v4 with COD deactivated:

I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -n 2 env OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1,0 bin/xhpcg_avx2 --n=168

To evaluate the socket performance of a 14-core Haswell-EP with COD active:

mpiexec.hydra -genv I_MPI_PIN_DOMAIN node -genv I_MPI_PI disable -np 1 -env OMP_NUM_THREADS 7 -env KMP_AFFINITY 'verbose,granularity=fine,proclist=[0,1,2,3,4,5,6],explicit' ../l_mklb_p_2017.1.013/hpcg/mybins/xhpcg_avx2_mpi --n=128 --t=0 : -np 1 -env OMP_NUM_THREADS 7 -env KMP_AFFINITY 'verbose,granularity=fine,proclist=[7,8,9,10,11,12,13],explicit' ../l_mklb_p_2017.1.013/hpcg/mybins/xhpcg_avx2_mpi --n=128 --t=0
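For reference, the two explicit proclist values in the second command follow a simple pattern: rank r with t threads gets the logical CPU ids r*t through r*t + t - 1. A minimal sketch of that pattern is below. The helper name proclist is hypothetical, and the sketch assumes the cores of each cluster domain carry consecutive ids, which depends on how the system enumerates its CPUs and should be checked on the actual node.

```shell
# proclist RANK THREADS
# Print the KMP_AFFINITY proclist for one MPI rank, assuming each rank
# owns a consecutive block of THREADS logical CPU ids (an assumption,
# not something the benchmark itself guarantees).
proclist() {
    rank=$1
    threads=$2
    first=$((rank * threads))
    last=$((first + threads - 1))
    list=$first
    i=$((first + 1))
    while [ "$i" -le "$last" ]; do
        list="$list,$i"
        i=$((i + 1))
    done
    printf '[%s]\n' "$list"
}

proclist 0 7   # prints [0,1,2,3,4,5,6]
proclist 1 7   # prints [7,8,9,10,11,12,13]
```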

The problem is that xhpcg does not finish when running with more than one MPI process per node. With one process per node and OMP_NUM_THREADS set to x, top shows x*100% CPU load for that process (top probably isn't the best tool for estimating core utilisation of a memory-bound application, but it's a good enough indicator). With more than one MPI process, CPU utilisation drops to 100% for each MPI process instead of x*100%.
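One way to look at the 100% figure more closely is to print the CPU affinity of every thread of a rank. The sketch below reads the Cpus_allowed_list field from /proc, so it assumes a Linux node; the helper name thread_cpus is hypothetical. If all threads of a rank list the same single CPU, they are time-sharing one core, which would be consistent with the utilisation seen in top. That is only one possible explanation and is not confirmed here.

```shell
# thread_cpus PID
# For every thread of PID, print its thread id and the list of CPUs the
# kernel allows it to run on (Linux only, read from /proc).
thread_cpus() {
    for t in /proc/"$1"/task/*; do
        printf '%s %s\n' "${t##*/}" \
            "$(awk '/^Cpus_allowed_list:/ { print $2 }' "$t/status")"
    done
}

# Usage while the benchmark is running, with <pid> taken from top:
#   thread_cpus <pid>
```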

I tried some debugging, but I'm neither an MPI nor an HPCG expert, so some help would be appreciated. I set
HPCG_OPTS = -DHPCG_DEBUG -DHPCG_DETAILED_DEBUG and compiled a new binary. Running it with two MPI processes produces two log files; the last entries of each are:
broadep2:IMPI_IOMP_AVX2 iwi325$ tail hpcg_log_n168_2p_1t_2016.11.24.20.16.15.txt
Process 0 of 2 has 9261 rows.
Process 0 of 2 has 230702 nonzeros.
Process 0 of 2 has 4741632 rows.
Process 0 of 2 has 126758012 nonzeros.
Process 0 of 2 has 592704 rows.
Process 0 of 2 has 15687500 nonzeros.
Process 0 of 2 has 74088 rows.
Process 0 of 2 has 1922000 nonzeros.
Process 0 of 2 has 9261 rows.
Process 0 of 2 has 230702 nonzeros.

broadep2:IMPI_IOMP_AVX2 iwi325$ tail hpcg_log_n168_2p_1t_1_2016.11.24.20.16.15.txt
Process 1 of 2 has 9261 rows.
Process 1 of 2 has 230702 nonzeros.
Process 1 of 2 has 4741632 rows.
Process 1 of 2 has 126758012 nonzeros.
Process 1 of 2 has 592704 rows.
Process 1 of 2 has 15687500 nonzeros.
Process 1 of 2 has 74088 rows.
Process 1 of 2 has 1922000 nonzeros.
Process 1 of 2 has 9261 rows.
Process 1 of 2 has 230702 nonzeros.
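The two tails above are identical apart from the rank number, which suggests both ranks stopped logging at the same stage. A small sketch for checking this mechanically is below; the helper name same_tail is hypothetical, and the log line format is assumed to be exactly the "Process R of N has ..." form shown above.

```shell
# same_tail LOG_A LOG_B
# Succeed if the last 10 lines of both logs are equal once the rank
# number in "Process R of N" is masked out.
same_tail() {
    a=$(tail -n 10 "$1" | sed 's/^Process [0-9]* of/Process N of/')
    b=$(tail -n 10 "$2" | sed 's/^Process [0-9]* of/Process N of/')
    [ "$a" = "$b" ]
}

# Usage with the two log files from this run:
#   same_tail hpcg_log_n168_2p_1t_2016.11.24.20.16.15.txt \
#             hpcg_log_n168_2p_1t_1_2016.11.24.20.16.15.txt && echo same
```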

Maybe there's an MPI_Barrier and they're waiting for a third, nonexistent MPI process?

Any help would be appreciated.

JJoha8 — Fri, 25 Nov 2016 11:14:09 GMT

I could narrow the issue down to the MPI runtime version.

With composer_xe_2015.1.133 it works;
with 2016.3.210 it does not.
