Hi,
I'm encountering a problem when trying to measure socket performance of a Xeon E5 v3 chip with COD active but the problem also persists when I try to run on two sockets of a 2S Xeon-EP node. I am using the latest benchmark from the intel.com website (l_mklb_p_2017.1.013) and am following the advice from https://software.intel.com/en-us/node/599526.
I tried running

I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -n 2 env OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1,0 bin/xhpcg_avx2 --n=168

on a 2S E5-2697 v4 with COD deactivated, and

mpiexec.hydra -genv I_MPI_PIN_DOMAIN node -genv I_MPI_PIN disable \
  -np 1 -env OMP_NUM_THREADS 7 -env KMP_AFFINITY 'verbose,granularity=fine,proclist=[0,1,2,3,4,5,6],explicit' ../l_mklb_p_2017.1.013/hpcg/mybins/xhpcg_avx2_mpi --n=128 --t=0 : \
  -np 1 -env OMP_NUM_THREADS 7 -env KMP_AFFINITY 'verbose,granularity=fine,proclist=[7,8,9,10,11,12,13],explicit' ../l_mklb_p_2017.1.013/hpcg/mybins/xhpcg_avx2_mpi --n=128 --t=0

to evaluate the socket performance of a 14-core Haswell-EP with COD active.
The problem is that xhpcg does not finish when running with more than one MPI process per node. With one process per node and OMP_NUM_THREADS set to x, top shows x*100% CPU load for that process (I know top probably isn't the best tool to estimate core utilisation for a memory-bound application, but it's a good enough indicator). With more than one MPI process, the CPU utilisation drops to 100% for each MPI process instead of x*100%.
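As a somewhat more precise check than top, per-thread core placement can be inspected with ps. This is a generic sketch, not specific to HPCG; the pattern "xhpcg" is an assumption about the benchmark's process name:

```shell
# List every thread together with the core (PSR) it last ran on and its
# CPU usage. With correct pinning, each OpenMP thread should sit on its
# own core rather than all threads stacking on the same one.
# awk keeps the header line (NR==1) plus any line matching the pattern.
ps -eLo pid,tid,psr,pcpu,comm | awk 'NR==1 || /xhpcg/'
```

Alternatively, `top -H` shows individual threads, and enabling the "last used CPU" field there gives a live view of thread placement.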
I tried some debugging, but I'm neither an MPI nor an HPCG expert. I set
HPCG_OPTS = -DHPCG_DEBUG -DHPCG_DETAILED_DEBUG
and compiled a new binary. Running with two MPI processes produces two log files; the last entries of each are:
broadep2:IMPI_IOMP_AVX2 iwi325$ tail hpcg_log_n168_2p_1t_2016.11.24.20.16.15.txt
Process 0 of 2 has 9261 rows.
Process 0 of 2 has 230702 nonzeros.
Process 0 of 2 has 4741632 rows.
Process 0 of 2 has 126758012 nonzeros.
Process 0 of 2 has 592704 rows.
Process 0 of 2 has 15687500 nonzeros.
Process 0 of 2 has 74088 rows.
Process 0 of 2 has 1922000 nonzeros.
Process 0 of 2 has 9261 rows.
Process 0 of 2 has 230702 nonzeros.
broadep2:IMPI_IOMP_AVX2 iwi325$ tail hpcg_log_n168_2p_1t_1_2016.11.24.20.16.15.txt
Process 1 of 2 has 9261 rows.
Process 1 of 2 has 230702 nonzeros.
Process 1 of 2 has 4741632 rows.
Process 1 of 2 has 126758012 nonzeros.
Process 1 of 2 has 592704 rows.
Process 1 of 2 has 15687500 nonzeros.
Process 1 of 2 has 74088 rows.
Process 1 of 2 has 1922000 nonzeros.
Process 1 of 2 has 9261 rows.
Process 1 of 2 has 230702 nonzeros.
Maybe there's an MPI_Barrier and the processes are waiting for a third, non-existent MPI rank?
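Two generic ways to test that hypothesis while the run hangs (command-line sketches only; `<PID>` is a placeholder for one rank's process ID, and the mpiexec arguments are abbreviated):

```shell
# Ask the Intel MPI runtime for verbose startup/pinning output; at level 5
# it also reports which cores each rank was mapped to.
I_MPI_DEBUG=5 mpiexec.hydra -n 2 env OMP_NUM_THREADS=18 bin/xhpcg_avx2 --n=168

# Attach gdb to one hung rank and dump all thread backtraces; if the rank
# is parked in an MPI wait or barrier call, it will show up at the top of
# the stack.
gdb -p <PID> -batch -ex 'thread apply all bt'
```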
Any help would be appreciated.
I was able to narrow the issue down to the MPI runtime version:
composer_xe_2015.1.133 works;
2016.3.210 does not.