
Very poor scaling of xhpl from MKL on a Skylake Gold system

Reddy__Raghu

Briefly:
Trying to run LINPACK using: /apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic

I am including the full path to include version information.
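For context on the task/thread combinations below: with HPL, P x Q in HPL.dat must equal the number of MPI ranks, and the MKL hybrid binary fills the remaining cores on each node with MKL/OpenMP threads. A typical launch looks roughly like the following (a sketch only; the actual job script is not shown here, so the mpirun options and thread counts per run are assumptions):

# Illustrative launch for a 1-node, P=1 x Q=2 run (2 MPI ranks x 20 threads each);
# thread counts and launcher options are assumptions, not the exact settings used below
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
mpirun -np 2 -ppn 2 \
    /apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic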

On a single node the performance is very good, but the efficiency falls off very quickly as the node count increases.

Filtered results are shown below; hopefully they include all the information needed, but please feel free to ask for clarifications.

Number of nodes used is the first column.

Mem is approximately the amount of memory per node used for that problem.

The next few columns are from HPL output.

The last column simply divides the Tflops number by the expected peak for that many nodes.
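In case it helps to reproduce that column, the per-node peak is the usual cores x FLOP/cycle x clock estimate; a rough sketch below, where the 32 DP FLOP/cycle (two AVX-512 FMA units per core on Gold 61xx parts) and the 1.6 GHz AVX-512 clock are illustrative assumptions rather than the exact peak figure used in the tables:

# Sketch of the efficiency column; the peak inputs here are assumptions
CORES=40           # 2 sockets x 20 cores per node
FLOPC=32           # 8 DP lanes x 2 (FMA) x 2 AVX-512 FMA units per core
GHZ=1.6            # assumed AVX-512 base clock; adjust for the actual SKU
awk -v c=$CORES -v f=$FLOPC -v g=$GHZ -v tf=2.2 -v n=1 'BEGIN {
    peak = c*f*g/1000                      # Tflops per node
    printf "peak=%.2f Tflops/node  eff=%.1f%%\n", peak, 100*tf/(n*peak)
}'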

Single-node runs with different MPI task/thread combinations:
v288.% grep -H ^WR HPCC-xhpl_intel64_dynamic-001* | ~/S2/TO-05/hpl/hpcc-eta
 Nodes  Mem:           N    NB     P     Q               Time          Tflops   Efficiency
   001 70gb:       93312   384     1     1             240.97               2.2  109.8%
   001 70gb:       93312   384     1     2             274.15               2.0   96.5%
   001 70gb:       93312   384     2     4             414.74               1.3   63.8%
v288.%

Two-node runs with different combinations of MPI tasks/threads:

v288.% grep -H ^WR HPCC-xhpl_intel64_dynamic-002* | ~/S2/TO-05/hpl/hpcc-eta
 Nodes  Mem:           N    NB     P     Q               Time          Tflops   Efficiency
   002 35gb:       93312   384     2     2             146.27               3.7   90.4%
   002 35gb:       93312   384     4     4             302.49               1.8   43.7%
v288.%

16-node runs:
v288.% grep -H ^WR HPCC-xhpl_intel64_dynamic-016* | ~/S2/TO-05/hpl/hpcc-eta
 Nodes  Mem:           N    NB     P     Q               Time          Tflops   Efficiency
   016 20gb:      198912   336     4     8            1040.38               5.0   15.4%
   016 20gb:      199680   384     4     8            1163.11               4.6   13.9%
   016 70gb:      371712   384     4     8            4386.57               7.8   23.8%
   016 70gb:      307200   384    20    32            5433.52               3.6   10.9%
v288.%
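For these runs, P x Q is the total MPI rank count: 4 x 8 = 32 ranks over 16 nodes is 2 ranks/node (with the other cores presumably filled by threads), while 20 x 32 = 640 ranks is 40 ranks/node, i.e. pure MPI. A quick one-liner to get ranks/node from the table columns, if useful:

awk -v P=4 -v Q=8 -v nodes=16 'BEGIN { printf "%d ranks total, %g ranks/node\n", P*Q, P*Q/nodes }'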

Each node has 2 sockets of 20-core Skylake Gold processors, and the nodes are connected with an EDR InfiniBand network.

We are not seeing this problem with the HPCC benchmark, but we would like to use the Intel MKL version to get better performance.
With the HPCC code (please ignore the second column below):

v288.% grep -H ^WR HPCC-A-2-40-5x16* | ~/S2/TO-05/hpl/hpcc-eta
 Nodes  Mem:           N    NB     P     Q               Time          Tflops   Efficiency
     2  256:      132096   256     5    16             476.47             3.2   78.7%
     2  384:      132096   384     5    16             457.87             3.4   81.9%
v288.%

v288.% grep -H ^WR HPCC-A-64-** | ~/S2/TO-05/hpl/hpcc-eta
 Nodes  Mem:           N    NB     P     Q               Time          Tflops   Efficiency
    64  256:      748032   256    40    64            3378.20            82.6   63.0%
    64  384:      748032   384    40    64            3016.94            92.5   70.6%
v288.%
 

Appreciate any suggestions on things that we can do to get better performance!

Thanks!
