Linpack Performance problem over V3 Processors

Reza_M_1 · 07-29-2015

Hello all,

I have a project with two types of compute nodes: Xeon v2 processors (16 nodes) and Xeon v3 processors (64 nodes). I installed Intel Parallel Studio on one of the v2 compute nodes and got very good Linpack results (92% efficiency) across the 16 v2 compute nodes, all in one blade enclosure. Later I recompiled with Intel Parallel Studio on the v3 processors and ran the same benchmark on the new Xeon v3 nodes, but efficiency dropped to roughly 75%. On a single v3 compute node I get about 87% efficiency, but across all v3 compute nodes (or even just the 16 nodes inside one blade chassis) it drops to 74%.

My guess is that there is a network problem, but this is a blade chassis with an internal InfiniBand switch, so it is hard to blame the network.

Here is the HPL.dat configuration:

N        : 667200
NB       :     192
PMAP     : Column-major process mapping
P        :      16
Q        :      24
PFACT    :   Right
NBMIN    :       4
NDIV     :       2
RFACT    :   Crout
BCAST    :   1ring
DEPTH    :       0
SWAP     : Binary-exchange
L1       : no-transposed form
U        : no-transposed form
EQUIL    : no
ALIGN    :    8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

Column=003456 Fraction=0.005 Mflops=11203980.43
Column=006720 Fraction=0.010 Mflops=10914878.75
Column=010176 Fraction=0.015 Mflops=10727462.35
Column=013440 Fraction=0.020 Mflops=10716597.44
Column=016704 Fraction=0.025 Mflops=10572442.04

.

T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R4      667200   192    16    24           19331.88            1.02425e+04
HPL_pdgesv() start time Mon Jul 27 06:33:08 2015

HPL_pdgesv() end time   Mon Jul 27 11:55:20 2015

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008648 ...... PASSED
----------------------------------------------------------------------------------------------------------

I even tried different values of N, but the results are the same.

I appreciate your advice.

Best Regards,

Reza

Murat_G_Intel · 07-29-2015

Hello Reza,

It looks like your problem size per MPI process is around 8 GB. If you have more than 8 GB per MPI rank, you can increase the problem size. For example, you can use N = 1,000,000 if you have 32 GB of memory available to each MPI rank.
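The per-rank figure can be checked with a quick calculation. The sketch below assumes that the N x N matrix of 8-byte doubles dominates memory use and is spread evenly over the P x Q process grid; the function names are illustrative only and are not part of MP LINPACK.

```python
def hpl_matrix_bytes(n: int) -> int:
    """Bytes needed for an N x N matrix of double-precision values."""
    return 8 * n * n


def gb_per_rank(n: int, p: int, q: int) -> float:
    """Approximate matrix memory per MPI rank in GB (10**9 bytes).

    Assumes an even split over the P x Q grid and ignores HPL workspace.
    """
    return hpl_matrix_bytes(n) / (p * q) / 1e9


if __name__ == "__main__":
    # Values from the HPL.dat in the first post: N=667200, P=16, Q=24.
    print(f"N=667200:  {gb_per_rank(667200, 16, 24):.2f} GB per rank")
    # The larger problem size suggested above, on the same grid.
    print(f"N=1000000: {gb_per_rank(1_000_000, 16, 24):.2f} GB per rank")
```

For the posted configuration this gives roughly 9.3 GB (about 8.6 GiB) per rank, in line with the estimate above.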

Also, which MP LINPACK version are you using? We recommend the offload version for the best MP LINPACK performance.

Thank you.

Reza_M_1 · 07-29-2015

Hello Murat,

Thanks for the reply. The system configuration per node is as follows:

CPU: 2 × Xeon 2690 v3

Memory: 256 GB

Each node therefore has 24 cores, which means I have about 10 GB of memory per MPI rank. I used 80% of memory for this test. I also tried different values of N, but the number you mentioned runs out of memory.
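The "80% of memory" rule can be turned into a problem size directly. This is a minimal sketch, assuming that "256 GB" means 256 GiB per node, that only the N x N matrix of doubles is counted, and that N should be a multiple of NB; `max_problem_size` is a hypothetical helper, not something shipped with HPL.

```python
import math


def max_problem_size(nodes: int, mem_per_node_gib: int,
                     fraction: float, nb: int) -> int:
    """Largest N (multiple of nb) whose matrix fits in the memory budget.

    Assumption: the budget is `fraction` of total memory and the matrix
    needs 8 * N * N bytes; HPL workspace and OS overhead are not modelled.
    """
    total_bytes = nodes * mem_per_node_gib * 2**30
    budget_elements = int(total_bytes * fraction) // 8
    n = math.isqrt(budget_elements)   # largest N with N*N <= budget
    return (n // nb) * nb             # round down to a multiple of NB


if __name__ == "__main__":
    # 16 nodes with 256 GiB each, 80% of memory, NB=192 as in the first post.
    print(max_problem_size(16, 256, 0.80, 192))
```

Under these assumptions the N = 667200 used in the first run corresponds to about 81% of total memory, and N = 1,000,000 would need about 8 TB, roughly twice what 16 such nodes provide.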

Please advise,

Regards,

Reza

VipinKumar_E_Intel · 07-30-2015

Reza,

Have you tried tweaking the block size NB? Do you have Xeon Phi cards as well?

Vipin

Reza_M_1 · 07-30-2015

Hi Vipin,

I tried NB values up to 512. I use an HPL calculator, which suggested using up to 256. Should I change it? I don't have Xeon Phi.

Today I tried one more blade enclosure (16 nodes) and got 87% with the same configuration, so it is important to understand why the same benchmark gives lower performance on some blades.

Reza

Reza_M_1 · 07-30-2015

There are 5 enclosures: one has v2 processors and the other 4 have v3 processors. On my recommendation the customer purchased Intel Parallel Studio, and I compiled and ran mp_linpack with different configurations for both CPU versions. Here is a quick report:

Enclosure 1 (16 nodes, 2 × 2690 v2 per node, 256 GB memory per node) = 92% efficiency

Enclosure 2 (16 nodes, 2 × 2690 v3 per node, 256 GB memory per node) = 74% efficiency

Enclosure 3 (16 nodes, 2 × 2690 v3 per node, 256 GB memory per node) = 74% efficiency

Enclosure 4 (16 nodes, 2 × 2690 v3 per node, 256 GB memory per node) = ??% efficiency (still running without any results after a long time, which is not normal)

Enclosure 5 (16 nodes, 2 × 2690 v3 per node, 256 GB memory per node) = 86% efficiency

As you can see, I got 87% efficiency for enclosure 5, but with the same configuration and the same binary the other v3 enclosures perform much worse. I tested every compute node individually, and single-node efficiency is in the 85-90% range. Based on my experience, I therefore suspect a problem with the IB switches inside the blade chassis. Otherwise everything is identical, and I see no technical reason for the difference in results.

Enclosure 4 is still running the benchmark without producing any results, which is not normal. I cancelled and reran the benchmark a couple of times, but it never printed any results (only the benchmark details appear on screen).

Do you also suspect network performance?

Reza_M_1 · 08-01-2015

Hello,

Please review the latest test; I removed some of the lower-performing nodes.

Why does performance decrease in the later fractions?

[root@hpc064 intel64-en2]#mpirun -genv I_MPI_FABRICS shm:ofa --perhost 24 -f hostv3-en2 -n 288 ./xhpl
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark --   October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N        : 513536
NB       :     256
PMAP     : Column-major process mapping
P        :      16
Q        :      18
PFACT    :   Right
NBMIN    :       4
NDIV     :       2
RFACT    :   Crout
BCAST    :   1ring
DEPTH    :       0
SWAP     : Binary-exchange
L1       : no-transposed form
U        : no-transposed form
EQUIL    : no
ALIGN    :    8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

Column=002816 Fraction=0.005 Mflops=9061105.31
Column=005376 Fraction=0.010 Mflops=9046932.22
Column=007936 Fraction=0.015 Mflops=9058204.79
Column=010496 Fraction=0.020 Mflops=9050563.70
Column=013056 Fraction=0.025 Mflops=9049741.88
Column=015616 Fraction=0.030 Mflops=9039969.70
Column=018176 Fraction=0.035 Mflops=9047304.00
Column=020736 Fraction=0.040 Mflops=9041378.08
Column=023296 Fraction=0.045 Mflops=9046721.77
Column=025856 Fraction=0.050 Mflops=9041185.93
Column=028416 Fraction=0.055 Mflops=9041261.15
Column=030976 Fraction=0.060 Mflops=9043317.44
Column=033536 Fraction=0.065 Mflops=9037247.78
Column=036096 Fraction=0.070 Mflops=9040218.57
Column=038656 Fraction=0.075 Mflops=9036754.32
Column=041216 Fraction=0.080 Mflops=9038795.75
Column=043776 Fraction=0.085 Mflops=9034784.48
Column=046336 Fraction=0.090 Mflops=9036850.03
Column=048896 Fraction=0.095 Mflops=9033007.65
Column=051456 Fraction=0.100 Mflops=9031942.34
Column=054016 Fraction=0.105 Mflops=9031955.22
Column=056576 Fraction=0.110 Mflops=9029712.93
Column=059136 Fraction=0.115 Mflops=9029494.84
Column=061696 Fraction=0.120 Mflops=9026757.86
Column=064256 Fraction=0.125 Mflops=9027979.55
Column=066816 Fraction=0.130 Mflops=9025197.74
Column=069376 Fraction=0.135 Mflops=9025893.78
Column=071936 Fraction=0.140 Mflops=9023208.61
Column=074496 Fraction=0.145 Mflops=9021544.10
Column=077056 Fraction=0.150 Mflops=9021935.55
Column=079616 Fraction=0.155 Mflops=9019616.16
Column=082176 Fraction=0.160 Mflops=9018907.89
Column=084736 Fraction=0.165 Mflops=9016691.08
Column=087552 Fraction=0.170 Mflops=9016819.55
Column=090112 Fraction=0.175 Mflops=9014083.51
Column=092672 Fraction=0.180 Mflops=9012658.02
Column=095232 Fraction=0.185 Mflops=9012202.90
Column=097792 Fraction=0.190 Mflops=9010632.84
Column=100352 Fraction=0.195 Mflops=9010460.85
Column=102912 Fraction=0.200 Mflops=9008019.04
Column=105472 Fraction=0.205 Mflops=9007770.80
Column=108032 Fraction=0.210 Mflops=9005530.13
Column=110592 Fraction=0.215 Mflops=9005546.17
Column=113152 Fraction=0.220 Mflops=9003123.45
Column=115712 Fraction=0.225 Mflops=9001852.54
Column=118272 Fraction=0.230 Mflops=9001127.23
Column=120832 Fraction=0.235 Mflops=8999577.23
Column=123392 Fraction=0.240 Mflops=8999016.71
Column=125952 Fraction=0.245 Mflops=8997200.79
Column=128512 Fraction=0.250 Mflops=8996267.96
Column=131072 Fraction=0.255 Mflops=8994390.42
Column=133632 Fraction=0.260 Mflops=8994104.68
Column=136192 Fraction=0.265 Mflops=8991677.98
Column=138752 Fraction=0.270 Mflops=8990688.20
Column=141312 Fraction=0.275 Mflops=8989904.19
Column=143872 Fraction=0.280 Mflops=8988248.44
Column=146432 Fraction=0.285 Mflops=8987440.56
Column=148992 Fraction=0.290 Mflops=8985408.19
Column=151552 Fraction=0.295 Mflops=8984725.40
Column=154112 Fraction=0.300 Mflops=8983018.04
Column=156672 Fraction=0.305 Mflops=8982367.39
Column=159232 Fraction=0.310 Mflops=8980273.27
Column=161792 Fraction=0.315 Mflops=8978828.22
Column=164352 Fraction=0.320 Mflops=8977845.23
Column=166912 Fraction=0.325 Mflops=8976481.14
Column=169472 Fraction=0.330 Mflops=8975433.75
Column=172288 Fraction=0.335 Mflops=8973274.72
Column=174848 Fraction=0.340 Mflops=8972909.28
Column=177408 Fraction=0.345 Mflops=8970859.30
Column=179968 Fraction=0.350 Mflops=8970341.78
Column=182528 Fraction=0.355 Mflops=8968542.31
Column=185088 Fraction=0.360 Mflops=8966975.75
Column=187648 Fraction=0.365 Mflops=8966226.69
Column=190208 Fraction=0.370 Mflops=8964635.98
Column=192768 Fraction=0.375 Mflops=8963576.43
Column=195328 Fraction=0.380 Mflops=8961711.48
Column=197888 Fraction=0.385 Mflops=8960941.17
Column=200448 Fraction=0.390 Mflops=8959099.64
Column=203008 Fraction=0.395 Mflops=8958368.70
Column=205568 Fraction=0.400 Mflops=8956500.55
Column=208128 Fraction=0.405 Mflops=8955047.72
Column=210688 Fraction=0.410 Mflops=8954196.09
Column=213248 Fraction=0.415 Mflops=8952260.15
Column=215808 Fraction=0.420 Mflops=8951470.19
Column=218368 Fraction=0.425 Mflops=8949359.59
Column=220928 Fraction=0.430 Mflops=8948666.39
Column=223488 Fraction=0.435 Mflops=8946977.25
Column=226048 Fraction=0.440 Mflops=8946186.36
Column=228608 Fraction=0.445 Mflops=8944377.15
Column=231168 Fraction=0.450 Mflops=8942976.66
Column=233728 Fraction=0.455 Mflops=8941888.74
Column=236288 Fraction=0.460 Mflops=8940228.88
Column=238848 Fraction=0.465 Mflops=8938968.04
Column=241408 Fraction=0.470 Mflops=8937085.38
Column=243968 Fraction=0.475 Mflops=8936307.90
Column=246528 Fraction=0.480 Mflops=8934694.85
Column=249088 Fraction=0.485 Mflops=8933727.46
Column=251648 Fraction=0.490 Mflops=8931890.27
Column=254208 Fraction=0.495 Mflops=8930422.32
Column=264704 Fraction=0.515 Mflops=8924848.87
Column=274944 Fraction=0.535 Mflops=8919322.46
Column=285184 Fraction=0.555 Mflops=8913957.59
Column=295424 Fraction=0.575 Mflops=8908322.24
Column=305664 Fraction=0.595 Mflops=8902968.88
Column=315904 Fraction=0.615 Mflops=8897533.79
Column=326144 Fraction=0.635 Mflops=8892400.35
Column=336384 Fraction=0.655 Mflops=8887327.38
Column=346880 Fraction=0.675 Mflops=8882110.90
Column=357120 Fraction=0.695 Mflops=8877203.68
Column=408320 Fraction=0.795 Mflops=8855163.35
Column=459776 Fraction=0.895 Mflops=8838741.82
Column=510976 Fraction=0.995 Mflops=8831508.36
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R4      513536   256    16    18           10224.81            8.83015e+03
HPL_pdgesv() start time Sat Aug 1 07:22:34 2015

HPL_pdgesv() end time   Sat Aug 1 10:12:59 2015

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0010124 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
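The Gflops figure above can be converted into an efficiency percentage once a theoretical peak is chosen. The sketch below is only an estimate under stated assumptions that do not come from the thread: a 2.6 GHz base clock for the 2690 v3 and 16 double-precision flops per cycle per core (AVX2 with FMA). The node count of 12 follows from `-n 288` with `--perhost 24` in the command line. How the percentages quoted in this thread were computed is not stated, so the basis may differ.

```python
def theoretical_peak_gflops(nodes: int, cores_per_node: int,
                            clock_ghz: float, flops_per_cycle: int) -> float:
    """Rpeak in Gflops = nodes * cores * GHz * flops per cycle per core."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def efficiency_percent(rmax_gflops: float, rpeak_gflops: float) -> float:
    """HPL efficiency as measured Rmax divided by theoretical Rpeak."""
    return 100.0 * rmax_gflops / rpeak_gflops


if __name__ == "__main__":
    # Assumed hardware parameters (not given in the thread): 2.6 GHz, 16 flops/cycle.
    rpeak = theoretical_peak_gflops(nodes=12, cores_per_node=24,
                                    clock_ghz=2.6, flops_per_cycle=16)
    rmax = 8.83015e3  # Gflops reported for N=513536, P=16, Q=18
    print(f"Rpeak = {rpeak:.1f} Gflops, efficiency = "
          f"{efficiency_percent(rmax, rpeak):.1f}%")
```

With these assumptions the run comes out at a little under 74%, close to the figure reported for the slower enclosures.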

Murat_G_Intel · 08-04-2015

Hi Reza,

The performance drop toward the end is normal, since the benchmark becomes less compute-intensive as it nears completion. Do you see the expected efficiency numbers now?
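To judge how large the drop actually is, the progress lines can be parsed and the first and last Mflops values compared. This is a minimal sketch, assuming the `Column=... Fraction=... Mflops=...` line format shown in the output above; the helper names are illustrative.

```python
import re

# Matches progress lines such as:
#   Column=002816 Fraction=0.005 Mflops=9061105.31
PROGRESS = re.compile(r"Column=(\d+)\s+Fraction=([\d.]+)\s+Mflops=([\d.]+)")


def parse_progress(text: str) -> list[tuple[int, float, float]]:
    """Return (column, fraction, mflops) tuples found in HPL output text."""
    return [(int(c), float(f), float(m)) for c, f, m in PROGRESS.findall(text)]


def relative_drop_percent(rows: list[tuple[int, float, float]]) -> float:
    """Percentage decrease from the first to the last reported Mflops value."""
    first, last = rows[0][2], rows[-1][2]
    return 100.0 * (first - last) / first


if __name__ == "__main__":
    # Three lines copied from the 288-rank run posted above.
    sample = """\
Column=002816 Fraction=0.005 Mflops=9061105.31
Column=254208 Fraction=0.495 Mflops=8930422.32
Column=510976 Fraction=0.995 Mflops=8831508.36
"""
    rows = parse_progress(sample)
    print(f"drop from first to last report: {relative_drop_percent(rows):.2f}%")
```

For the posted run the decrease from the first to the last report is about 2.5%.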

If you are using the offload version, you can start by running 1 MPI process per node (instead of 1 MPI process per core). So, if you have 12 Xeon nodes, you can use 12 MPI processes. After getting good numbers with 1 MPI process per node, you can try running 1 MPI process per CPU socket by using the runme_offload_intel64 script.

Thank you.

Reza_M_1 · 08-04-2015

Hi Murat,

Thanks for the reply; I am still working on it. Using runme_intel64 I get good performance per node, but the customer needs the total performance. When I run ./xhpl from the mp_linpack directory, some nodes perform worse and the overall result is lower.

However, the performance of some enclosures varies from run to run between 75% and 85%, and I don't know why. Could it be a heat problem?

How can I change the number of CPUs or sockets in runme_offload_intel64?

I appreciate your quick reply,

Regards,

Reza

Murat_G_Intel · 08-04-2015

Hi Reza,

Yes, it may be an overheating/TDP issue. Please exclude those unstable nodes from the run until the hardware-related problems are resolved.

For the runme_offload_intel64 script: if you have 12 nodes (2 sockets each), please modify lines 31 and 34 of the script as shown below (this runs 1 MPI process per socket):

31 export MPI_PROC_NUM=24

34 export MPI_PER_NODE=2

Also, please try to choose P and Q numbers close to each other. For the same example, there are 24 total MPI processes and you can try P=4, Q=6 (or P=6, Q=4).
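One way to pick such a grid is to take the factor pair of the process count that is closest to square. The helper below is an illustrative sketch (the name `near_square_grid` is hypothetical), returning P <= Q.

```python
import math


def near_square_grid(nprocs: int) -> tuple[int, int]:
    """Return (P, Q) with P * Q == nprocs, P <= Q, and P as large as possible."""
    p = math.isqrt(nprocs)
    while nprocs % p != 0:   # walk down until p divides nprocs (p=1 always does)
        p -= 1
    return p, nprocs // p


if __name__ == "__main__":
    for n in (24, 288, 384):
        p, q = near_square_grid(n)
        print(f"{n} MPI processes -> P={p}, Q={q}")
```

For 24 processes this yields P=4, Q=6 as in the example above. For 288 and 384 processes it yields 16 x 18 and 16 x 24, the grids already used in the runs posted earlier in the thread.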

Also, please try NB=192 for v3 systems.

Thank you.

MChun4 · 02-20-2017

Dear all,

I have a problem with the results of MKL MP LINPACK. My system has 24 compute nodes, each with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz processors, a Xeon Phi Q7200, and 256 GB of RAM. On each node I run ./runme_intel64 and the performance is good, ~700-900 GFlops (Xeon CPU only).

But when I run HPL on 4 nodes, 8 nodes, or more, the result is very bad, and sometimes it fails to return a result with the error: MPI TERMINATED,... After that, when I run the test (runme_intel64) on each node again, the performance is very low:

~ 11,243 GFlops,

~ 10,845 GFlops,

....

I don't know the reason. My guess is that the cluster's power supply is insufficient for the whole system, and that the HPE BIOS is configured in Balanced Mode for the cluster (it automatically switches to a lower power mode when the system cannot get enough power). But when I run on only some of the nodes and set the power to maximum, the problem is still not solved.

Please help me with this problem. Thank you all!