Intel Optimized MP Linpack Benchmark and standard HPL

Nikita_Tropin · ‎04-19-2009

Hello,

Recently I ran two benchmarks on our cluster, which runs Windows Server 2003 Compute Cluster Edition:

1. Standard HPL benchmark 1.0a (from http://netlib.org/benchmark/hpl/index.html). I compiled and linked it with the Visual C++ compiler from the makefile, using Microsoft MPI and Intel MKL.

2. Intel Optimized MP Linpack benchmark (from http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download/). I compiled it with the Intel C++ compiler from the makefile, using Microsoft MPI and Intel MKL.

I ran both benchmarks with the same input data, but the second benchmark gave a result more than twice as high as the first.

The first benchmark gave me 62 Gflops, and the Intel benchmark gave 146 Gflops!

So the question is: do these benchmarks perform the same tests? Can I rely on the results of the second benchmark?

The output files differ a little:

Standard benchmark:

============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 41408
NB : 32
PMAP : Row-major process mapping
P : 4
Q : 8
PFACT : Crout
NBMIN : 4
NDIV : 2
RFACT : Right
BCAST : 1ring
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 16 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 2.220446e-016
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR10R2C4 41408 32 4 8 753.96 6.278e+001
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0133256 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0070420 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0012459 ...... PASSED
============================================================================

Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

Intel benchmark:

================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 41408
NB : 32
PMAP : Row-major process mapping
P : 4
Q : 8
PFACT : Crout
NBMIN : 4
NDIV : 2
RFACT : Right
BCAST : 1ring
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 16 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 2.220446e-016
- Computational tests pass if scaled residuals are less than 16.0

Column=000224 Fraction=0.005 Mflops=197242.13
Column=000416 Fraction=0.010 Mflops=185023.62
Column=000640 Fraction=0.015 Mflops=179754.12
Column=000832 Fraction=0.020 Mflops=174426.48
Column=001056 Fraction=0.025 Mflops=170493.51
Column=001248 Fraction=0.030 Mflops=169773.02
Column=001472 Fraction=0.035 Mflops=169746.92
Column=001664 Fraction=0.040 Mflops=157461.41
Column=001888 Fraction=0.045 Mflops=147856.56
Column=002080 Fraction=0.050 Mflops=149151.04
Column=002304 Fraction=0.055 Mflops=150093.02
Column=002496 Fraction=0.060 Mflops=150901.56
Column=002720 Fraction=0.065 Mflops=151617.65
Column=002912 Fraction=0.070 Mflops=151555.89
Column=003136 Fraction=0.075 Mflops=152465.69
Column=003328 Fraction=0.080 Mflops=152672.95
Column=003520 Fraction=0.085 Mflops=153281.83
Column=003744 Fraction=0.090 Mflops=153972.23
Column=003936 Fraction=0.095 Mflops=154343.33
Column=004160 Fraction=0.100 Mflops=154393.87
Column=004352 Fraction=0.105 Mflops=154647.82
Column=004576 Fraction=0.110 Mflops=154806.10
Column=004768 Fraction=0.115 Mflops=154507.35
Column=004992 Fraction=0.120 Mflops=154748.81
Column=005184 Fraction=0.125 Mflops=155379.89
Column=005408 Fraction=0.130 Mflops=155442.94
Column=005600 Fraction=0.135 Mflops=155853.20
Column=005824 Fraction=0.140 Mflops=156367.52
Column=006016 Fraction=0.145 Mflops=156368.74
Column=006240 Fraction=0.150 Mflops=156354.94
Column=006432 Fraction=0.155 Mflops=156812.31
Column=006656 Fraction=0.160 Mflops=156513.54
Column=006848 Fraction=0.165 Mflops=156714.70
Column=007040 Fraction=0.170 Mflops=156829.58
Column=007264 Fraction=0.175 Mflops=156808.21
Column=007456 Fraction=0.180 Mflops=156945.13
Column=007680 Fraction=0.185 Mflops=156774.87
Column=007872 Fraction=0.190 Mflops=157046.93
Column=008096 Fraction=0.195 Mflops=157228.48
Column=012224 Fraction=0.295 Mflops=158077.34
Column=016384 Fraction=0.395 Mflops=157597.59
Column=020512 Fraction=0.495 Mflops=155301.31
Column=024640 Fraction=0.595 Mflops=154463.04
Column=028800 Fraction=0.695 Mflops=151308.87
Column=032928 Fraction=0.795 Mflops=149119.66
Column=037088 Fraction=0.895 Mflops=147743.44
Column=041216 Fraction=0.995 Mflops=146640.57
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR10R2C4 41408 32 4 8 323.51 1.463e+002
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0011793 ...... PASSED
============================================================================

Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================
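As a cross-check on the two result lines above, the reported Gflops values can be recomputed from N and Time alone. The sketch below assumes both builds use the usual HPL operation count for an LU solve, (2/3)N^3 + (3/2)N^2 floating-point operations, divided by the wall time; the function name `hpl_gflops` is made up for this illustration and is not part of either benchmark.

```python
def hpl_gflops(n, seconds):
    """Rate implied by the standard HPL operation count (assumed here):
    (2/3)*n**3 + (3/2)*n**2 floating-point operations in `seconds`."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1.0e9

N = 41408  # same problem size in both runs

# Times taken from the two result lines posted above.
standard = hpl_gflops(N, 753.96)   # log reports 6.278e+001
intel = hpl_gflops(N, 323.51)      # log reports 1.463e+002

print(f"standard HPL : {standard:.2f} Gflops")
print(f"Intel MP LINPACK: {intel:.2f} Gflops")
print(f"speed ratio  : {intel / standard:.2f}x")
```

If both recomputed values agree with the logs, the two binaries are reporting the same amount of work for the same N, so the gap comes entirely from the elapsed time rather than from a different way of counting operations.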

Andrei_Moskalev__Int · ‎04-21-2009

Hi,

The MKL MP LINPACK benchmark code is similar to the HPL MP LINPACK code (except in the hybrid case); you can compare the sources of MKL MP LINPACK with those of HPL MP LINPACK. Your results look strange. Are these results reproducible? Could you also show the linking log? Also, try compiling the HPL code with the Intel compiler and compare the result to the MKL MP LINPACK results.
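One way to check reproducibility is to pull the result line out of each output file and compare the Gflops values across runs. The following is only a rough sketch: the regular expression assumes the result-line layout shown in the logs in this thread (T/V, N, NB, P, Q, Time, Gflops), and `parse_results` and `relative_spread` are hypothetical helper names. The sample text holds just the two lines already posted; output from repeated runs of the same binary would be fed in the same way.

```python
import re

# Matches an HPL result line such as:
#   WR10R2C4 41408 32 4 8 753.96 6.278e+001
# (layout assumed from the logs posted in this thread)
RESULT_LINE = re.compile(
    r"^(W[RC]\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.eE+-]+)$"
)

def parse_results(text):
    """Return one dict per result line found in HPL output text."""
    rows = []
    for line in text.splitlines():
        m = RESULT_LINE.match(line.strip())
        if m:
            rows.append({
                "variant": m.group(1),
                "n": int(m.group(2)),
                "nb": int(m.group(3)),
                "p": int(m.group(4)),
                "q": int(m.group(5)),
                "time": float(m.group(6)),
                "gflops": float(m.group(7)),
            })
    return rows

def relative_spread(values):
    """(max - min) / min; close to 0 when runs are reproducible."""
    return (max(values) - min(values)) / min(values)

sample = """\
T/V N NB P Q Time Gflops
WR10R2C4 41408 32 4 8 753.96 6.278e+001
WR10R2C4 41408 32 4 8 323.51 1.463e+002
"""

rows = parse_results(sample)
rates = [r["gflops"] for r in rows]
print(rows)
print(f"relative spread: {relative_spread(rates):.2f}")
```

A spread of a few percent between repeated runs of one binary would be ordinary noise; a spread as large as the one between these two lines points to a real difference between the builds.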