Hi gridpc g,
From http://ark.intel.com/compare/75272,81706, the two CPUs compare as follows:

| | Intel® Xeon® Processor E5-2660 v2 (25M Cache, 2.20 GHz) | Intel® Xeon® Processor E5-2660 v3 (25M Cache, 2.60 GHz) |
| --- | --- | --- |
| Microarchitecture | Ivy Bridge EP | Haswell |
| Instruction Set Extensions | AVX | AVX2 |
Could you please provide the details of your HPL.dat and tell us which binary you are running?
Are you using the Intel-optimized Linpack binary from https://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download?
Or are you building HPL yourself? If so, what are the compiler/link options? Also, please provide platform details for both systems.
Yes, providing the details above will help. The recommended HPL binary is the offload version: mp_linpack/bin_intel/intel64/xhpl_offload_intel64. The NB value can be set to 192 for v3 systems. You can first run this binary with 1 MPI rank per node; later you may try 1 MPI rank per socket to get the best performance, using the provided script: mp_linpack/bin_intel/intel64/runme_offload_intel64.
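For reference, here is a minimal sketch of the HPL.dat lines that usually matter most. The N and P x Q values are illustrative placeholders only (not recommendations for any particular cluster); the NB line reflects the value suggested above:

```
1          # of problem sizes (N)
100000     Ns       (illustrative; size N to fill most of total memory)
1          # of NBs
192        NBs      (block size suggested above for v3/Haswell systems)
1          # of process grids (P x Q)
4          Ps       (P x Q must equal the number of MPI ranks)
4          Qs
```

The remaining HPL.dat lines can be left at their shipped defaults for a first run.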
I have more or less the same problem. I have two sets of compute nodes (16 nodes with E5-2690 v2 processors and 64 nodes with E5-2690 v3 processors) on which I am trying to run the Linpack benchmark. First I ran Intel Parallel Studio's Linpack on the 16 v2 nodes and got about 92% efficiency, which is great.
In the second step I ran the same Linpack on 16 of the v3 nodes, but I only got 74% efficiency. The configuration is the same; I even reinstalled Parallel Studio on the v3 nodes, but the problem persists.
By the way, running Linpack on a single node gives 87% efficiency, but on 16 or 64 nodes it drops to 74%.
Please kindly support me in solving this problem.
Hi gridpc and Reza,
As Efe recommended, could you try the offload binary I mentioned above, even though there is no Xeon Phi in your systems? We will be avoiding this confusion soon by having just one single binary out there.
I tried make arch=intel64 version=offload, but there was an error saying it could not find the offload library. I then used the pre-built offload binary in the bin_intel64 folder, but the results were very poor and I could not understand the result format. Is the reported number per node, or the combination of all nodes?
On the other hand, CPU utilization was not balanced in this test: some cores were at 200% while others were idle.
HPL performance is limited by the slowest node, so please check whether every node individually reaches 87%. Performance also tends to be low if the problem size is too small, because communication overhead then dominates; please try increasing the problem size.
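As a rough rule of thumb (a common HPL sizing heuristic, not an official formula), N is chosen so the N x N double-precision matrix fills around 80% of total cluster memory, then rounded down to a multiple of NB. A quick sketch of that estimate (the 80% fraction and the example node count/memory are assumptions for illustration):

```python
import math

def hpl_problem_size(nodes, mem_gib_per_node, nb=192, mem_fraction=0.80):
    """Estimate the HPL problem size N so that the N x N matrix of
    8-byte doubles uses about `mem_fraction` of total cluster memory,
    rounded down to a multiple of the block size NB."""
    total_bytes = nodes * mem_gib_per_node * 1024**3
    n = int(math.sqrt(mem_fraction * total_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb  # keep N a multiple of NB

# Example: 16 nodes with 64 GiB of RAM each
print(hpl_problem_size(16, 64))  # -> 331584
```

If this estimated N is much larger than what you were using, that alone could explain a drop in efficiency when scaling from 1 to 16 or 64 nodes.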