Severe slowdown with PDSYGST and PDSYEVX for 64 cores

L__D__Marks · ‎08-01-2012

I am currently benchmarking a small cluster on a vendor's system and noticed a very severe slowdown with 64 cores, and a slight one even with 32 cores. The slowdown is specific to these two routines, which take twice as long with 64 cores as they do with 32. And, yes, I really do mean twice as long.

The vendor has composer_xe_2011_sp1.11.339, which I used for the tests. The MKL from composerxe-2011.3.174 (which I had access to) is slightly better, but not a lot. From /proc/cpuinfo these are Intel Xeon CPU E5-2660 0 @ 2.20GHz machines, 16 cores per node with IB, openmpi-1.4.5.

Any suggestions? (It is not a coding issue or anything like that; the code being used is a standard DFT code.)

Andrei_Moskalev__Int · ‎08-01-2012

Could you please try to run these tests with Intel MPI as well?

L__D__Marks · ‎08-02-2012

I have requested that the vendor install it. I will provide more information later today.

TimP · ‎08-02-2012

If you are using an MPI process per core, you should activate the core pinning option of OpenMPI, if you haven't done so, as well as using the mkl_sequential. Latest development versions of OpenMPI should include options to support multiple MPI/OpenMP hybrid processes per node, as Intel MPI has done for several years. Intel MPI also provides for recognition of active HyperThreading and using a single process per core; I'm doubtful of OpenMPI in that mode.

L__D__Marks · ‎08-02-2012

Thanks. I will also ask that they install the latest openmpi. They are currently doing some tests on the system to see if there is anything else going wrong.

N.B., hyperthreading is off, and I am using the sequential libraries.

L__D__Marks · ‎08-03-2012

Hmmm. Intel MPI appears to be substantially faster and does not have the same scaling problems. Since I am borrowing use of a test cluster I cannot say exactly what the issue was.

I have posted information on a listserver for the specific DFT code (Wien2k), since others may want to start using Intel MPI as the current version seems to be rather good.

For the record, here are the timings; on each line the first number is the number of cores and the second the number of nodes:

16 1: TIME HAMILT (CPU) = 7.5, HNS = 8.0, DIAG = 61.3

32 2: TIME HAMILT (CPU) = 5.1, HNS = 4.4, DIAG = 40.8

48 3: TIME HAMILT (CPU) = 4.1, HNS = 3.2, DIAG = 31.8

64 4: TIME HAMILT (CPU) = 3.4, HNS = 2.6, DIAG = 25.1

The "HAMILT" and "HNS" parts of the code are mainly simple MPI, i.e. splitting of the effort over different machines. Both scale well with both Intel MPI and openmpi, with openmpi being perhaps slightly faster, although the difference was small enough to be noise.

The "DIAG" part of the code is dominated by the ScaLAPACK calls PDSYGST and PDSYEVX. This does not scale quite as well as the others, but does scale relatively well. These calls were scaling badly with the version of openmpi that I was provided with.
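
For anyone who wants to put numbers on the scaling, here is a minimal Python sketch that computes the speedup and parallel efficiency of each part relative to the 16-core (single node) run, using only the timings listed above. The names `timings`, `speedup` and `efficiency` are just illustrative and are not part of Wien2k; taking the 16-core run as the baseline is an assumption made for the comparison.

```python
# Timings copied from the list above: cores -> time per part.
# Only ratios are used, so the time unit does not matter here.
timings = {
    16: {"HAMILT": 7.5, "HNS": 8.0, "DIAG": 61.3},
    32: {"HAMILT": 5.1, "HNS": 4.4, "DIAG": 40.8},
    48: {"HAMILT": 4.1, "HNS": 3.2, "DIAG": 31.8},
    64: {"HAMILT": 3.4, "HNS": 2.6, "DIAG": 25.1},
}

BASE_CORES = 16  # assumed baseline: the single-node run


def speedup(part, cores):
    """Time of the baseline run divided by the time on `cores` cores."""
    return timings[BASE_CORES][part] / timings[cores][part]


def efficiency(part, cores):
    """Speedup divided by the ideal speedup (cores / BASE_CORES)."""
    return speedup(part, cores) * BASE_CORES / cores


# Print one line per core count and part.
for cores in sorted(timings):
    for part in ("HAMILT", "HNS", "DIAG"):
        print(f"{cores:3d} cores  {part:6s}  "
              f"speedup {speedup(part, cores):5.2f}  "
              f"efficiency {efficiency(part, cores):5.2f}")
```

An efficiency of 1.0 would mean perfect scaling from one node; values below that show how much is lost as nodes are added.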

N.B., if anyone wants to provide additional options to test to see if they make any difference I may still have access to the test cluster for a bit.