Poor performance of PDGEMM using hybrid MPI/OpenMP

Kai_Hebeler · ‎10-30-2012

Hi,

I need to compute products of large distributed matrices, and for that I use a hybrid MPI/OpenMP strategy. As a check, I ran some timing tests with the functions DGEMM and PDGEMM on a single node of a cluster, using 48 cores with OpenMP. I assumed that these two functions should be essentially equivalent on a single node. However, I observed a dramatic performance difference: DGEMM scales almost perfectly with the number of OpenMP threads, and I find a runtime of about 7 sec for the product of two dense 10000x10000 matrices using 48 cores. For PDGEMM the scaling behavior is much worse, and the same operation takes about 60 sec for the same matrices on the same machine. Is this behavior understood, and is there a way to fix it?
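For scale, the two runtimes quoted above can be converted into sustained floating-point rates. The sketch below is only an illustration: it assumes the usual 2n^3 operation count for a dense n x n matrix product, and the helper name `gemm_gflops` is made up for this example.

```python
def gemm_gflops(n, seconds):
    """Sustained GFLOP/s for the product of two dense n x n matrices."""
    flops = 2.0 * n**3  # one multiply and one add per inner-product term
    return flops / seconds / 1e9

# Runtimes quoted in the post: about 7 sec (DGEMM) and 60 sec (PDGEMM).
dgemm_rate = gemm_gflops(10000, 7.0)
pdgemm_rate = gemm_gflops(10000, 60.0)

print(f"DGEMM:  {dgemm_rate:.1f} GFLOP/s")
print(f"PDGEMM: {pdgemm_rate:.1f} GFLOP/s")
```

With these numbers DGEMM sustains roughly 286 GFLOP/s on the node and PDGEMM roughly 33 GFLOP/s, i.e. a factor of about 8.6 between the two.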

I use the Intel C++ compiler 12.1 and the corresponding MKL library. The link chain looks like:

/usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_scalapack_lp64.a /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a -Wl,--start-group /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_intel_lp64.a /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_intel_thread.a /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_core.a -Wl,--end-group

Thanks,

Kai

Murat_G_Intel · ‎10-31-2012

Hi Kai,

You're comparing PDGEMM against OpenMP DGEMM on a shared-memory system, right? DGEMM is specifically tuned for shared-memory systems, and PDGEMM typically has more data traffic due to the distributed matrices. You may want to make sure that the MPI implementation takes advantage of the shared-memory system. Intel MPI should do that automatically, so you may want to give it a try. However, I wouldn't expect PDGEMM to match the DGEMM performance.

How is the scaling of PDGEMM? Is it possible for you to provide timings for different numbers of processes (say 1, 2, 4, 8, 16, etc.)?

Thank you,

Efe

TimP · ‎10-31-2012

If you respond with further details, it would be useful to know the specifics of your system. Is it one where the default Intel MPI affinity for hybrid runs works? If not, are you taking steps to implement your own affinity? On a hyperthreaded system, MKL DGEMM would by default limit itself to 1 thread per core. Does the same happen with PDGEMM? Does PDGEMM optimize the number of MPI processes, e.g. 1, 2 or 3 ranks per socket, if on Westmere CPUs, for which the hybrid method is more likely to prove useful than on other Intel models?

Kai_Hebeler · ‎10-31-2012

Hi Efe,

thank you for your quick response. Yes, I did these tests on a shared-memory system. I understand that DGEMM is specifically tuned for these environments, and I don't expect PDGEMM to fully match the DGEMM performance. However, the observed difference was much larger than I expected. In particular, note that the matrix for PDGEMM in these tests was not distributed, i.e. the array descriptor is trivial, so the block size and the quality of the internode connection should make no difference.

I did some more detailed runtime tests and obtained the following numbers for different numbers of cores (runtimes in seconds):

cores    DGEMM    PDGEMM
    1      304       464
    2      161       231
    4       80       145
    8       41        84
   16       20        68
   32       11        56
   48        7        62

The PDGEMM times fluctuated significantly when using many cores. I think these numbers clearly indicate that there is some traffic problem for PDGEMM. So far I have used OpenMPI; I will try switching to Intel MPI next.

Regarding the specifics of the system: our local cluster consists of AMD Opteron CPUs. I plan to use this machine for tests only. For production runs I plan to use a cluster based on Intel Xeon X5650 CPUs (12 cores per node, InfiniBand).

Thank you,

Kai
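To make the scaling behavior in the timings above easier to read, one can compute speedup and parallel efficiency relative to the 1-core run. This is a small illustrative sketch, not part of the original measurements; it assumes the posted numbers are wall-clock seconds, and the helper name `scaling` is made up for the example.

```python
# Runtimes from the post, keyed by core count (assumed to be seconds).
dgemm = {1: 304, 2: 161, 4: 80, 8: 41, 16: 20, 32: 11, 48: 7}
pdgemm = {1: 464, 2: 231, 4: 145, 8: 84, 16: 68, 32: 56, 48: 62}

def scaling(times):
    """Return {cores: (speedup, efficiency)} relative to the 1-core time."""
    base = times[1]
    return {p: (base / t, base / (t * p)) for p, t in times.items()}

dgemm_scaling = scaling(dgemm)
pdgemm_scaling = scaling(pdgemm)

print("cores   DGEMM speedup / eff   PDGEMM speedup / eff")
for p in dgemm:
    ds, de = dgemm_scaling[p]
    ps, pe = pdgemm_scaling[p]
    print(f"{p:5d}   {ds:13.1f} / {de:4.2f}   {ps:14.1f} / {pe:4.2f}")
```

At 48 cores this gives a DGEMM speedup of about 43 (efficiency about 0.90), whereas PDGEMM reaches a speedup of only about 7.5 (efficiency about 0.16). PDGEMM efficiency is still about 0.69 at 8 cores and falls to about 0.43 at 16 cores, so the degradation sets in between those two points.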

TimP · ‎10-31-2012

For proper affinity in MPI/OpenMP hybrid mode on shared memory:

With OpenMPI, you will need a recent (beta?) version which supports affinity for hybrid mode, and you will need to read the instructions and set appropriate options. This probably means building your own copy of OpenMPI. I don't know about compatibility of newer OpenMPI with MKL.

Intel MPI will probably not default to doing the right thing on the AMD cluster. You will need to work with the I_MPI_PIN_DOMAIN settings. As far as Intel MPI is concerned, the experts on this will more likely be watching the companion HPC and cluster forum. OpenMPI has a good mailing list help forum.
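As a rough illustration of the kind of settings involved, the sketch below derives a threads-per-rank count and a matching environment for a hybrid run. It is an assumption-laden example, not a tested recipe: the helper `hybrid_env` is hypothetical, the 12-core node corresponds to the Xeon X5650 cluster mentioned earlier in the thread, and `I_MPI_PIN_DOMAIN=omp` (one pinning domain per rank, sized by `OMP_NUM_THREADS`) is only one of several possible Intel MPI choices and has no effect under OpenMPI.

```python
def hybrid_env(cores_per_node, ranks_per_node):
    """Environment settings for a hybrid MPI/OpenMP run (hypothetical helper)."""
    if cores_per_node % ranks_per_node != 0:
        raise ValueError("ranks_per_node must divide cores_per_node evenly")
    threads = cores_per_node // ranks_per_node
    return {
        "OMP_NUM_THREADS": str(threads),  # OpenMP threads per MPI rank
        "MKL_NUM_THREADS": str(threads),  # keep MKL threading consistent
        "I_MPI_PIN_DOMAIN": "omp",        # Intel MPI only: domain size = OMP_NUM_THREADS
    }

# 12-core node (two 6-core sockets) with 1, 2 or 3 ranks per socket.
for ranks in (2, 4, 6):
    env = hybrid_env(12, ranks)
    settings = " ".join(f"{key}={value}" for key, value in env.items())
    print(f"{ranks} ranks/node: {settings}")
```

The resulting variables would be exported before launching the MPI job. Which rank/thread split performs best has to be measured on the target machine.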