Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL in Linpack runs on one core

phillippower
Beginner
I am attempting to run Linpack on a cluster. When using 4 nodes (Q=2, P=2), Linpack runs but the MKL DGEMM is only using a single core. When I run Linpack outside of the mpirun command on a single node (Q=1, P=1), MKL DGEMM uses all 8 cores. I set OMP_NUM_THREADS=8 as part of the run_xhpl script and have checked that this environment variable is correctly set on each node prior to running Linpack. Does anyone know why only one core is being used?

My launch script includes the following:
/opt/intel/impi/3.2/0.011/bin/mpdboot -r ssh --ncpus=8
/opt/intel/impi/3.2/0.011/bin/mpdtrace
/opt/intel/impi/3.2/0.011/bin/mpirun -ppn 1 -n 4 ./run_xhpl
/opt/intel/impi/3.2/0.011/bin/mpdallexit
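
For reference, run_xhpl itself is essentially a wrapper of the following form (sketched approximately; the HPL binary name xhpl is an assumption):

#!/bin/sh
# run_xhpl (approximate contents): set the threading environment, then start the HPL binary
export OMP_NUM_THREADS=8
./xhpl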

5 Replies
Tabrez_Ali
Beginner
Quoting - phillippower
I am attempting to run Linpack on a cluster. When using 4 nodes (Q=2, P=2), Linpack runs but the MKL DGEMM is only using a single core. When I run Linpack outside of the mpirun command on a single node (Q=1, P=1), MKL DGEMM uses all 8 cores. I set OMP_NUM_THREADS=8 as part of the run_xhpl script and have checked that this environment variable is correctly set on each node prior to running Linpack. Does anyone know why only one core is being used?

My launch script includes the following:
/opt/intel/impi/3.2/0.011/bin/mpdboot -r ssh --ncpus=8
/opt/intel/impi/3.2/0.011/bin/mpdtrace
/opt/intel/impi/3.2/0.011/bin/mpirun -ppn 1 -n 4 ./run_xhpl
/opt/intel/impi/3.2/0.011/bin/mpdallexit

dgemm is a part of BLAS and not LINPACK (which btw has long been superseded by LAPACK)

Afaik MKL uses multithreaded BLAS 3, so your dgemm can use 8 cores. However, multithreaded BLAS 3 is for SMP machines only and therefore cannot directly run on distributed machines via MPI.

For dense matmul on distributed machines you need PBLAS http://www.netlib.org/scalapack/pblas_qref.html
phillippower
Beginner
Thanks for your response. I do not want DGEMM to run over the cluster... the HPL application will run over the cluster using MPI (one instance of HPL per node). Each instance of HPL on the nodes will call the local MKL library DGEMM. Therefore I am expecting the standard MKL DGEMM to execute over the 8 cores (as defined by OMP_NUM_THREADS). Are my expectations just wrong?
TimP
Honored Contributor III
If you do wish to run OpenMP via mpirun, you should look up the provisions made by your choice of MPI, and consider that certain MPI implementations (including Intel's) require the use of MPI_Init_thread in accordance with the MPI standard, with suitable arguments to permit running threaded processes. Did you follow the HPL method in the benchmarks/mp_linpack directory of a recent MKL distribution?
Tabrez_Ali
Beginner
Quoting - phillippower
Thanks for your response. I do not want DGEMM to run over the cluster... the HPL application will run over the cluster using MPI (one instance of HPL per node). Each instance of HPL on the nodes will call the local MKL library DGEMM. Therefore I am expecting the standard MKL DGEMM to execute over the 8 cores (as defined by OMP_NUM_THREADS). Are my expectations just wrong?
Yes, that should be possible. For example, if you have 4 nodes with 8 cores each and you run your app as "mpirun -np 4 ./a.out" with OMP_NUM_THREADS set to 8, then each node should be able to utilize all 8 cores.

Though you want to make sure that the OMP_NUM_THREADS variable is being passed correctly.
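
For example, with Intel MPI something like the following can be used to check what each rank actually sees, and to push the value explicitly (illustrative commands, matching the launch options above):

# Print the environment seen by each of the 4 ranks and filter for the variable
mpirun -ppn 1 -n 4 env | grep OMP_NUM_THREADS

# Or pass the value to every rank explicitly via -genv, independent of the remote shells
mpirun -genv OMP_NUM_THREADS 8 -ppn 1 -n 4 ./run_xhpl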
TimP
Honored Contributor III
Quoting - Tabrez Ali
Yes, that should be possible. For example, if you have 4 nodes with 8 cores each and you run your app as "mpirun -np 4 ./a.out" with OMP_NUM_THREADS set to 8, then each node should be able to utilize all 8 cores.

Though you want to make sure that the OMP_NUM_THREADS variable is being passed correctly.
You would also gain by appropriate OpenMP affinity settings on each node (KMP_AFFINITY, for Intel OpenMP).
With recent versions of Intel MPI (or possibly certain others), you could try other divisions of work between OpenMP and MPI, such as
I_MPI_PIN_DOMAIN=omp
mpirun -np 8 ....
so as to run one MPI process (with its own team of OpenMP threads) per socket.
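
A rough illustration, assuming 4 dual-socket nodes with 4 cores per socket (adjust the counts to your hardware):

# 8 MPI ranks, 2 per node, each rank pinned to one socket and running 4 OpenMP threads
mpirun -ppn 2 -np 8 \
       -genv OMP_NUM_THREADS 4 \
       -genv KMP_AFFINITY compact \
       -genv I_MPI_PIN_DOMAIN omp \
       ./run_xhpl
# P x Q in HPL.dat must then be changed so that P*Q = 8 (e.g. P=2, Q=4).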