Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL in Linpack runs on one core

phillippower
Beginner
I am attempting to run Linpack on a cluster. When using 4 nodes (Q=2, P=2), Linpack runs but the MKL DGEMM uses only a single core. When I run Linpack outside of the mpirun command on a single node (Q=1, P=1), MKL DGEMM uses all 8 cores. I set OMP_NUM_THREADS=8 as part of the run_xhpl script and have checked that this environment variable is correctly set on each node prior to running Linpack. Does anyone know why only one core is being used?

My launch script includes the following:
/opt/intel/impi/3.2/0.011/bin/mpdboot -r ssh --ncpus=8
/opt/intel/impi/3.2/0.011/bin/mpdtrace
/opt/intel/impi/3.2/0.011/bin/mpirun -ppn 1 -n 4 ./run_xhpl
/opt/intel/impi/3.2/0.011/bin/mpdallexit
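
For reference, a minimal sketch of what such a run_xhpl wrapper typically looks like (the actual script is not shown here; the xhpl binary name and the commented MKL_NUM_THREADS alternative are assumptions):

#!/bin/sh
# Hypothetical per-node wrapper -- a sketch only, not the poster's actual script.
export OMP_NUM_THREADS=8       # let MKL's threaded BLAS use all 8 cores on the node
# export MKL_NUM_THREADS=8     # MKL-specific override, if preferred
exec ./xhpl                    # HPL binary linked against MKL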

5 Replies
Tabrez_Ali
Beginner
Quoting - phillippower

dgemm is part of BLAS, not LINPACK (which, by the way, has long been superseded by LAPACK).

AFAIK MKL uses multithreaded BLAS 3, so your dgemm can use 8 cores. However, multithreaded BLAS 3 is for SMP machines only and therefore cannot directly run on distributed machines via MPI.

For dense matrix multiplication on distributed machines you need PBLAS: http://www.netlib.org/scalapack/pblas_qref.html
phillippower
Beginner
Thanks for your response. I do not want DGEMM to run over the cluster; the HPL application will run over the cluster using MPI (one instance of HPL per node), and each instance will call the local MKL DGEMM. Therefore I am expecting the standard MKL DGEMM to execute over the 8 cores (as defined by OMP_NUM_THREADS). Are my expectations just wrong?
TimP
Honored Contributor III
If you do wish to run OpenMP via mpirun, you should look up the provisions made by your choice of MPI, and consider that certain MPI implementations (including Intel's) require the use of MPI_Init_thread, in accordance with the MPI standard, with suitable arguments to permit running threaded processes. Did you follow the HPL method in the benchmarks/mp_linpack directory of a recent MKL distribution?
Tabrez_Ali
Beginner
Quoting - phillippower
Yes, that should be possible. For example, if you have 4 nodes with 8 cores each and you run your app as "mpirun -np 4 ./a.out" with OMP_NUM_THREADS set to 8, then each node should be able to utilize all 8 cores.

Though you want to make sure that the OMP_NUM_THREADS variable is being passed correctly.
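
A quick way to check that (a sketch only; it assumes Intel MPI's mpirun accepts the -genv option for pushing a variable to every rank, and reuses the 4-node layout from the launch script above):

/opt/intel/impi/3.2/0.011/bin/mpirun -ppn 1 -n 4 env | grep OMP_NUM_THREADS
# Shows the environment each MPI-spawned process actually inherits on its node.
/opt/intel/impi/3.2/0.011/bin/mpirun -ppn 1 -n 4 -genv OMP_NUM_THREADS 8 ./run_xhpl
# Forces the value onto every rank, independent of the remote login environment.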
TimP
Honored Contributor III
Quoting - Tabrez Ali
You would also gain from appropriate OpenMP affinity settings on each node (KMP_AFFINITY, for Intel OpenMP).
With recent versions of Intel MPI (or possibly certain others), you could try other divisions of work between OpenMP and MPI, such as
I_MPI_PIN_DOMAIN=omp
mpirun -np 8 ....
so as to run one OpenMP-threaded MPI process per socket.
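
For completeness, a sketch of that hybrid layout on the same 4-node cluster (assuming dual-socket nodes with 4 cores per socket; the -genv propagation and the KMP_AFFINITY=compact value are assumptions to adapt to your system):

/opt/intel/impi/3.2/0.011/bin/mpirun -ppn 2 -n 8 \
    -genv I_MPI_PIN_DOMAIN omp \
    -genv OMP_NUM_THREADS 4 \
    -genv KMP_AFFINITY compact \
    ./run_xhpl
# 2 ranks per node x 4 nodes = 8 ranks, one per socket, each running 4 OpenMP/MKL threads.
# Remember to adjust P x Q in HPL.dat so that P*Q = 8 (e.g. P=2, Q=4) for this layout.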