Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

How does l_mklb work when running across a cluster?


Hi all,

Currently attempting to run l_mklb across a 110-node cluster, but I don't fully understand the best syntax to run it with.

Relevant items:

P=20, Q=22, NB=192, N=1237056...

Inside the runme_intel64_static I set:

export MPI_PROC_NUM=440


export MPI_PER_NODE=4


#mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT <-- This was the original command

mpirun -np ${MPI_PROC_NUM} -machinefile hostlist /mnt/shared/benchmarks/runme_intel64_prv "$@" | tee -a $OUT
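For reference, here is one way the hostlist passed to -machinefile could be generated. This is a hypothetical sketch: Intel MPI accepts `hostname:count` entries in a machinefile, and the node naming scheme (node001..node110) is an assumption, not something from the post.

```shell
# Generate a machinefile with 4 ranks per node.
# Assumes nodes are named node001..node110 -- adjust to your cluster.
for i in $(seq -f "node%03g" 1 110); do
    echo "${i}:4"
done > hostlist
```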

Right now, on a 110-node cluster with 128GB RAM per node and Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz processors, I'm seeing starting numbers of around 150 TFlops.
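For context, a back-of-envelope calculation of the cluster's nominal peak and of the memory footprint of that N. The socket count (2 per node), 22 cores per socket, and 16 DP FLOPs/cycle (AVX2 FMA) are assumptions based on the E5-2699 v4 spec, not values from the post; the real AVX base clock is lower than the 2.2 GHz nominal clock used here.

```python
# Back-of-envelope sanity check for the reported ~150 TFlops.
nodes = 110
sockets_per_node = 2           # assumption: dual-socket nodes
cores_per_socket = 22          # E5-2699 v4
ghz = 2.2                      # nominal clock; AVX base clock is lower
flops_per_cycle = 16           # 2 x 256-bit FMA per cycle, double precision

cores = nodes * sockets_per_node * cores_per_socket
peak_tflops = cores * ghz * flops_per_cycle / 1000.0
print(f"cores = {cores}, nominal peak = {peak_tflops:.1f} TFlops")
print(f"efficiency at 150 TFlops = {150 / peak_tflops:.0%}")

# Memory check: the HPL matrix is N^2 * 8 bytes, spread over all nodes.
n = 1237056
per_node_gib = n * n * 8 / nodes / 2**30
print(f"matrix per node = {per_node_gib:.0f} GiB of 128 GB")
```

By this rough estimate 150 TFlops is already a large fraction of the nominal peak, so the headroom may be smaller than it first appears once AVX clock throttling is taken into account.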

I would expect to see more... So I guess my question is:

What are the best settings for runme_intel64_static?

On a normal HPL run I'd set the number of processes to the actual number of cores in the system, but if I do that with runme_intel64_static, I completely oversubscribe the nodes and performance goes through the floor.

If someone can explain what each variable does inside the script so I can work out how to saturate the cluster efficiently, that would be great.



1 Reply

Hi Chris,

How about trying one MPI rank per node, with the OpenMP thread count left at its default?

export MPI_PROC_NUM=110  # the number of physical servers; must equal P x Q

export MPI_PER_NODE=1

and leave the other settings in the script unchanged.
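Putting that together, a sketch of the one-rank-per-node setup might look like the following. Note that P x Q must equal the new rank count, so the original 20x22 grid has to change; P=10, Q=11 and the 44-thread value below are assumptions for a dual-socket 22-core node, not values confirmed in the thread.

```shell
# Hypothetical one-rank-per-node configuration for runme_intel64_static.
export MPI_PROC_NUM=110        # one rank per physical node; P x Q = 10 x 11
export MPI_PER_NODE=1
export OMP_NUM_THREADS=44      # assumed 2 sockets x 22 cores per node

mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT
```

The idea is to let MKL's threaded BLAS saturate each node's cores instead of packing multiple MPI ranks per node, which is what caused the oversubscription seen earlier.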

Best Regards,

