How does l_mklb work when running across a cluster?

Chris_C_2 · ‎07-09-2017

Hi all,

I'm trying to run l_mklb across a 110-node cluster, but I don't understand the best syntax for launching it.

Relevant items:

P=20, Q=22, NB=192, N=1237056...

Inside runme_intel64_static I set:

export MPI_PROC_NUM=440

export MPI_PER_NODE=4

#mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT <-- This was the original command

mpirun -np ${MPI_PROC_NUM} -machinefile hostlist /mnt/shared/benchmarks/runme_intel64_prv "$@" | tee -a $OUT
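As a quick sanity check on the settings above, the sketch below verifies that the P x Q grid matches the rank count and estimates how much of the cluster's RAM the N x N matrix of 8-byte doubles occupies. The helper names are hypothetical and not part of l_mklb, and "128GB" is assumed to mean 128 GiB per node.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of l_mklb.

# Succeeds when the P x Q grid matches the MPI rank count.
grid_matches_ranks() {
    [ $(( $1 * $2 )) -eq "$3" ]
}

# Prints the share of total cluster RAM (whole percent, rounded down)
# taken by an N x N matrix of 8-byte doubles.
# Assumption: "128GB" per node means 128 GiB.
matrix_mem_percent() {
    n=$1; nodes=$2; gib_per_node=$3
    matrix_bytes=$(( n * n * 8 ))
    total_bytes=$(( nodes * gib_per_node * 1024 * 1024 * 1024 ))
    echo $(( matrix_bytes * 100 / total_bytes ))
}

if grid_matches_ranks 20 22 440; then
    echo "grid OK: 20 x 22 = 440 ranks"
fi
echo "matrix uses about $(matrix_mem_percent 1237056 110 128)% of cluster RAM"
```

With the values from this post the grid matches and the matrix takes roughly 80% of total memory, so the problem size itself does not look like the obvious bottleneck.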

Right now, on a 110-node cluster with 128 GB RAM per node and Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz processors, I'm seeing starting numbers of around 150 TFlops.

I would expect to see more... So I guess my question is:

What are the best settings for runme_intel64_static?

On a normal HPL run I'd set the number of processes to the actual number of cores in the system, but if I do that with runme_intel64_static, I totally oversubscribe the nodes and performance goes through the floor.
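The oversubscription can be put into numbers. This is a minimal sketch under two assumptions that the thread does not confirm: each node has 44 physical cores (two 22-core E5-2699 v4 sockets), and each rank starts one thread per core unless told otherwise. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of l_mklb.

# Threads each rank can use without oversubscribing the node.
threads_per_rank() {
    echo $(( $1 / $2 ))      # cores per node / ranks per node
}

# Total threads running on one node.
node_thread_load() {
    echo $(( $1 * $2 ))      # ranks per node * threads per rank
}

CORES_PER_NODE=44            # assumption: 2 sockets x 22 cores

echo "4 ranks per node: $(threads_per_rank $CORES_PER_NODE 4) threads each"
echo "1 rank per node:  $(threads_per_rank $CORES_PER_NODE 1) threads each"
echo "44 ranks x 44 threads: $(node_thread_load 44 44) threads on 44 cores"
```

Under those assumptions, one rank per core with a full thread team per rank puts 1936 threads on 44 cores, which would explain the collapse in performance.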

If someone can explain what each variable does inside the script so I can work out how to saturate the cluster efficiently, that would be great.

Ying_H_Intel · ‎07-11-2017

Hi Chris,

How about trying one MPI rank per node, with the OpenMP threads left at their default?

export MPI_PROC_NUM=110  # the number of physical servers (must equal P x Q), so probably 110 here

export MPI_PER_NODE=1

Other configuration settings are covered in:

https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789

https://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note
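Since MPI_PROC_NUM has to equal P x Q, changing the rank count also means picking a new grid in HPL.dat. The sketch below picks the factor pair closest to square with P no larger than Q, which is the usual HPL rule of thumb rather than anything specific to l_mklb. The function name best_grid is hypothetical.

```shell
#!/bin/sh
# Sketch only: best_grid is a hypothetical helper, not part of l_mklb.

# Prints "P Q" for the most nearly square grid with P <= Q and P * Q = ranks.
best_grid() {
    np=$1
    p=1
    i=1
    while [ $(( i * i )) -le "$np" ]; do
        if [ $(( np % i )) -eq 0 ]; then
            p=$i             # largest divisor found so far with i*i <= np
        fi
        i=$(( i + 1 ))
    done
    echo "$p $(( np / p ))"
}

echo "110 ranks -> P Q = $(best_grid 110)"
echo "440 ranks -> P Q = $(best_grid 440)"
```

For 110 ranks this gives a 10 x 11 grid, and for 440 ranks it reproduces the 20 x 22 grid from the original post.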

Best Regards,

Ying