Hi all,
I'm currently attempting to run l_mklb across a 110-node cluster, but I don't think I understand the best syntax to run it with.
Relevant items:
P=20, Q=22, NB=192, N=1237056 (from HPL.dat)
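For completeness, those values sit in HPL.dat like this (abbreviated; the surrounding lines follow the stock HPL.dat layout that ships with the benchmark):
1            # of problems sizes (N)
1237056      Ns
1            # of NBs
192          NBs
1            # of process grids (P x Q)
20           Ps
22           Qs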
Inside the runme_intel64_static I set:
export MPI_PROC_NUM=440
export MPI_PER_NODE=4
# Original command:
#mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT
# What I run now:
mpirun -np ${MPI_PROC_NUM} -machinefile hostlist /mnt/shared/benchmarks/runme_intel64_prv "$@" | tee -a $OUT
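In case it matters: the hostlist I pass to -machinefile is just one entry per line, and (if I've read the Intel MPI docs right) it also accepts a host:ranks form, e.g. (node names hypothetical):
node001:4
node002:4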
Right now, on a 110-node cluster with 128 GB RAM per node (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz), I'm seeing initial numbers of around 150 TFlops.
I would expect to see more... So I guess my question is:
What are the best settings for runme_intel64_static?
On a normal HPL run I'd set the number of processes to the actual number of cores in the system, but if I do that with runme_intel64_static I completely oversubscribe the nodes and performance falls through the floor.
If someone can explain what each variable does inside the script so I can work out how to saturate the cluster efficiently, that would be great.
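For anyone reading along, the arithmetic I'm trying to get right looks roughly like this (a sketch, assuming dual-socket nodes, i.e. 2 x 22 = 44 cores per E5-2699 v4 node):
NODES=110
CORES_PER_NODE=44                                          # assumed: 2 sockets x 22 cores
export MPI_PER_NODE=4                                      # MPI ranks per node
export MPI_PROC_NUM=$((NODES * MPI_PER_NODE))              # 440 ranks total; must equal P x Q (20 x 22)
export OMP_NUM_THREADS=$((CORES_PER_NODE / MPI_PER_NODE))  # 11 OpenMP threads per rank -> all 44 cores busy, none oversubscribed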
Hi Chris,
How about trying one MPI rank per node, with the OpenMP thread count left at its default? That is:
export MPI_PROC_NUM=110   # the number of physical servers, which must equal P x Q (110 here)
export MPI_PER_NODE=1
and use the other configuration settings from:
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789
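Note that with 110 ranks, the process grid in HPL.dat must then match 110 as well, for example (a sketch; 10 x 11 is one near-square factorization of 110):
1            # of process grids (P x Q)
10           Ps
11           Qs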
Best Regards,
Ying
