The problem appears to be related to my configuration, where each host has a different number of cores. I'm using the MPI library from Studio v2020 Update 1 with examples written in Fortran.
In this example, nas3 is an 18-core host and mirella3 is an 8-core host. Each host is configured with dual 10GigE NICs using adaptive load balancing on a private subnet. The hosts are running a current version of Ubuntu.
mpiexec -genvall -genv I_MPI_DEBUG=5 \
-genv I_MPI_FABRICS=shm:ofi \
-genv I_MPI_ADJUST_ALLTOALL=2 \
-host nas3 -n 8 ./ft.$1.16 : \
-host mirella3 -n 8 ./ft.$1.16
The first host pins its 8 ranks on alternate cores, but the second host's pinning is a complete mystery. The problem looks different, and even worse, when the first host is the 8-core host. I've experimented with explicit pinning directives for each host and with -ppn, without affecting the outcome.
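For the record, the per-host explicit pinning experiments looked roughly like this. This is a sketch, not the exact command: the core lists are illustrative (even cores on the 18-core nas3, all cores on mirella3), though I_MPI_PIN_PROCESSOR_LIST itself is the standard Intel MPI pinning variable.

```shell
# Sketch only: core lists are illustrative, not the exact values used.
# -genv applies globally; -env applies only to its own argument set (host).
mpiexec -genvall -genv I_MPI_DEBUG=5 \
    -genv I_MPI_FABRICS=shm:ofi \
    -genv I_MPI_ADJUST_ALLTOALL=2 \
    -host nas3     -n 8 -env I_MPI_PIN_PROCESSOR_LIST=0,2,4,6,8,10,12,14 ./ft.$1.16 : \
    -host mirella3 -n 8 -env I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7    ./ft.$1.16
```

Regardless of the lists given, the second host's pinning came out wrong.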
v2018 (pre-OFI) pins the processes as I direct using simply -host x -n nx ./app.
Suggestions? This really bad pinning behavior makes OFI performance look sad.
I have looked at your logs and observed that every time you launch MPI ranks on the nas3 node, the process IDs are all zeroes. Is that the problem, or am I missing something?
Could you please run the application on nas3 only and post the debug info?
The last host listed on the mpiexec command is the one that displays the 0s. It's a symptom of the problem. One result is poorer performance when using more hosts. The problem is less severe when the first hosts on the mpiexec list have more cores than the remaining hosts. The problem looks worse when the first host has 4 cores and the last host has more.
Not using a job scheduler. Using mpiexec in a bash script. I think mpiexec is linked to mpiexec.hydra.
Single program on all hosts. The same executable is run by all MPI ranks. The compilers and MPI and my source and executables all reside on a shared filesystem.
The attached file is an example running the same program with the same compiler but with MPI v2018.5. I'm including it here to show that the previous version treated my use of the -n option correctly. The file also includes two different data cases: Class C, where the array sizes are 512x512x512, and Class D, where the array sizes are 2048x1024x1024. All data is double precision complex (16 bytes). Same host environment as I've previously reported. The v2018.5 version also lets me change the order in which I reference hosts on the mpiexec command, and it delivers better (lower) wall clock times for each data case.
I do have working experience with the machinefile option. I performed tests using this approach and got the same mishandling of pinning as with the -n/-host options on the mpiexec command line. The machinefile option was just a bit more cumbersome to use with other placement options while I tried to find a solution to this problem. I also couldn't see how to make use of I_MPI_PIN_PROCESSOR_LIST with a machinefile, and that PIN option is mishandled by this same problem. In earlier versions of 2019 I had to set FI_SOCKETS_IFACE for each host, and this, too, didn't seem to fit with a machinefile.
I'll make the test runs you requested here a bit later on today.
We have seen that you are using Ubuntu 20, which is not supported yet.
We support Ubuntu 16.04, 18.04, and 19.04.
For more info, please refer to the System Requirements section in:
Which job scheduler were you using?
Could you try once with the legacy mpiexec.hydra? You can find the executable in the (impi/2019.7.217/intel64/bin/legacy) folder. Please tell us if you are still facing the issue.
Also, since you are launching a single program on multiple nodes rather than multiple programs, instead of using argument sets could you try a machine file to launch MPI on multiple nodes:
nas3:8
mirella3:8
and launch MPI by providing this file with the -machinefile option:
mpirun -machinefile hosts.txt -n 16 ./hello
For more info on writing a machinefile and controlling process placement, please refer to this link: https://software.intel.com/content/www/us/en/develop/articles/controlling-process-placement-with-the-intel-mpi-library.html
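As a concrete sketch of the steps above (the file name hosts.txt and the ./hello binary are placeholders; the mpirun line is commented out because it needs the cluster):

```shell
# Create the two-host machinefile described above (host:ranks per line).
cat > hosts.txt <<'EOF'
nas3:8
mirella3:8
EOF

# Show what was written.
cat hosts.txt

# Launch 16 ranks total (8 per host) using the machinefile -
# uncomment and run this on the cluster:
# mpirun -machinefile hosts.txt -n 16 ./hello
```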
Sorry for the delay in responding.
We suggested the machinefile since MPMD argument sets are mainly used to run multiple executables, but the shortcomings you have mentioned are valid too. It mainly depends on the use case.
Coming to the debug reports you provided in ft_C_and_D_3hosts_v2018MPI_2 and Intel_1_test: everything went fine, but the order is reversed. Is that intentional?
And in Intel_2_test, in the second run, the distribution is incorrect: you specified nas3:12 and mirella3:4 in the machinefile, but MPI distributed the ranks as nas3:8 and mirella3:8. That was when using the legacy hydra, however. The legacy hydra is different from the hydra you will find in /bin; we suggested the legacy hydra because your CPU is old.
We haven't seen any errors in the reports from the runs with mpirun. Can you provide the same logs with I_MPI_DEBUG=10, so we can see the pinning information too?
I ran these tests listing nas3 as the first of the two hosts and with I_MPI_DEBUG=10. I also used mpiexec followed by legacy/mpiexec.hydra - see Intel_3_test.txt. Then I tried a run launched from nas3 with the same host list and got a different outcome from the runs launched from mirella3 (nas_Intel_3_test.txt).
No I_MPI_DEBUG log entries showed up, perhaps because we're running a test program that doesn't call anything in the MPI library. I'm going to switch to a simple, small pi example that makes just a few MPI calls. I'll send those results shortly.
From the logs, we can see that rank distribution is incorrect and in some cases, the processes were not even launched as per the machine file.
This issue seems like a bug so I am escalating this to our internal team.
We will get back to you soon.
Thanks for your help on this. It looks to me like this version is not correctly handling heterogeneous hosts, i.e., different numbers of cores on each host. With the 2018 version I can run the same mpirun with the same pinning selections from any of the hosts and get the same correct pinning, consistent across all the hosts.
FYI, the E5-2697 v4 was launched in Q1 2016. Besides supporting DDR4-2400 memory, it provides quad-channel memory access.