There is a long wait before mpiexec.hydra starts the bootstrap proxy.
How I start it:
export OMP_NUM_THREADS=4
export I_MPI_HYDRA_IFACE=ib0
export I_MPI_HYDRA_DEBUG=1
mpiexec.hydra -hostfile /home/software/hostfiles/hostfile_128 -genvall -ppn 12 ./wrf.exe
After waiting about 3 minutes, I get the following output:
[mpiexec@6248r-node128] Launch arguments: /opt/compiler/intel/oneapi/mpi/2021.2.0//bin//hydra_bstrap_proxy --upstream-host 18.104.22.168 --upstream-port 39744 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/compiler/intel/oneapi/mpi/2021.2.0//bin/ --tree-width 16 --tree-level 1 --iface ib0 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /opt/compiler/intel/oneapi/mpi/2021.2.0//bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
I don't know what mpiexec is doing before it launches the bootstrap proxy.
Why does it wait so long to start it?
The hostfile /home/software/hostfiles/hostfile_128 contains just one host, the one on which I run the command.
So I ran a similar command that achieves the same result but without the hostfile:
mpiexec.hydra -genvall -ppn 48 ./wrf.exe
I get the normal output immediately.
It seems that mpiexec.hydra takes a long time to resolve the hosts in the hostfile. I don't know if that's right.
I have added '22.214.171.124 6248r-node128' to /etc/hosts.
I found the problem, but I don't think I have found the best way to solve it.
When I delete the DNS servers from /etc/resolv.conf, the app starts immediately.
So I think the delay is caused by resolving the hostnames listed in the hostfile.
How can I disable remote DNS resolution? I just want to use the local /etc/hosts to resolve the hostnames.
I found 'I_MPI_HYDRA_NAMESERVER = hostname:port'. But I don't know how to set it. hostname=localhost, what about the port?
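One way to check whether name resolution is the slow step, independently of mpiexec.hydra, is to time the system resolver for each host in the hostfile. The following is only a minimal sketch: time_hosts is a hypothetical helper name, it assumes a Linux system where getent is available, and it assumes hostfile entries have the form "host" or "host:nprocs". mpiexec.hydra may perform other lookups as well, so a fast result here does not rule out DNS entirely.

```shell
#!/bin/sh
# Sketch: time how long the system resolver (NSS) takes for each host
# listed in a hostfile. time_hosts is a hypothetical helper, not part of
# Intel MPI.
time_hosts() {
  hostfile=$1
  while read -r entry _rest || [ -n "$entry" ]; do
    # Skip blank lines and comment lines.
    case "$entry" in ''|\#*) continue ;; esac
    host=${entry%%:*}          # assumption: drop an optional ":nprocs" suffix
    start=$(date +%s)
    rc=0
    # getent follows the lookup order configured in /etc/nsswitch.conf.
    getent hosts "$host" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    echo "$host rc=$rc seconds=$((end - start))"
  done < "$hostfile"
}
```

For example, after defining the function, run `time_hosts /home/software/hostfiles/hostfile_128`. A host that takes many seconds to resolve, or that returns a non-zero rc, would point at the resolver configuration rather than at the MPI launcher.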
Thanks for reaching out to us.
Could you please run the same steps with and without "export I_MPI_HYDRA_IFACE=ib0" and let us know whether the behavior changes?
We recommend using a command like the one below to run your sample. Since you are using a single node, you can use "-n 12" instead of "-ppn 12". The '-genvall' flag is not required in this context.
mpiexec.hydra -hostfile <path-of-hostfile> -n 12 ./wrf.exe
After running the program, could you please share the output logs for both cases, i.e. with and without "export I_MPI_HYDRA_IFACE=ib0"?
>>"I found 'I_MPI_HYDRA_NAMESERVER = hostname:port'. But I don't know how to set it. hostname=localhost, what about the port?"
Use the commands below:
hydra_nameserver &
I_MPI_HYDRA_NAMESERVER=`hostname` mpiexec.hydra -n 12 ./wrf.exe
Could you also tell us which job scheduler you are using?
Thanks & Regards,
I have resolved the problem. The delay is caused by the nameserver.
Hosts in the hostfile are resolved remotely through the DNS servers listed in /etc/resolv.conf. After I delete all DNS servers from /etc/resolv.conf, the app starts immediately after I press Enter.
But I don't think deleting the DNS server IPs from /etc/resolv.conf is a good way to solve the problem.
I'm looking for a better way, such as disabling remote DNS or ignoring /etc/resolv.conf when running mpiexec.hydra, or explicitly telling it to use local name resolution (/etc/hosts).
This issue is outside the scope of the Intel MPI Library. Please give /etc/hosts higher priority than the DNS resolver, using the method appropriate for your OS distribution. Discussions of the following nature might help you,
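As an illustration of what "higher priority on /etc/hosts" usually means on glibc-based Linux systems: the lookup order is typically set by the "hosts:" line in /etc/nsswitch.conf, and listing "files" before "dns" makes /etc/hosts win. The sketch below only inspects that order. files_before_dns is a hypothetical helper name, and the glibc NSS layout is an assumption; other distributions or resolver daemons may configure this differently.

```shell
#!/bin/sh
# Sketch: report whether "files" (/etc/hosts) is consulted before "dns"
# on the "hosts:" line of an nsswitch.conf-style file.
# files_before_dns is a hypothetical helper; assumes glibc NSS syntax.
files_before_dns() {
  awk '$1 == "hosts:" {
         for (i = 2; i <= NF; i++) {
           if ($i == "files") { print "yes"; exit }
           if ($i == "dns")   { print "no";  exit }
         }
         print "unknown"   # neither source is listed on the hosts: line
         exit
       }' "${1:-/etc/nsswitch.conf}"
}
```

Running `files_before_dns` with no argument checks /etc/nsswitch.conf. If it prints "no", editing the line to read something like "hosts: files dns" (as root) is the usual fix; if the file has no "hosts:" line, the function prints nothing.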
Is there anything else I can help you with?
This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.