Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Fails for rank communication between different nodes

newbird
Beginner

I have built a cluster with Docker using two nodes. mpirun --version reports:

root@oneapi-spark14:~/pWord2Vec/scripts# mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
Copyright 2003-2020, Intel Corporation.

On one of my Docker nodes, I run a command like this:

mpirun -v -check_mpi -host localhost -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 Sendrecv : -host oneapi-spark15 -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 exit

I have exported FI_PROVIDER=tcp, and the error message is:

root@oneapi-spark14:~/pWord2Vec/scripts# mpirun -v -check_mpi -host localhost -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 Sendrecv : -host oneapi-spark15 -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 exit
[mpiexec@oneapi-spark14] Launch arguments: /opt/intel/oneapi/intelpython/latest/bin//hydra_bstrap_proxy --upstream-host oneapi-spark14 --upstream-port 24000 --pgid 0 --launcher ssh --launcher-number 0 --port-range 24000:24100 --base-path /opt/intel/oneapi/intelpython/latest/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /opt/intel/oneapi/intelpython/latest/bin//hydra_pmi_proxy --usize -1 --preload libVTmc.so --auto-cleanup 1 --abort-signal 9 
[mpiexec@oneapi-spark14] Launch arguments: /usr/bin/ssh -q -x oneapi-spark15 /opt/intel/oneapi/intelpython/latest/bin//hydra_bstrap_proxy --upstream-host oneapi-spark14 --upstream-port 24000 --pgid 0 --launcher ssh --launcher-number 0 --port-range 24000:24100 --base-path /opt/intel/oneapi/intelpython/latest/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /opt/intel/oneapi/intelpython/latest/bin//hydra_pmi_proxy --usize -1 --preload libVTmc.so --auto-cleanup 1 --abort-signal 9 
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:1@oneapi-spark15] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:1@oneapi-spark15] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:1@oneapi-spark15] PMI response: cmd=appnum appnum=1
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:1@oneapi-spark15] PMI response: cmd=my_kvsname kvsname=kvs_2039_0
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=get kvsname=kvs_2039_0 key=PMI_process_mapping
[proxy:0:1@oneapi-spark15] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@oneapi-spark14] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0@oneapi-spark14] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0@oneapi-spark14] PMI response: cmd=appnum appnum=0
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0@oneapi-spark14] PMI response: cmd=my_kvsname kvsname=kvs_2039_0
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=get kvsname=kvs_2039_0 key=PMI_process_mapping
[proxy:0:0@oneapi-spark14] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@oneapi-spark14] PMI response: cmd=barrier_out
[proxy:0:1@oneapi-spark15] PMI response: cmd=barrier_out
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=put kvsname=kvs_2039_0 key=bc-0 value=mpi#02005DE2AC1100020000000000000000$
[proxy:0:0@oneapi-spark14] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=put kvsname=kvs_2039_0 key=bc-1 value=mpi#02005DF5AC1100020000000000000000$
[proxy:0:1@oneapi-spark15] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@oneapi-spark14] PMI response: cmd=barrier_out
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=get kvsname=kvs_2039_0 key=bc-0
[proxy:0:0@oneapi-spark14] PMI response: cmd=get_result rc=0 msg=success value=mpi#02005DE2AC1100020000000000000000$
[proxy:0:0@oneapi-spark14] pmi cmd from fd 6: cmd=get kvsname=kvs_2039_0 key=bc-1
[proxy:0:0@oneapi-spark14] PMI response: cmd=get_result rc=0 msg=success value=mpi#02005DF5AC1100020000000000000000$
[proxy:0:1@oneapi-spark15] PMI response: cmd=barrier_out
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=get kvsname=kvs_2039_0 key=bc-0
[proxy:0:1@oneapi-spark15] PMI response: cmd=get_result rc=0 msg=success value=mpi#02005DE2AC1100020000000000000000$
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=get kvsname=kvs_2039_0 key=bc-1
[proxy:0:1@oneapi-spark15] PMI response: cmd=get_result rc=0 msg=success value=mpi#02005DF5AC1100020000000000000000$
Abort(69326863) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(159)...................: MPI_Send(buf=0xb8c4f0, count=30, MPI_CHAR, dest=0, tag=5987, MPI_COMM_WORLD) failed
MPID_Send(771)...................: 
MPIDI_send_unsafe(220)...........: 
MPIDI_OFI_send_lightweight(40)...: 
MPIDI_OFI_inject_handler_vci(671): OFI tagged inject failed (ofi_impl.h:671:MPIDI_OFI_inject_handler_vci:Connection refused)
[proxy:0:1@oneapi-spark15] pmi cmd from fd 4: cmd=abort exitcode=69326863

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 2044 RUNNING AT localhost
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

The cluster nodes can SSH to each other without a password.

root@oneapi-spark14:~/pWord2Vec/scripts# mpirun -n 1 -host oneapi-spark15 hostname
oneapi-spark15
root@oneapi-spark14:~/pWord2Vec/scripts# mpirun -n 1 -host localhost hostname
oneapi-spark14

I am also sure that firewalld is disabled and SELinux is disabled too. I don't know why this error occurs. My Docker image is based on Ubuntu, but it runs on CentOS hosts. Could that cause the error?
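
For reference, this is roughly how I set the provider before launching. The -genv flags to forward the variable to both ranks are only my guess from the Intel MPI documentation; in the run above I used a plain export only:

# set the libfabric provider in the shell that launches mpirun
export FI_PROVIDER=tcp
# forward it explicitly to every rank (my assumption; same command as above otherwise)
mpirun -v -check_mpi -genv FI_PROVIDER=tcp \
    -host localhost -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 Sendrecv : \
    -host oneapi-spark15 -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 exit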

7 Replies
newbird
Beginner

Another problem I found is with the environment over SSH.

 

root@oneapi-spark14:~/pWord2Vec/scripts# ssh oneapi-spark15 env | grep -i path
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
root@oneapi-spark14:~/pWord2Vec/scripts# ssh oneapi-spark15                   
Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 3.10.0-862.el7.x86_64 x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.
Last login: Mon Dec 28 09:13:35 2020 from 10.102.13.2
root@oneapi-spark15:~# env | grep -i ld_ 
LD_LIBRARY_PATH=/opt/intel/oneapi/itac/2021.1.1/slib:/opt/intel/oneapi/debugger/10.0.0/dep/lib:/opt/intel/oneapi/debugger/10.0.0/libipt/intel64/lib:/opt/intel/oneapi/debugger/10.0.0/gdb/intel64/lib:/opt/intel/oneapi/mpi/2021.1.1//libfabric/lib:/opt/intel/oneapi/mpi/2021.1.1//lib/release:/opt/intel/oneapi/mpi/2021.1.1//lib:/opt/intel/oneapi/mkl/latest/lib/intel64:/opt/intel/oneapi/ccl/2021.1.1/lib/cpu_gpu_dpcpp:/opt/intel/oneapi/compiler/2021.1.1/linux/lib:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/x64:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/emu:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/linux64/lib:/opt/intel/oneapi/compiler/2021.1.1/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.1.1/linux/compiler/lib:/opt/intel/oneapi/ipp/2021.1.1/lib/intel64:/opt/intel/oneapi/vpl/2021.1.1/lib:/opt/intel/oneapi/dal/2021.1.1/lib/intel64:/opt/intel/oneapi/dnnl/2021.1.1/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/ippcp/2021.1.1/lib/intel64:/opt/intel/oneapi/tbb/2021.1.1/export/../lib/intel64/gcc4.8

 

PrasanthD_intel
Moderator

Hi,


Thanks for reaching out to us.

Could you please elaborate on the node setup that you were using?

i) Do you have two separate nodes (oneapi-spark14 & oneapi-spark15) on which you have installed the oneAPI HPC Toolkit Docker containers?

ii) Or have you installed an HPC container on a single node?

iii) Or have you installed two containers on a single node?

Also, let us know which interconnect you are using.


Regarding "Another problem I found is with the environment over SSH":

It is not an error: the execution shell over SSH is non-interactive, which is why you will find fewer entries in $PATH.

Please refer to this link (Why does an SSH remote command get fewer environment variables then when run manually? - Stack Overflow) if you need a detailed answer.
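
As a quick illustration (only a sketch; adjust the setvars.sh path to your installation), you can compare a plain non-interactive SSH command with one that sources the oneAPI environment first. Only the second command will show the Intel oneAPI paths:

# non-interactive remote shell: only the default PATH is visible
ssh oneapi-spark15 'echo $PATH'
# source the oneAPI environment in the remote shell before inspecting it
ssh oneapi-spark15 'source /opt/intel/oneapi/setvars.sh > /dev/null; echo $PATH'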


Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


Could you please reply to our questions? It will help us gain better insights into what you were trying to do and what might have caused the error.


Regards

Prasanth


newbird
Beginner

Thanks for your reply!

First, I run the two Docker containers on two separate machines, and the Docker image is built with the oneAPI HPC Toolkit installed.

The Dockerfile is:

# Copyright (c) 2019-2020 Intel Corporation.
# SPDX-License-Identifier: BSD-3-Clause

# requires oneapi-basekit image, assumes oneapi dnf/yum repo is configured
ARG base_image="intel/oneapi-basekit:devel-centos8"
FROM "$base_image"

# install Intel(R) oneAPI HPC Toolkit
RUN dnf install -y \
intel-hpckit-getting-started \
intel-oneapi-clck \
intel-oneapi-common-licensing \
intel-oneapi-common-vars \
intel-oneapi-compiler-dpcpp-cpp-and-cpp-classic \
intel-oneapi-compiler-fortran \
intel-oneapi-dev-utilities \
intel-oneapi-inspector \
intel-oneapi-itac \
intel-oneapi-mpi-devel \
--

# setvars.sh environment variables
ENV ACL_BOARD_VENDOR_PATH='/opt/Intel/OpenCLFPGA/oneAPI/Boards'
ENV ADVISOR_2021_DIR='/opt/intel/oneapi/advisor/2021.1.1'
ENV APM='/opt/intel/oneapi/advisor/2021.1.1/perfmodels'
ENV CCL_ATL_TRANSPORT_PATH='/opt/intel/oneapi/ccl/2021.1.1/lib/cpu_gpu_dpcpp'
ENV CCL_CONFIGURATION='cpu_gpu_dpcpp'
ENV CCL_ROOT='/opt/intel/oneapi/ccl/2021.1.1'
ENV CLASSPATH='/opt/intel/oneapi/mpi/2021.1.1//lib/mpi.jar:/opt/intel/oneapi/dal/2021.1.1/lib/onedal.jar'
ENV CLCK_ROOT='/opt/intel/oneapi/clck/2021.1.1'
ENV CMAKE_PREFIX_PATH='/opt/intel/oneapi/vpl:/opt/intel/oneapi/tbb/2021.1.1/env/..:'
ENV CONDA_DEFAULT_ENV='base'
ENV CONDA_EXE='/opt/intel/oneapi/intelpython/latest/bin/conda'
ENV CONDA_PREFIX='/opt/intel/oneapi/intelpython/latest'
ENV CONDA_PROMPT_MODIFIER='(base) '
ENV CONDA_PYTHON_EXE='/opt/intel/oneapi/intelpython/latest/bin/python'
ENV CONDA_SHLVL='1'
ENV CPATH='/opt/intel/oneapi/dpl/2021.1.1/linux/include:/opt/intel/oneapi/dev-utilities/2021.1.1/include:/opt/intel/oneapi/mpi/2021.1.1//include:/opt/intel/oneapi/mkl/latest/include:/opt/intel/oneapi/ccl/2021.1.1/include/cpu_gpu_dpcpp:/opt/intel/oneapi/compiler/2021.1.1/linux/include:/opt/intel/oneapi/ipp/2021.1.1/include:/opt/intel/oneapi/vpl/2021.1.1/include:/opt/intel/oneapi/dal/2021.1.1/include:/opt/intel/oneapi/dnnl/2021.1.1/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/ippcp/2021.1.1/include:/opt/intel/oneapi/tbb/2021.1.1/env/../include'
ENV CPLUS_INCLUDE_PATH='/opt/intel/oneapi/clck/2021.1.1/include'
ENV DAALROOT='/opt/intel/oneapi/dal/2021.1.1'
ENV DALROOT='/opt/intel/oneapi/dal/2021.1.1'
ENV DAL_MAJOR_BINARY='1'
ENV DAL_MINOR_BINARY='0'
ENV DNNLROOT='/opt/intel/oneapi/dnnl/2021.1.1/cpu_dpcpp_gpu_dpcpp'
ENV FI_PROVIDER_PATH='/opt/intel/oneapi/mpi/2021.1.1//libfabric/lib/prov:/usr/lib64/libfabric'
ENV INFOPATH='/opt/intel/oneapi/debugger/10.0.0/documentation/info/'
ENV INSPECTOR_2021_DIR='/opt/intel/oneapi/inspector/2021.1.1'
ENV INTELFPGAOCLSDKROOT='/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga'
ENV INTEL_LICENSE_FILE='/opt/intel/licenses:/root/intel/licenses:/opt/intel/oneapi/clck/2021.1.1/licensing:/opt/intel/licenses:/root/intel/licenses:/Users/Shared/Library/Application Support/Intel/Licenses'
ENV INTEL_PYTHONHOME='/opt/intel/oneapi/debugger/10.0.0/dep'
ENV IPPCP_TARGET_ARCH='intel64'
ENV IPPCRYPTOROOT='/opt/intel/oneapi/ippcp/2021.1.1'
ENV IPPROOT='/opt/intel/oneapi/ipp/2021.1.1'
ENV IPP_TARGET_ARCH='intel64'
ENV I_MPI_ROOT='/opt/intel/oneapi/mpi/2021.1.1'
ENV LANG='C.UTF-8'
ENV LD_LIBRARY_PATH='/opt/intel/oneapi/itac/2021.1.1/slib:/opt/intel/oneapi/debugger/10.0.0/dep/lib:/opt/intel/oneapi/debugger/10.0.0/libipt/intel64/lib:/opt/intel/oneapi/debugger/10.0.0/gdb/intel64/lib:/opt/intel/oneapi/mpi/2021.1.1//libfabric/lib:/opt/intel/oneapi/mpi/2021.1.1//lib/release:/opt/intel/oneapi/mpi/2021.1.1//lib:/opt/intel/oneapi/mkl/latest/lib/intel64:/opt/intel/oneapi/ccl/2021.1.1/lib/cpu_gpu_dpcpp:/opt/intel/oneapi/compiler/2021.1.1/linux/lib:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/x64:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/emu:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/linux64/lib:/opt/intel/oneapi/compiler/2021.1.1/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.1.1/linux/compiler/lib:/opt/intel/oneapi/ipp/2021.1.1/lib/intel64:/opt/intel/oneapi/vpl/2021.1.1/lib:/opt/intel/oneapi/dal/2021.1.1/lib/intel64:/opt/intel/oneapi/dnnl/2021.1.1/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/ippcp/2021.1.1/lib/intel64:/opt/intel/oneapi/tbb/2021.1.1/env/../lib/intel64/gcc4.8'
ENV LIBRARY_PATH='/opt/intel/oneapi/clck/2021.1.1/lib/intel64:/opt/intel/oneapi/mpi/2021.1.1//libfabric/lib:/opt/intel/oneapi/mpi/2021.1.1//lib/release:/opt/intel/oneapi/mpi/2021.1.1//lib:/opt/intel/oneapi/mkl/latest/lib/intel64:/opt/intel/oneapi/ccl/2021.1.1/lib/cpu_gpu_dpcpp:/opt/intel/oneapi/compiler/2021.1.1/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/compiler/2021.1.1/linux/lib:/opt/intel/oneapi/ipp/2021.1.1/lib/intel64:/opt/intel/oneapi/vpl/2021.1.1/lib:/opt/intel/oneapi/dal/2021.1.1/lib/intel64:/opt/intel/oneapi/dnnl/2021.1.1/cpu_dpcpp_gpu_dpcpp/lib:/opt/intel/oneapi/ippcp/2021.1.1/lib/intel64:/opt/intel/oneapi/tbb/2021.1.1/env/../lib/intel64/gcc4.8'
ENV MANPATH='/opt/intel/oneapi/itac/2021.1.1/man:/opt/intel/oneapi/clck/2021.1.1/man:/opt/intel/oneapi/debugger/10.0.0/documentation/man:/opt/intel/oneapi/mpi/2021.1.1/man::/opt/intel/oneapi/compiler/2021.1.1/documentation/en/man/common::::'
ENV MKLROOT='/opt/intel/oneapi/mkl/latest'
ENV NLSPATH='/opt/intel/oneapi/mkl/latest/lib/intel64/locale/%l_%t/%N'
ENV OCL_ICD_FILENAMES='libintelocl_emu.so:libalteracl.so:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/x64/libintelocl.so'
ENV ONEAPI_ROOT='/opt/intel/oneapi'
ENV PATH='/opt/intel/oneapi/inspector/2021.1.1/bin64:/opt/intel/oneapi/itac/2021.1.1/bin:/opt/intel/oneapi/itac/2021.1.1/bin:/opt/intel/oneapi/clck/2021.1.1/bin/intel64:/opt/intel/oneapi/debugger/10.0.0/gdb/intel64/bin:/opt/intel/oneapi/dev-utilities/2021.1.1/bin:/opt/intel/oneapi/intelpython/latest/bin:/opt/intel/oneapi/intelpython/latest/condabin:/opt/intel/oneapi/mpi/2021.1.1/libfabric/bin:/opt/intel/oneapi/mpi/2021.1.1/bin:/opt/intel/oneapi/vtune/2021.1.1/bin64:/opt/intel/oneapi/mkl/latest/bin/intel64:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/llvm/aocl-bin:/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/bin:/opt/intel/oneapi/compiler/2021.1.1/linux/bin/intel64:/opt/intel/oneapi/compiler/2021.1.1/linux/bin:/opt/intel/oneapi/compiler/2021.1.1/linux/ioc/bin:/opt/intel/oneapi/advisor/2021.1.1/bin64:/opt/intel/oneapi/vpl/2021.1.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
ENV PKG_CONFIG_PATH='/opt/intel/oneapi/inspector/2021.1.1/include/pkgconfig/lib64:/opt/intel/oneapi/vtune/2021.1.1/include/pkgconfig/lib64:/opt/intel/oneapi/mkl/latest/tools/pkgconfig:/opt/intel/oneapi/advisor/2021.1.1/include/pkgconfig/lib64:'
ENV PYTHONPATH='/opt/intel/oneapi/advisor/2021.1.1/pythonapi'
ENV SETVARS_COMPLETED='1'
ENV TBBROOT='/opt/intel/oneapi/tbb/2021.1.1/env/..'
ENV VPL_BIN='/opt/intel/oneapi/vpl/2021.1.1/bin'
ENV VPL_INCLUDE='/opt/intel/oneapi/vpl/2021.1.1/include'
ENV VPL_LIB='/opt/intel/oneapi/vpl/2021.1.1/lib'
ENV VPL_ROOT='/opt/intel/oneapi/vpl/2021.1.1'
ENV VTUNE_PROFILER_2021_DIR='/opt/intel/oneapi/vtune/2021.1.1'
ENV VT_ADD_LIBS='-ldwarf -lelf -lvtunwind -lm -lpthread'
ENV VT_LIB_DIR='/opt/intel/oneapi/itac/2021.1.1/lib'
ENV VT_MPI='impi4'
ENV VT_ROOT='/opt/intel/oneapi/itac/2021.1.1'
ENV VT_SLIB_DIR='/opt/intel/oneapi/itac/2021.1.1/slib'
ENV _CE_CONDA=''
ENV _CE_M=''

# prepare no password ssh
RUN yum -y update && \
    yum -y install passwd openssl openssh-server openssh-clients numactl

RUN ssh-keygen -f /root/.ssh/id_rsa -t rsa -N '' && cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
RUN /usr/bin/ssh-keygen -A

RUN echo "/usr/sbin/sshd" >> ~/.bashrc

RUN echo 'root:123456' |chpasswd  && \
        sed -ri 's/^#?PermitRootLogin\s+.*/PermitRootLogin yes/' /etc/ssh/sshd_config && \
        sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config

RUN echo "Host *" >> ~/.ssh/config && \
        echo "StrictHostKeyChecking no" >> ~/.ssh/config && \
        echo "LogLevel ERROR" >> ~/.ssh/config && \
        echo "UserKnownHostsFile /dev/null" >> ~/.ssh/config && \
        echo "port 2222" >> ~/.ssh/config

RUN mkdir /var/run/sshd  
EXPOSE 22

I guess the Docker network being different from the host network may be why ranks in different containers cannot communicate with each other. Inside Docker, `ifconfig -a` shows different interfaces than on the host machine.
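
If that is the cause, I suppose I could start the containers with host networking so that the Hydra/OFI port range is reachable from the peer node. This is only a sketch, and the image name below is a placeholder for my actual image:

# run the container directly on the host network so its ports are reachable,
# keeping sshd in the foreground as the container process
docker run -d --name oneapi-spark14 --network host my-oneapi-hpc-image /usr/sbin/sshd -D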

PrasanthD_intel
Moderator

Hi,


We are escalating this issue to the internal team for better support.

Sorry for the delay; they will get back to you soon.


Regards

Prasanth


Michael_Intel
Moderator
Accepted Solution

Hello,


Yes, you need to make sure that all MPI ranks have full visibility of your virtual (containerized) cluster. There should be a complete list of nodes in /etc/hosts, and each node / container should be able to SSH without a password into every peer node, with all of them sitting on the same network.


Also, please make sure to launch both MPI ranks with the same target application: in the command you posted, one rank runs only the Sendrecv benchmark, while the other rank targets the whole IMB benchmark suite.
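
For example, /etc/hosts on every container could list both nodes (the IP addresses below are placeholders for your actual container addresses):

# /etc/hosts on both containers - placeholder IPs, replace with your real ones
172.17.0.2   oneapi-spark14
172.17.0.3   oneapi-spark15

and the benchmark could then be launched with the same Sendrecv target on both ranks:

mpirun -check_mpi -host oneapi-spark14 -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 Sendrecv : \
       -host oneapi-spark15 -n 1 /opt/intel/oneapi/mpi/2021.1.1/bin/IMB-MPI1 Sendrecv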


Best regards,

Michael


Michael_Intel
Moderator

This issue has been resolved, and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

