dona__lorenzo
Beginner

IntelMPI Intel19 update 4 error

Dear all,

Good afternoon. I successfully installed Intel Parallel Studio 19 Update 4 on my cluster, which runs Ubuntu 18.04 LTS. The cluster consists of 4 nodes: a master and 3 other nodes where I run my calculations. I am able to run calculations on the master only, on the nodes only, or together. However, when I request the master plus one of the nodes, I receive this error message:

Abort(543240207) on node 7 (rank 7 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(452)...................: MPI_Bcast(buf=0x5b7cee0, count=10, MPI_INTEGER, root=7, comm=MPI_COMM_WORLD) failed
PMPI_Bcast(438)...................:
MPIDI_SHMGR_Gather_generic(391)...:
MPIDI_NM_mpi_bcast(161)...........:
MPIR_Bcast_intra_tree(227)........: Failure during collective
MPIR_Bcast_intra_tree(219)........:
MPIR_Bcast_intra_tree_generic(180): Failure during collective

I also get an error when I run the MPI benchmarks as:

mpirun -hosts master,node1 -n 2 -ppn 1 /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1

The error message is:

Abort(609312527) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(507)........................: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=1, new_comm=0x6de6e4) failed
PMPI_Comm_split(489)........................:
MPIR_Comm_split_impl(167)...................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(126):
MPIC_Sendrecv(344)..........................:
MPID_Isend(662).............................:
MPID_isend_unsafe(282)......................:
MPIDI_OFI_send_lightweight_request(106).....: (unknown)(): Other MPI error

I also tried installing Parallel Studio 19 Update 5, but the problem is still the same.

All the best,
Lorenzo
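One way to get more detail on the two-host failure might be to rerun the same benchmark with Intel MPI debug output enabled and a libfabric provider requested explicitly. This is only a sketch under assumptions: the provider name `tcp` and the `PROVIDER` and `DRY_RUN` switches are illustrative and were not mentioned in this thread, and whether a given provider is available depends on the libfabric build shipped with the installation.

```shell
#!/bin/sh
# Sketch: rerun the two-host IMB-MPI1 command from the post with extra diagnostics.
# Assumptions (not from the thread): the "tcp" provider and the PROVIDER/DRY_RUN switches.

IMB=/opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1
PROVIDER="${PROVIDER:-tcp}"   # libfabric provider to request; "sockets" is another candidate
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the command, anything else = also run it

# I_MPI_DEBUG raises Intel MPI verbosity; FI_PROVIDER asks libfabric for one provider.
CMD="mpirun -genv I_MPI_DEBUG 5 -genv FI_PROVIDER $PROVIDER -hosts master,node1 -n 2 -ppn 1 $IMB"

echo "$CMD"
if [ "$DRY_RUN" != "1" ]; then
    $CMD
fi
```

By default the script only prints the command. Setting `DRY_RUN=0` runs it, and the debug output near the start of the run should show which provider was actually selected.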
PrasanthD_intel
Moderator

Hi,

Thanks for reaching out to us.

We are working on this query and will get back to you.

 

Prasanth

dona__lorenzo
Beginner

Thank you for your help and for your fast reply.

Here are more details about my cluster.

The master node is also the login node, and it has two IP addresses, as shown below:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.101  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::ec4:7aff:fe78:73f4  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:78:73:f4  txqueuelen 1000  (Ethernet)
        RX packets 102348  bytes 25759571 (25.7 MB)
        RX errors 0  dropped 35  overruns 0  frame 0
        TX packets 35752  bytes 5759844 (5.7 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xc7d00000-c7d7ffff

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
        inet6 fe80::ec4:7aff:fe78:73f5  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:78:73:f5  txqueuelen 1000  (Ethernet)
        RX packets 95984811  bytes 103960161249 (103.9 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 98633593  bytes 109716265658 (109.7 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xc7c00000-c7c7ffff

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Loopback locale)
        RX packets 176450  bytes 1091561329 (1.0 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 176450  bytes 1091561329 (1.0 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

while the nodes have the following IP configuration:

eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 0c:c4:7a:76:17:1c  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xc7d00000-c7d7ffff

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
        inet6 fe80::8713:663e:8261:e7d4  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:76:17:1d  txqueuelen 1000  (Ethernet)
        RX packets 213840131  bytes 232703810855 (232.7 GB)
        RX errors 49  dropped 0  overruns 0  frame 36
        TX packets 210730735  bytes 225976171140 (225.9 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xc7c00000-c7c7ffff

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Loopback locale)
        RX packets 283037  bytes 1284137604 (1.2 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 283037  bytes 1284137604 (1.2 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
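The two listings show that the master has IPv4 addresses on both eth0 (192.168.1.101) and eth1 (10.0.0.1), while the node has one only on eth1 (10.0.0.2). To make that comparison easy to repeat on every host, a small sketch like the following could pull interface/address pairs out of `ifconfig`-style output. The helper name `list_inet` is hypothetical, and the sample text is an abridged copy of the master listing.

```shell
#!/bin/sh
# Sketch: print "interface address" for every IPv4 address in ifconfig-style text.
# list_inet is a hypothetical helper name; it reads the listing on stdin.
list_inet() {
    awk '
        # Header line such as "eth1: flags=...": remember the interface name.
        /^[A-Za-z0-9]+: flags=/ { iface = $1; sub(/:$/, "", iface) }
        # IPv4 line such as "inet 10.0.0.1 netmask ...": inet6 lines do not match.
        $1 == "inet"            { print iface, $2 }
    '
}

# Abridged copy of the master listing from the post.
master_listing='eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.101  netmask 255.255.255.0  broadcast 192.168.1.255
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0'

printf '%s\n' "$master_listing" | list_inet
```

On a live host the same function could be fed directly, for example `ifconfig | list_inet`, assuming the output has the same layout as in the post.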

Here is the list of hosts:

127.0.0.1       localhost
10.0.0.1        master
10.0.0.2        node1
10.0.0.3        node2
10.0.0.4        node3

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
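A quick consistency check on a host list like this one is to confirm that every cluster name maps to the 10.0.0.x network used on eth1. The sketch below assumes the list is the contents of a hosts file such as /etc/hosts. The helper name `hosts_outside_prefix` is hypothetical, and the sample is an abridged copy of the list in the post.

```shell
#!/bin/sh
# Sketch: list hosts-file names whose IPv4 address does not start with a given prefix.
# hosts_outside_prefix is a hypothetical helper; it reads hosts-format text on stdin.
hosts_outside_prefix() {
    awk -v p="$1" '
        NF < 2 || $1 ~ /^#/ { next }   # skip blank lines and comments
        $1 ~ /:/            { next }   # skip IPv6 entries
        # Print "name address" for each name whose address lacks the prefix.
        index($1, p) != 1   { for (i = 2; i <= NF; i++) print $i, $1 }
    '
}

# Abridged copy of the host list from the post.
hosts_listing='127.0.0.1       localhost
10.0.0.1        master
10.0.0.2        node1
10.0.0.3        node2
10.0.0.4        node3
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback'

printf '%s\n' "$hosts_listing" | hosts_outside_prefix "10.0.0."
```

For the list in the post, only `localhost` falls outside the prefix, so all four cluster names resolve to the eth1 network. On a live host the function could be run as `hosts_outside_prefix "10.0.0." < /etc/hosts`.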

I also tested parallel_studio_2017_update_4, and everything works fine with it.

All the best

lorenzo


Has the issue been solved? Did you try any of the latest updates? See https://software.intel.com/content/www/us/en/develop/articles/intel-parallel-studio-xe-supported-and...

