Amirsaman_M_
Beginner
248 Views

Issue with Intel MPI library on Microsoft Azure machines

Hi,

I am trying to set up RDMA between two Azure VMs using Intel's MPI library (v5.1.1.109). Each machine can connect to the other over ssh, and when I run the pingpong utility over TCP as follows, I get latency numbers without any errors:

/opt/intel/impi/5.1.1.109/bin64/mpirun -hosts 10.0.0.5,10.0.0.6 -ppn 1 -n 2 -env I_MPI_FABRICS tcp -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 /opt/intel/impi/5.1.1.109/bin64/IMB-MPI1 pingpong

However, if I run the pingpong utility over DAPL to get latency numbers for RDMA over IB, I get the following error:

/opt/intel/impi/5.1.1.109/bin64/mpirun -hosts 10.0.0.5,10.0.0.6 -ppn 1 -n 2 -env I_MPI_FABRICS dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 /opt/intel/impi/5.1.1.109/bin64/IMB-MPI1 pingpong
active-copy-1:b9a:438d8700: 4006660 us(4006660 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (15)..
active-copy-1:b9a:438d8700: 8014586 us(4007926 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (14)..
active-copy-1:b9a:438d8700: 12022590 us(4008004 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (13)..
active-copy-1:b9a:438d8700: 16030610 us(4008020 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (12)..
active-copy-1:b9a:438d8700: 20030594 us(3999984 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (11)..
active-copy-1:b9a:438d8700: 24038599 us(4008005 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (10)..
active-copy-1:b9a:438d8700: 28046628 us(4008029 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (9)..
active-copy-1:b9a:438d8700: 32054598 us(4007970 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (8)..
active-copy-1:b9a:438d8700: 36062598 us(4008000 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (7)..
active-copy-1:b9a:438d8700: 40070580 us(4007982 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (6)..
active-copy-1:b9a:438d8700: 44078630 us(4008050 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (5)..
active-copy-1:b9a:438d8700: 48086598 us(4007968 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (4)..
active-copy-1:b9a:438d8700: 52094611 us(4008013 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (3)..
active-copy-1:b9a:438d8700: 56102588 us(4007977 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (2)..
active-copy-1:b9a:438d8700: 60110613 us(4008025 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 172.16.1.193 retry (1)..
active-copy-1:b9a:438d8700: 60110625 us(12 us): dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 172.16.1.193,3313
[0:10.0.0.5] unexpected DAPL event 0x4008
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(784):
MPID_Init(1326)......: channel initialization failed
MPIDI_CH3_Init(141)..:
(unknown)(): Internal MPI error!

I tried disabling the firewall and running the utility from the other machine, but neither helped. However, if I set both hosts to the IP address of the local machine, I do get latency numbers. I suspect there is something wrong with the interface or with the way these machines try to find each other, but I have no idea what the fix could be. Any idea what is going wrong here?
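
One thing the log above does show is that DAPL is trying to reach 172.16.1.193 rather than either of the 10.0.0.x addresses passed to -hosts. A minimal shell sketch for pulling the destination addresses out of such output follows. The function name dapl_destinations is hypothetical, and the sketch only assumes the "DST <address>" pattern seen in the log lines above.

```shell
# Print each distinct address that appears after "DST " in DAPL log output.
# Reads the log on stdin; lines without a "DST " field are ignored.
dapl_destinations() {
    sed -n 's/.*DST \([0-9.]*\).*/\1/p' | sort -u
}
```

To use it, pipe the output of the failing mpirun command (with 2>&1 appended) into dapl_destinations and compare the result with the addresses given to -hosts.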

4 Replies
Artem_R_Intel1
Employee

Hello Amirsaman,

Why do you use I_MPI_DAPL_PROVIDER=ofa-v2-ib0? Could you please try to run the same scenario without the I_MPI_DAPL_PROVIDER variable (that is, with the default DAPL provider)?

Amirsaman_M_
Beginner

Artem, thanks for your reply. I tried running the tool without I_MPI_DAPL_PROVIDER and now I am getting the following error:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 2879 RUNNING AT 10.0.0.6
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Regardless of which machine I run the command from, I get the same error (the reported IP is the same). I am not familiar with MPI, but I guess using the default value for I_MPI_DAPL_PROVIDER is causing a segmentation fault.
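
As a quick check on that guess, the shell can translate a signal number into its name. This is only a sketch: it assumes the launcher's "EXIT CODE: 11" is the number of the signal that killed the process, and signal_name is a hypothetical helper.

```shell
# Translate a signal number into its name using the shell's kill builtin.
signal_name() {
    kill -l "$1"
}

signal_name 11
```

On Linux this prints SEGV, which is the segmentation fault signal.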

Amirsaman_M_
Beginner

I found the fix. It was my mistake to create the VMs in different availability sets. After moving them into the same availability set, I am able to run the pingpong utility.
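
For anyone who wants to check this from the command line, here is a rough sketch. It assumes the current Azure CLI (az) is installed and logged in and that both VMs live in the same resource group, which may not match how the VMs in this thread were deployed. The function names and the arguments myResourceGroup, vm1 and vm2 are hypothetical.

```shell
# Print the availability set ID of a VM (assumed to be empty if it has none).
# Arguments: resource group, VM name.
vm_availability_set() {
    az vm show --resource-group "$1" --name "$2" \
        --query "availabilitySet.id" --output tsv
}

# Succeed only if both VMs report the same, non-empty availability set.
# Arguments: resource group, first VM name, second VM name.
same_availability_set() {
    # Resource IDs can differ in letter case, so compare in lowercase.
    a=$(vm_availability_set "$1" "$2" | tr '[:upper:]' '[:lower:]')
    b=$(vm_availability_set "$1" "$3" | tr '[:upper:]' '[:lower:]')
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Example: same_availability_set myResourceGroup vm1 vm2 && echo "same availability set"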

Mark_W_3
Beginner

I am also having a problem using RDMA on two Azure A8 VMs running SLES HPC 12. In my case I DO have them both in the same availability set, and I am getting a different error:

> /opt/intel/impi/5.0.3.048/bin64/mpirun -hosts 10.0.0.4,10.0.0.5 -ppn 1 -n 2 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DYNAMIC_CONNECTION=0 -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 /opt/intel/impi_latest/bin64/IMB-MPI1 pingpong
sshvm1:eb0:cdcf9b40: 13254 us(13254 us):  dapl_rdma_accept: ERR -1 Input/output error
sshvm1:eb0:cdcf9b40: 13271 us(17 us):  DAPL ERR accept Input/output error
[1:10.0.0.5][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:622] error(0x40000): ofa-v2-ib0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c at line 622: 0
internal ABORT - process 0

Without the provider setting:

> /opt/intel/impi/5.0.3.048/bin64/mpirun -hosts 10.0.0.4,10.0.0.5 -ppn 1 -n 2 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DYNAMIC_CONNECTION=0 /opt/intel/impi_latest/bin64/IMB-MPI1 pingpong

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3391 RUNNING AT 10.0.0.4
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
