Bruce_C_1
Beginner
105 Views

MPI job error dapl_cma_active: CM ADDR ERROR

Hi,

When we run an MPI job, we get the following errors and the job fails.

Here is the MPI command:

mpirun -genv I_MPI_FABRICS shm:dapl -f mpi_hosts -perhost 48 -n 288 /path/binary

===================================================================================

...

alps8-21.cluster.nchc.org.tw:CMA:c7b:3327f700: 27333960 us(4008027 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c7f:652af700: 27843555 us(4008028 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c81:f86aa700: 27490415 us(4008032 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c83:6ed48700: 27477294 us(4008028 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c85:98223700: 27433706 us(4008031 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c69:ecb71700: 27398107 us(4008228 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-25.cluster.nchc.org.tw:CMA:80c2:435aa700: 27601304 us(4004013 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.7.46 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c6b:245aa700: 28623785 us(5187228 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-21.cluster.nchc.org.tw:CMA:c57:a5b87700: 29139082 us(4004020 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..
alps8-22.cluster.nchc.org.tw:CMA:a80c:82624700: 30619141 us(4005003 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.21 retry (9)..
alps8-22.cluster.nchc.org.tw:CMA:a80e:8ab41700: 30623757 us(4005005 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.21 retry (9)..
alps8-22.cluster.nchc.org.tw:CMA:a812:54b45700: 30645431 us(4005010 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.21 retry (9)..

===================================================================================

Could you please give us some clues about what is causing this?

Bruce

 

Bruce_C_1

Hi,

We also see the following errors:

===================================================================================

alps7-46.cluster.nchc.org.tw:CMA:868d:11861700: 64597581 us(12 us): dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 10.3.6.15,33671
alps7-46.cluster.nchc.org.tw:CMA:868f:6826b700: 64038582 us(4008020 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.6.15 retry (1)..
alps7-46.cluster.nchc.org.tw:CMA:868f:6826b700: 64038595 us(13 us): dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 10.3.6.15,33675
alps7-46.cluster.nchc.org.tw:CMA:8690:c079e700: 64559733 us(4008018 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.6.15 retry (1)..
alps7-46.cluster.nchc.org.tw:CMA:8690:c079e700: 64559743 us(10 us): dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 10.3.6.15,33677
alps7-46.cluster.nchc.org.tw:CMA:8691:88fd8700: 64118576 us(4008021 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.6.15 retry (1)..
alps7-46.cluster.nchc.org.tw:CMA:8691:88fd8700: 64118586 us(10 us): dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 10.3.6.15,33679
alps7-46.cluster.nchc.org.tw:CMA:8692:62f6c700: 64071787 us(4008027 us!!!): dapl_cma_active: CM ADDR ERROR: -> DST 10.3.6.15 retry (1)..

 

===================================================================================
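
For anyone triaging similar output, here is a minimal Python sketch that tallies which source host fails to reach which destination IP, which may help narrow down the nodes to check first. It is not part of Intel MPI or DAPL. The regular expression is an assumption based only on the two message shapes shown in this thread (`CM ADDR ERROR` and `ARP_ERR`), and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed line shape, taken only from the log excerpts in this thread:
#   <host>:CMA:<ids>: <t> us(<dt> us[!!!]): dapl_cma_active: <kind> ... DST <ip>
LINE_RE = re.compile(
    r"^(?P<host>\S+?):CMA:\S+\s+\d+ us\(\d+ us(?:!!!)?\): "
    r"dapl_cma_active: (?P<kind>CM ADDR ERROR|ARP_ERR).*?"
    r"DST (?P<dst>\d+\.\d+\.\d+\.\d+)"
)


def summarize(lines):
    """Count (source host, destination IP, error kind) triples."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the assumed shape are skipped
            counts[(match["host"], match["dst"], match["kind"])] += 1
    return counts


# A few lines copied verbatim from the posts above.
sample = [
    "alps8-21.cluster.nchc.org.tw:CMA:c7b:3327f700: 27333960 us(4008027 us!!!): "
    "dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.22 retry (10)..",
    "alps8-25.cluster.nchc.org.tw:CMA:80c2:435aa700: 27601304 us(4004013 us!!!): "
    "dapl_cma_active: CM ADDR ERROR: -> DST 10.3.7.46 retry (10)..",
    "alps8-22.cluster.nchc.org.tw:CMA:a80c:82624700: 30619141 us(4005003 us!!!): "
    "dapl_cma_active: CM ADDR ERROR: -> DST 10.3.8.21 retry (9)..",
    "alps7-46.cluster.nchc.org.tw:CMA:868d:11861700: 64597581 us(12 us): "
    "dapl_cma_active: ARP_ERR, retries(15) exhausted -> DST 10.3.6.15,33671",
    "alps7-46.cluster.nchc.org.tw:CMA:868f:6826b700: 64038582 us(4008020 us!!!): "
    "dapl_cma_active: CM ADDR ERROR: -> DST 10.3.6.15 retry (1)..",
]

counts = summarize(sample)
for (host, dst, kind), n in sorted(counts.items()):
    print(f"{host} -> {dst}  {kind}  x{n}")
```

To run it on a full job log, pass an open file instead of the sample list, for example `summarize(open("job.log"))`, where `job.log` is a placeholder for wherever the mpirun output was saved. The tally only shows which host pairs are affected; it does not diagnose the cause.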