Hi, I'm encountering an unexpected deadlock in the Fortran test program below, compiled with Parallel Studio XE 2017 Update 4 on an Amazon EC2 cluster (Linux):
$ mpiifort -traceback nbtest.f90 -o test.x
On a single node the program runs just fine, but on any more than one it deadlocks, which leads me to suspect an internode communication failure; my knowledge in this area is lacking, though. FYI, the test code is hardcoded to run on 16 MPI ranks.
Any help or insight is appreciated!
Danny
Code
program nbtest
  use mpi
  implicit none

  !***____________________ Definitions _______________
  integer, parameter :: r4 = SELECTED_REAL_KIND(6,37)
  integer :: irank
  integer, allocatable :: gstart1(:)
  integer, allocatable :: gend1(:)
  integer, allocatable :: gstartz(:)
  integer, allocatable :: gendz(:)
  integer, allocatable :: ind_fl(:)
  integer, allocatable :: blen(:), disp(:)
  integer, allocatable :: ddt_recv(:), ddt_send(:)
  real(kind=r4), allocatable :: tmp_array(:,:,:)
  real(kind=r4), allocatable :: tmp_in(:,:,:)
  integer :: cnt, i, j
  integer :: count_send, count_recv
  integer :: ssend
  integer :: srecv
  integer :: esend
  integer :: erecv
  integer :: erecv2, srecv2
  integer :: mpierr, ierr, old, typesize, typesize2, typesize3
  integer :: mpi_requests(2*16)
  integer :: mpi_status_arr(MPI_STATUS_SIZE,2*16)
  character(MPI_MAX_ERROR_STRING) :: string
  integer :: resultlen
  integer :: errorcode

  !***________Code___________________________
  !*_________initialize MPI__________________
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,irank,ierr)
  call MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN,ierr)

  allocate(gstart1(0:15), &
           gend1(0:15),   &
           gstartz(0:15), &
           gendz(0:15))
  gstart1(0) = 1
  gend1(0)   = 40
  gstartz(0) = 1
  gendz(0)   = 27
  do i = 2, 16
    gstart1(i-1) = gend1(i-2) + 1
    gend1(i-1)   = gend1(i-2) + 40
    gstartz(i-1) = gendz(i-2) + 1
    gendz(i-1)   = gendz(i-2) + 27
  end do

  allocate(ind_fl(15))
  cnt = 1
  do i = 1, 16
    if ( (i-1) == irank ) cycle
    ind_fl(cnt) = (i - 1)
    cnt = cnt + 1
  end do
  cnt = 1
  do i = 1, 16
    if ( (i-1) == irank ) cycle
    ind_fl(cnt) = (i - 1)
    cnt = cnt + 1
  end do

  !*_________new datatype__________________
  allocate(ddt_recv(16), ddt_send(16))
  allocate(blen(60), disp(60))
  call mpi_type_size(MPI_REAL,typesize,ierr)
  do i = 1, 15
    call mpi_type_contiguous(3240,MPI_REAL, &
         ddt_send(i),ierr)
    call mpi_type_commit(ddt_send(i),ierr)

    srecv2 = (gstartz(ind_fl(i))-1)*2+1
    erecv2 = gendz(ind_fl(i))*2
    blen(:) = erecv2 - srecv2 + 1
    do j = 1, 60
      disp(j) = (j-1)*(852) + srecv2 - 1
    end do
    call mpi_type_indexed(60,blen,disp,MPI_REAL, &
         ddt_recv(i),ierr)
    call mpi_type_commit(ddt_recv(i),ierr)
    old = ddt_recv(i)
    call mpi_type_create_resized(old,int(0,kind=MPI_ADDRESS_KIND), &
         int(51120*typesize,kind=MPI_ADDRESS_KIND), &
         ddt_recv(i),ierr)
    call mpi_type_free(old,ierr)
    call mpi_type_commit(ddt_recv(i),ierr)
  end do

  allocate(tmp_array(852,60,40))
  allocate(tmp_in(54,60,640))
  tmp_array = 0.0_r4
  tmp_in    = 0.0_r4

  ssend = gstart1(irank)
  esend = gend1(irank)
  cnt = 0
  do i = 1, 15
    srecv = gstart1(ind_fl(i))
    erecv = gend1(ind_fl(i))
    ! Number of datatype elements to send/receive (counts of ddt_send/ddt_recv, not bytes)
    count_send = erecv - srecv + 1
    count_recv = esend - ssend + 1
    cnt = cnt + 1
    call mpi_irecv(tmp_array,count_recv,ddt_recv(i), &
         ind_fl(i),ind_fl(i),MPI_COMM_WORLD,mpi_requests(cnt),ierr)
    cnt = cnt + 1
    call mpi_isend(tmp_in(:,:,srecv:erecv), &
         count_send,ddt_send(i),ind_fl(i), &
         irank,MPI_COMM_WORLD,mpi_requests(cnt),ierr)
  end do

  call mpi_waitall(cnt,mpi_requests(1:cnt),mpi_status_arr(:,1:cnt),ierr)
  if (ierr /= MPI_SUCCESS) then
    do i = 1, cnt
      errorcode = mpi_status_arr(MPI_ERROR,i)
      if (errorcode /= 0 .AND. errorcode /= MPI_ERR_PENDING) then
        call MPI_Error_string(errorcode,string,resultlen,mpierr)
        print *, "rank: ",irank, string
        !call MPI_Abort(MPI_COMM_WORLD,errorcode,ierr)
      end if
    end do
  end if

  deallocate(tmp_array)
  deallocate(tmp_in)
  print *, "great success"
  call MPI_FINALIZE(ierr)
end program nbtest
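For scale, a quick back-of-the-envelope check on what each request moves (my own arithmetic, assuming a 4-byte MPI_REAL):

send:          40 counts x 3240 reals x 4 bytes             = 518,400 bytes per peer
recv (data):   40 counts x (60 blocks x 54 reals) x 4 bytes = 518,400 bytes per peer
recv (extent): 40 counts x 51120 reals x 4 bytes            = 8,179,200 bytes = size of tmp_array(852,60,40)

So the send and receive types agree in size, and each rank has 15 such send/receive pairs (roughly 7.8 MB each way) outstanding at the MPI_Waitall.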
Running gdb on one of the processes during the deadlock:
(gdb) bt
#0 0x00002acb4c6bf733 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002acb4b496a2e in MPID_nem_tcp_connpoll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12
#2 0x00002acb4b496048 in MPID_nem_tcp_poll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12
#3 0x00002acb4b350020 in MPID_nem_network_poll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12
#4 0x00002acb4b0cc5f2 in PMPIDI_CH3I_Progress () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12
#5 0x00002acb4b50328f in PMPI_Waitall () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12
#6 0x00002acb4ad1d53f in pmpi_waitall_ (v1=0x1e, v2=0xb0c320, v3=0x0, ierr=0x2acb4c6bf733 <__select_nocancel+10>) at ../../src/binding/fortran/mpif_h/waitallf.c:275
#7 0x00000000004064b0 in MAIN__ ()
#8 0x000000000040331e in main ()
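For anyone reproducing this: the backtrace was captured by attaching gdb to one of the hung ranks, roughly along these lines (exact commands from memory; <pid> is the rank's process ID from ps):

$ ps -ef | grep test.x   # find the PID of a hung rank
$ gdb -p <pid>           # attach to the running process
(gdb) bt                 # backtrace shown above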
Output log after I kill the job:
$ mpirun -n 16 ./test.x
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
test.x 000000000040C12A Unknown Unknown Unknown
libpthread-2.17.s 00002BA8B42F95A0 Unknown Unknown Unknown
libmpi.so.12 00002BA8B3303EBF PMPIDI_CH3I_Progr Unknown Unknown
libmpi.so.12 00002BA8B373B28F PMPI_Waitall Unknown Unknown
libmpifort.so.12. 00002BA8B2F5553F pmpi_waitall Unknown Unknown
test.x 00000000004064B0 MAIN__ 129 nbtest.f90
test.x 000000000040331E Unknown Unknown Unknown
libc-2.17.so 00002BA8B4829C05 __libc_start_main Unknown Unknown
test.x 0000000000403229 Unknown Unknown Unknown
(repeated 15 times, once for each process)
Output with I_MPI_DEBUG=6:
[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 3 Build 20170405 (id: 17193)
[0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation. All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[12] MPI startup(): cannot open dynamic library libdat2.so.2
[7] MPI startup(): cannot open dynamic library libdat2.so.2
[10] MPI startup(): cannot open dynamic library libdat2.so.2
[13] MPI startup(): cannot open dynamic library libdat2.so.2
[4] MPI startup(): cannot open dynamic library libdat2.so.2
[9] MPI startup(): cannot open dynamic library libdat2.so.2
[14] MPI startup(): cannot open dynamic library libdat2.so.2
[5] MPI startup(): cannot open dynamic library libdat2.so.2
[11] MPI startup(): cannot open dynamic library libdat2.so.2
[15] MPI startup(): cannot open dynamic library libdat2.so.2
[6] MPI startup(): cannot open dynamic library libdat2.so.2
[8] MPI startup(): cannot open dynamic library libdat2.so.2
[0] MPI startup(): cannot open dynamic library libdat2.so.2
[3] MPI startup(): cannot open dynamic library libdat2.so.2
[2] MPI startup(): cannot open dynamic library libdat2.so.2
[4] MPI startup(): cannot open dynamic library libdat2.so
[7] MPI startup(): cannot open dynamic library libdat2.so
[8] MPI startup(): cannot open dynamic library libdat2.so
[9] MPI startup(): cannot open dynamic library libdat2.so
[6] MPI startup(): cannot open dynamic library libdat2.so
[10] MPI startup(): cannot open dynamic library libdat2.so
[13] MPI startup(): cannot open dynamic library libdat2.so
[0] MPI startup(): cannot open dynamic library libdat2.so
[15] MPI startup(): cannot open dynamic library libdat2.so
[3] MPI startup(): cannot open dynamic library libdat2.so
[12] MPI startup(): cannot open dynamic library libdat2.so
[4] MPI startup(): cannot open dynamic library libdat.so.1
[14] MPI startup(): cannot open dynamic library libdat2.so
[7] MPI startup(): cannot open dynamic library libdat.so.1
[5] MPI startup(): cannot open dynamic library libdat2.so
[8] MPI startup(): cannot open dynamic library libdat.so.1
[1] MPI startup(): cannot open dynamic library libdat2.so.2
[6] MPI startup(): cannot open dynamic library libdat.so.1
[9] MPI startup(): cannot open dynamic library libdat.so.1
[10] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so.1
[12] MPI startup(): cannot open dynamic library libdat.so.1
[4] MPI startup(): cannot open dynamic library libdat.so
[11] MPI startup(): cannot open dynamic library libdat2.so
[3] MPI startup(): cannot open dynamic library libdat.so.1
[13] MPI startup(): cannot open dynamic library libdat.so.1
[5] MPI startup(): cannot open dynamic library libdat.so.1
[15] MPI startup(): cannot open dynamic library libdat.so.1
[5] MPI startup(): cannot open dynamic library libdat.so
[7] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat2.so
[9] MPI startup(): cannot open dynamic library libdat.so
[8] MPI startup(): cannot open dynamic library libdat.so
[11] MPI startup(): cannot open dynamic library libdat.so.1
[6] MPI startup(): cannot open dynamic library libdat.so
[10] MPI startup(): cannot open dynamic library libdat.so
[14] MPI startup(): cannot open dynamic library libdat.so.1
[11] MPI startup(): cannot open dynamic library libdat.so
[13] MPI startup(): cannot open dynamic library libdat.so
[15] MPI startup(): cannot open dynamic library libdat.so
[12] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): cannot open dynamic library libdat.so
[14] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat.so.1
[3] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat.so
[2] MPI startup(): cannot open dynamic library libdat2.so
[2] MPI startup(): cannot open dynamic library libdat.so.1
[2] MPI startup(): cannot open dynamic library libdat.so
[4] MPI startup(): cannot load default tmi provider
[7] MPI startup(): cannot load default tmi provider
[5] MPI startup(): cannot load default tmi provider
[9] MPI startup(): cannot load default tmi provider
[0] MPI startup(): cannot load default tmi provider
[6] MPI startup(): cannot load default tmi provider
[10] MPI startup(): cannot load default tmi provider
[3] MPI startup(): cannot load default tmi provider
[15] MPI startup(): cannot load default tmi provider
[8] MPI startup(): cannot load default tmi provider
[1] MPI startup(): cannot load default tmi provider
[14] MPI startup(): cannot load default tmi provider
[11] MPI startup(): cannot load default tmi provider
[2] MPI startup(): cannot load default tmi provider
[12] MPI startup(): cannot load default tmi provider
[13] MPI startup(): cannot load default tmi provider
[12] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[4] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[9] ERROR - load_iblibrary(): [15] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[5] ERROR - load_iblibrary(): [0] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[10] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[1] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[3] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[13] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[7] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[2] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[6] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[8] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[11] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[14] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory
[0] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[4] MPI startup(): shm and tcp data transfer modes
[5] MPI startup(): shm and tcp data transfer modes
[7] MPI startup(): shm and tcp data transfer modes
[9] MPI startup(): shm and tcp data transfer modes
[8] MPI startup(): shm and tcp data transfer modes
[6] MPI startup(): shm and tcp data transfer modes
[10] MPI startup(): shm and tcp data transfer modes
[11] MPI startup(): shm and tcp data transfer modes
[12] MPI startup(): shm and tcp data transfer modes
[13] MPI startup(): shm and tcp data transfer modes
[14] MPI startup(): shm and tcp data transfer modes
[15] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): Device_reset_idx=1
[0] MPI startup(): Allgather: 4: 1-4 & 0-4
[0] MPI startup(): Allgather: 1: 5-11 & 0-4
[0] MPI startup(): Allgather: 4: 12-28 & 0-4
[0] MPI startup(): Allgather: 1: 29-1694 & 0-4
[0] MPI startup(): Allgather: 4: 1695-3413 & 0-4
[0] MPI startup(): Allgather: 1: 3414-513494 & 0-4
[0] MPI startup(): Allgather: 3: 513495-1244544 & 0-4
[0] MPI startup(): Allgather: 4: 0-2147483647 & 0-4
[0] MPI startup(): Allgather: 4: 1-16 & 5-16
[0] MPI startup(): Allgather: 1: 17-38 & 5-16
[0] MPI startup(): Allgather: 3: 0-2147483647 & 5-16
[0] MPI startup(): Allgather: 4: 1-8 & 17-2147483647
[0] MPI startup(): Allgather: 1: 9-23 & 17-2147483647
[0] MPI startup(): Allgather: 4: 24-35 & 17-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 17-2147483647
[0] MPI startup(): Allgatherv: 1: 0-3669 & 0-4
[0] MPI startup(): Allgatherv: 4: 3669-4949 & 0-4
[0] MPI startup(): Allgatherv: 1: 4949-17255 & 0-4
[0] MPI startup(): Allgatherv: 4: 17255-46775 & 0-4
[0] MPI startup(): Allgatherv: 3: 46775-836844 & 0-4
[0] MPI startup(): Allgatherv: 4: 0-2147483647 & 0-4
[0] MPI startup(): Allgatherv: 4: 0-10 & 5-16
[0] MPI startup(): Allgatherv: 1: 10-38 & 5-16
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 5-16
[0] MPI startup(): Allgatherv: 4: 0-8 & 17-2147483647
[0] MPI startup(): Allgatherv: 1: 8-21 & 17-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 17-2147483647
[0] MPI startup(): Allreduce: 5: 0-6 & 0-8
[0] MPI startup(): Allreduce: 7: 6-11 & 0-8
[0] MPI startup(): Allreduce: 5: 11-26 & 0-8
[0] MPI startup(): Allreduce: 4: 26-43 & 0-8
[0] MPI startup(): Allreduce: 5: 43-99 & 0-8
[0] MPI startup(): Allreduce: 1: 99-176 & 0-8
[0] MPI startup(): Allreduce: 6: 176-380 & 0-8
[0] MPI startup(): Allreduce: 2: 380-2967 & 0-8
[0] MPI startup(): Allreduce: 1: 2967-9460 & 0-8
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-8
[0] MPI startup(): Allreduce: 5: 0-95 & 9-16
[0] MPI startup(): Allreduce: 1: 95-301 & 9-16
[0] MPI startup(): Allreduce: 2: 301-2577 & 9-16
[0] MPI startup(): Allreduce: 6: 2577-5427 & 9-16
[0] MPI startup(): Allreduce: 1: 5427-10288 & 9-16
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 9-16
[0] MPI startup(): Allreduce: 6: 0-6 & 17-2147483647
[0] MPI startup(): Allreduce: 5: 6-11 & 17-2147483647
[0] MPI startup(): Allreduce: 6: 11-452 & 17-2147483647
[0] MPI startup(): Allreduce: 2: 452-2639 & 17-2147483647
[0] MPI startup(): Allreduce: 6: 2639-5627 & 17-2147483647
[0] MPI startup(): Allreduce: 1: 5627-9956 & 17-2147483647
[0] MPI startup(): Allreduce: 2: 9956-2587177 & 17-2147483647
[0] MPI startup(): Allreduce: 3: 0-2147483647 & 17-2147483647
[0] MPI startup(): Alltoall: 4: 1-16 & 0-8
[0] MPI startup(): Alltoall: 1: 17-69 & 0-8
[0] MPI startup(): Alltoall: 2: 70-1024 & 0-8
[0] MPI startup(): Alltoall: 2: 1024-52228 & 0-8
[0] MPI startup(): Alltoall: 4: 52229-74973 & 0-8
[0] MPI startup(): Alltoall: 2: 74974-131148 & 0-8
[0] MPI startup(): Alltoall: 3: 131149-335487 & 0-8
[0] MPI startup(): Alltoall: 4: 0-2147483647 & 0-8
[0] MPI startup(): Alltoall: 4: 1-16 & 9-16
[0] MPI startup(): Alltoall: 1: 17-40 & 9-16
[0] MPI startup(): Alltoall: 2: 41-497 & 9-16
[0] MPI startup(): Alltoall: 1: 498-547 & 9-16
[0] MPI startup(): Alltoall: 2: 548-1024 & 9-16
[0] MPI startup(): Alltoall: 2: 1024-69348 & 9-16
[0] MPI startup(): Alltoall: 4: 0-2147483647 & 9-16
[0] MPI startup(): Alltoall: 4: 0-1 & 17-2147483647
[0] MPI startup(): Alltoall: 1: 2-4 & 17-2147483647
[0] MPI startup(): Alltoall: 4: 5-24 & 17-2147483647
[0] MPI startup(): Alltoall: 2: 25-1024 & 17-2147483647
[0] MPI startup(): Alltoall: 2: 1024-20700 & 17-2147483647
[0] MPI startup(): Alltoall: 4: 20701-57414 & 17-2147483647
[0] MPI startup(): Alltoall: 3: 57415-66078 & 17-2147483647
[0] MPI startup(): Alltoall: 4: 0-2147483647 & 17-2147483647
[0] MPI startup(): Alltoallv: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 4: 1-29 & 0-8
[0] MPI startup(): Bcast: 7: 30-37 & 0-8
[0] MPI startup(): Bcast: 4: 38-543 & 0-8
[0] MPI startup(): Bcast: 6: 544-1682 & 0-8
[0] MPI startup(): Bcast: 4: 1683-2521 & 0-8
[0] MPI startup(): Bcast: 6: 2522-30075 & 0-8
[0] MPI startup(): Bcast: 7: 30076-34889 & 0-8
[0] MPI startup(): Bcast: 4: 34890-131072 & 0-8
[0] MPI startup(): Bcast: 6: 131072-409051 & 0-8
[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-8
[0] MPI startup(): Bcast: 4: 1-13 & 9-2147483647
[0] MPI startup(): Bcast: 1: 14-25 & 9-2147483647
[0] MPI startup(): Bcast: 4: 26-691 & 9-2147483647
[0] MPI startup(): Bcast: 6: 692-2367 & 9-2147483647
[0] MPI startup(): Bcast: 4: 2368-7952 & 9-2147483647
[0] MPI startup(): Bcast: 6: 7953-10407 & 9-2147483647
[0] MPI startup(): Bcast: 4: 10408-17900 & 9-2147483647
[0] MPI startup(): Bcast: 6: 17901-36385 & 9-2147483647
[0] MPI startup(): Bcast: 7: 36386-131072 & 9-2147483647
[0] MPI startup(): Bcast: 7: 0-2147483647 & 9-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 2: 1-3 & 0-8
[0] MPI startup(): Gather: 3: 4-4 & 0-8
[0] MPI startup(): Gather: 2: 5-66 & 0-8
[0] MPI startup(): Gather: 3: 67-174 & 0-8
[0] MPI startup(): Gather: 2: 175-478 & 0-8
[0] MPI startup(): Gather: 3: 479-531 & 0-8
[0] MPI startup(): Gather: 2: 532-2299 & 0-8
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-8
[0] MPI startup(): Gather: 2: 1-141 & 9-16
[0] MPI startup(): Gather: 3: 142-456 & 9-16
[0] MPI startup(): Gather: 2: 457-785 & 9-16
[0] MPI startup(): Gather: 3: 786-70794 & 9-16
[0] MPI startup(): Gather: 2: 70795-254351 & 9-16
[0] MPI startup(): Gather: 3: 0-2147483647 & 9-16
[0] MPI startup(): Gather: 2: 1-89 & 17-2147483647
[0] MPI startup(): Gather: 3: 90-472 & 17-2147483647
[0] MPI startup(): Gather: 2: 473-718 & 17-2147483647
[0] MPI startup(): Gather: 3: 719-16460 & 17-2147483647
[0] MPI startup(): Gather: 2: 0-2147483647 & 17-2147483647
[0] MPI startup(): Gatherv: 2: 0-2147483647 & 0-16
[0] MPI startup(): Gatherv: 2: 0-2147483647 & 17-2147483647
[0] MPI startup(): Reduce_scatter: 5: 0-5 & 0-4
[0] MPI startup(): Reduce_scatter: 1: 5-192 & 0-4
[0] MPI startup(): Reduce_scatter: 3: 192-349 & 0-4
[0] MPI startup(): Reduce_scatter: 1: 349-3268 & 0-4
[0] MPI startup(): Reduce_scatter: 3: 3268-71356 & 0-4
[0] MPI startup(): Reduce_scatter: 2: 71356-513868 & 0-4
[0] MPI startup(): Reduce_scatter: 5: 513868-731452 & 0-4
[0] MPI startup(): Reduce_scatter: 2: 731452-1746615 & 0-4
[0] MPI startup(): Reduce_scatter: 5: 1746615-2485015 & 0-4
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-4
[0] MPI startup(): Reduce_scatter: 5: 0-5 & 5-16
[0] MPI startup(): Reduce_scatter: 1: 5-59 & 5-16
[0] MPI startup(): Reduce_scatter: 5: 59-99 & 5-16
[0] MPI startup(): Reduce_scatter: 3: 99-198 & 5-16
[0] MPI startup(): Reduce_scatter: 1: 198-360 & 5-16
[0] MPI startup(): Reduce_scatter: 3: 360-3606 & 5-16
[0] MPI startup(): Reduce_scatter: 2: 3606-4631 & 5-16
[0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 5-16
[0] MPI startup(): Reduce_scatter: 5: 0-22 & 17-2147483647
[0] MPI startup(): Reduce_scatter: 1: 22-44 & 17-2147483647
[0] MPI startup(): Reduce_scatter: 5: 44-278 & 17-2147483647
[0] MPI startup(): Reduce_scatter: 3: 278-3517 & 17-2147483647
[0] MPI startup(): Reduce_scatter: 5: 3517-4408 & 17-2147483647
[0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 17-2147483647
[0] MPI startup(): Reduce: 4: 4-5 & 0-4
[0] MPI startup(): Reduce: 1: 6-59 & 0-4
[0] MPI startup(): Reduce: 2: 60-188 & 0-4
[0] MPI startup(): Reduce: 6: 189-362 & 0-4
[0] MPI startup(): Reduce: 2: 363-7776 & 0-4
[0] MPI startup(): Reduce: 5: 7777-151371 & 0-4
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-4
[0] MPI startup(): Reduce: 4: 4-60 & 5-16
[0] MPI startup(): Reduce: 3: 61-88 & 5-16
[0] MPI startup(): Reduce: 4: 89-245 & 5-16
[0] MPI startup(): Reduce: 3: 246-256 & 5-16
[0] MPI startup(): Reduce: 4: 257-8192 & 5-16
[0] MPI startup(): Reduce: 3: 8192-1048576 & 5-16
[0] MPI startup(): Reduce: 3: 0-2147483647 & 5-16
[0] MPI startup(): Reduce: 4: 4-8192 & 17-2147483647
[0] MPI startup(): Reduce: 3: 8192-1048576 & 17-2147483647
[0] MPI startup(): Reduce: 3: 0-2147483647 & 17-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 2: 1-7 & 0-16
[0] MPI startup(): Scatter: 3: 8-9 & 0-16
[0] MPI startup(): Scatter: 2: 10-64 & 0-16
[0] MPI startup(): Scatter: 3: 65-372 & 0-16
[0] MPI startup(): Scatter: 2: 373-811 & 0-16
[0] MPI startup(): Scatter: 3: 812-115993 & 0-16
[0] MPI startup(): Scatter: 2: 115994-173348 & 0-16
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-16
[0] MPI startup(): Scatter: 1: 1-1 & 17-2147483647
[0] MPI startup(): Scatter: 2: 2-76 & 17-2147483647
[0] MPI startup(): Scatter: 3: 77-435 & 17-2147483647
[0] MPI startup(): Scatter: 2: 436-608 & 17-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 17-2147483647
[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647
[5] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[1] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[7] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[2] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[6] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[3] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[13] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[4] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[9] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[14] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[11] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[15] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[8] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[12] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[10] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 10691 ip-10-0-0-189 0
[0] MPI startup(): 1 10692 ip-10-0-0-189 1
[0] MPI startup(): 2 10693 ip-10-0-0-189 2
[0] MPI startup(): 3 10694 ip-10-0-0-189 3
[0] MPI startup(): 4 10320 ip-10-0-0-174 0
[0] MPI startup(): 5 10321 ip-10-0-0-174 1
[0] MPI startup(): 6 10322 ip-10-0-0-174 2
[0] MPI startup(): 7 10323 ip-10-0-0-174 3
[0] MPI startup(): 8 10273 ip-10-0-0-104 0
[0] MPI startup(): 9 10274 ip-10-0-0-104 1
[0] MPI startup(): 10 10275 ip-10-0-0-104 2
[0] MPI startup(): 11 10276 ip-10-0-0-104 3
[0] MPI startup(): 12 10312 ip-10-0-0-158 0
[0] MPI startup(): 13 10313 ip-10-0-0-158 1
[0] MPI startup(): 14 10314 ip-10-0-0-158 2
[0] MPI startup(): 15 10315 ip-10-0-0-158 3
[0] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=6
[0] MPI startup(): I_MPI_HYDRA_UUID=bb290000-2b37-e5b2-065d-050000bd0a00
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 1,2 2,3 3
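For reference, the extra startup output above comes from exporting the debug level before launching, e.g.:

$ export I_MPI_DEBUG=6
$ mpirun -n 16 ./test.x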
After further troubleshooting, I believe the issue is related to the size of the derived datatypes my program creates and sends between nodes over TCP. The program runs fine with derived datatypes of the following sizes:
!** Datatype sent
call mpi_type_contiguous(54, MPI_REAL, ddt_send(i), ierr)
call mpi_type_commit(ddt_send(i), ierr)

!** Datatype received
call mpi_type_indexed(40, **54 elements per block**, &
     **array of displacements spanning 0-25110 to 810-31050 depending on the process**, &
     MPI_REAL, ddt_recv(i), ierr)
call mpi_type_commit(ddt_recv(i), ierr)
old = ddt_recv(i)
call mpi_type_create_resized(old, int(0,kind=MPI_ADDRESS_KIND), &
     int(3456*40,kind=MPI_ADDRESS_KIND), ddt_recv(i), ierr)
call mpi_type_commit(ddt_recv(i), ierr)
The program hangs at MPI_Waitall when working with these sizes:
!** Datatype sent
call mpi_type_contiguous(54*60, MPI_REAL, ddt_send(i), ierr)
call mpi_type_commit(ddt_send(i), ierr)

!** Datatype received
call mpi_type_indexed(60, **54 elements per block**, &
     **array of displacements spanning 0-30240 to 810-31050 depending on the process**, &
     MPI_REAL, ddt_recv(i), ierr)
call mpi_type_commit(ddt_recv(i), ierr)
old = ddt_recv(i)
call mpi_type_create_resized(old, int(0,kind=MPI_ADDRESS_KIND), &
     int(3456*60,kind=MPI_ADDRESS_KIND), ddt_recv(i), ierr)
call mpi_type_commit(ddt_recv(i), ierr)
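In raw numbers (my own arithmetic, assuming 4-byte reals): the working case pairs a 54-real (216-byte) contiguous send type with a receive type resized to a 3456*40 = 138,240-byte extent, while the hanging case pairs a 54*60 = 3240-real (12,960-byte) send type with a 3456*60 = 207,360-byte receive extent.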
The program runs successfully with OpenMPI. Additionally, MPI_Test shows that all of the MPI_Isend requests complete, but only about 2/3 of the matching MPI_Irecv requests do.
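Roughly, the MPI_Test check was a polling loop like the sketch below, placed ahead of the mpi_waitall in the test program (sketch only, not the exact instrumentation; flag, k, and stat are extra locals that would go with the other declarations):

! Poll each outstanding request and report the stragglers.
! In the exchange loop above, odd entries of mpi_requests are the
! receives and even entries are the sends.
logical :: flag
integer :: k, stat(MPI_STATUS_SIZE)

do k = 1, cnt
  call MPI_Test(mpi_requests(k), flag, stat, ierr)
  if (.not. flag) then
    print *, 'rank', irank, ': request', k, 'still pending'
  end if
end do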
Perhaps it's a TCP tuning issue specific to Intel MPI? Any pointers on proper TCP settings? Here are my current settings:
$ sudo sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
kernel.hostname = ip-10-0-0-231
net.core.netdev_max_backlog = 1000000
net.core.rmem_default = 124928
net.core.rmem_max = 67108864
net.core.wmem_default = 124928
net.core.wmem_max = 67108864
net.ipv4.tcp_keepalive_time = 1800
net.ipv4.tcp_mem = 12184608 16246144 24369216
net.ipv4.tcp_rmem = 4194304 8388608 67108864
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_wmem = 4194304 8388608 67108864
