
PHI w/ MPICH (3.0.3) ch3:nemesis:scif

Eric_B_
Beginner
I'm trying to get MPICH (3.0.3) and SCIF working. I'm using the tests from osu_benchmarks (from the mvapich2 tarball) as a set of sanity checks, and I'm running into some unexpected errors. One example: running osu_mbw_mr works sometimes and then fails on the next try. The printout from two successive runs, as well as the hosts file, are below. Compiler is the latest (13.1.1) icc; latest MPSS (2-2.1.5889-14); CentOS 6.4.

This particular test should set up four pairs of processes, each pair with one process on the host and one on the Phi, and communicate between the two (ranks 0<->4, ...). Things seem to be more stable with one or two pairs of processes, but that's not really the desired use case... Thanks for any help!

First run:

[eborisch@rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr : -n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size          MB/s        Messages/s
1               0.23        230138.66
2               0.46        229579.06
4               0.92        231201.97
8               1.85        231515.16
16              3.63        226781.39
32              6.92        216285.18
64              12.65       197678.16
128             25.21       196946.20
256             50.46       197106.54
512             86.11       168184.28
1024            132.69      129577.13
2048            180.60      88183.67
4096            179.81      43898.89
8192            358.07      43710.21
16384           696.33      42500.74
32768           1364.41     41638.46
65536           2737.42     41769.74
131072          4657.86     35536.68
262144          6160.59     23500.77
524288          6584.39     12558.73
1048576         6690.91     6380.95
2097152         6782.58     3234.18
4194304         6789.25     1618.68

(Note: seems ~reasonable for pushing data one direction over 16x PCIe 2.0)

Second run:

[eborisch@rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr : -n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size          MB/s        Messages/s
0: 5: 00000051: 00000060: readv err 0
0: 5: 00000052: 00000060: readv err 0
0: 5: 00000053: 00000060: readv err 0
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(426)................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(283)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read failed with error 'Success')
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read failed with error 'Success')
MPIR_Barrier_impl(294)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read failed with error 'Success')
MPIR_Barrier_impl(308)...........:
MPIR_Bcast_impl(1369)............:
MPIR_Bcast_intra(1199)...........:
MPIR_Bcast_binomial(220).........: Failure during collective
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@mic0.local] HYDU_sock_write (./utils/sock/sock.c:291): write error (Broken pipe)
[proxy:0:1@mic0.local] stdoe_cb (./pm/pmiserv/pmip_cb.c:63): sock write error
[proxy:0:1@mic0.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@mic0.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@rt5] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@rt5] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@rt5] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@rt5] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

Here's the host file (unchanged between runs):

host:4
mic0:4 binding=user:4,8,12,16
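For reference, below is a minimal standalone sketch of the same communication pattern, useful for narrowing the problem down outside of the OSU suite. It assumes the mapping above (ranks 0..3 on the host, ranks 4..7 on the Phi, launched with the same mpiexec line and host file); the message size, iteration count, file name, and buffer layout are illustrative choices of mine, not the benchmark's actual code.

/* pairbw.c: minimal sketch of the pairwise host<->Phi pattern (not the OSU code).
 * Ranks 0..pairs-1 (host side) each send a window of messages to their peer
 * rank+pairs (Phi side), then everyone calls MPI_Barrier, the collective that
 * fails in the trace above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW   64               /* matches the benchmark's reported window size */
#define MSG_SIZE (64 * 1024)      /* illustrative message size */
#define ITERS    100              /* illustrative iteration count */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int pairs = nprocs / 2;
    char *buf = malloc((size_t)WINDOW * MSG_SIZE);   /* one slot per in-flight message */
    memset(buf, rank, (size_t)WINDOW * MSG_SIZE);
    MPI_Request req[WINDOW];

    for (int iter = 0; iter < ITERS; iter++) {
        if (rank < pairs) {                          /* host side: post a window of sends */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                          rank + pairs, w, MPI_COMM_WORLD, &req[w]);
        } else {                                     /* Phi side: post matching receives */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                          rank - pairs, w, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);                 /* same barrier that dies above */
    }

    if (rank == 0) printf("completed %d iterations without error\n", ITERS);
    free(buf);
    MPI_Finalize();
    return 0;
}

Built once for the host and once for the coprocessor (as with the native/ and mic/ binaries above) and launched with the same mpiexec line, it should show whether the intermittent readv/scif_read failures depend on the OSU code itself or simply on pushing several concurrent messages per pair through the ch3:nemesis:scif path.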
Frances_R_Intel
Employee

I'm going to look into this and get back to you - unless someone else comes up with a solution first.
