I'm having some issues with the latest Intel MPI and EFA on AWS instances.
I installed Intel MPI using the install script found elsewhere in the support forums
(https://software.intel.com/sites/default/files/managed/f4/92/install_impi.txt).
I also grabbed the latest libfabric source and built it.
The instance already had AWS's libfabric from the EFA setup install, but it is not in PATH/LD_LIBRARY_PATH for these tests.
[sheistan@compute-041249 ~]$ which mpiexec
/opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec
[sheistan@compute-041249 ~]$ which fi_info
/nasa/libfabric/latest/bin/fi_info
[sheistan@compute-041249 ~]$ fi_info -p efa
provider: efa
fabric: EFA-fe80::4c2:2aff:fec7:ce80
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::4c2:2aff:fec7:ce80
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::4c2:2aff:fec7:ce80
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
When running on a single node, everything works as expected:
[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2 ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1
[0] MPI startup(): libfabric provider: efa
compute-041249
compute-041249
pi is approximately: 3.1415926769620652 Relative Error is: -0.20387909E-05
Integration Wall Time = 0.005503 Seconds on 2 Processors for n = 10000000
But when two nodes are involved, it hangs, in this case in an mpi_barrier() call:
[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2 -ppn 1 ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1
[0] MPI startup(): libfabric provider: efa
compute-041116
compute-041249
^C[mpiexec@compute-041249] Sending Ctrl-C to processes as requested
[mpiexec@compute-041249] Press Ctrl-C again to force abort
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
pi_efa 0000000000404724 Unknown Unknown Unknown
libc-2.26.so 00007F80892447E0 Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D5A007 Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D5B170 Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D0B38D Unknown Unknown Unknown
libfabric.so.1.11 00007F8088D0A72E Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A7CAC90 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A0EBF7B Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A7F5B2F Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A3F42C0 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A052F7C Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A054138 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A17F4E2 Unknown Unknown Unknown
libmpi.so.12.0.0 00007F808A05446E MPI_Barrier Unknown Unknown
libmpifort.so.12. 00007F808AF1573C pmpi_barrier Unknown Unknown
pi_efa 0000000000402EF9 Unknown Unknown Unknown
pi_efa 0000000000402E52 Unknown Unknown Unknown
libc-2.26.so 00007F808923102A __libc_start_main Unknown Unknown
pi_efa 0000000000402D6A Unknown Unknown Unknown
Thoughts on something to try or something I missed?
thanks
s
As a follow-up to this: a rule allowing all outbound traffic to 0.0.0.0/0 is apparently not
the same as a rule allowing all traffic to the security group itself.
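For anyone hitting the same hang, here is a rough sketch of adding self-referencing rules with the AWS CLI. The security group ID is a made-up placeholder, the `run` helper and `DRY_RUN` switch are my own additions, and I am assuming the same group is attached to every EFA instance. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/bash
# Hypothetical security group ID: replace with the group attached to the EFA instances.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# All protocols (-1), with the security group itself as the peer.
PERM="IpProtocol=-1,UserIdGroupPairs=[{GroupId=${SG_ID}}]"

# Allow all inbound traffic from members of the same security group.
run aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --ip-permissions "$PERM"

# Allow all outbound traffic to members of the same security group.
run aws ec2 authorize-security-group-egress --group-id "$SG_ID" --ip-permissions "$PERM"
```

The point is that the peer in both rules is the group ID rather than a CIDR range, which matches the distinction described above.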
Hi Nasa Ames,
Could you please make sure that you have performed all the steps in the AWS EFA guide?
Also, please switch to the updated Intel MPI installation script.
After installing Intel MPI on each node, running "source /opt/intel/impi/2019.4.243/intel64/bin/mpivars.sh" should be enough to set up the Intel MPI environment.
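A quick way to confirm the environment on each node might look like the sketch below. The `check_tool` helper and the `MPIVARS` override are hypothetical names introduced for illustration; the default path is the one quoted above, and `fi_info -p efa` is the same check used earlier in the thread.

```shell
#!/bin/bash
# Path to mpivars.sh as quoted above; override MPIVARS if the install location differs.
MPIVARS="${MPIVARS:-/opt/intel/impi/2019.4.243/intel64/bin/mpivars.sh}"

# Hypothetical helper: report where a tool resolves from, or that it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: not found in PATH"
        return 1
    fi
}

# Load the Intel MPI environment if the script is present on this node.
if [ -r "$MPIVARS" ]; then
    . "$MPIVARS"
else
    echo "mpivars.sh not readable at $MPIVARS"
fi

# Show which mpiexec is picked up after sourcing.
check_tool mpiexec || true

# If fi_info is available, list the EFA provider entries it reports.
if check_tool fi_info; then
    fi_info -p efa
fi
```

Running this on every node in the host file should make it obvious if one node resolves a different mpiexec or has no EFA provider.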
Update:
An Intel MPI installation section has been added to the AWS EFA guide:
https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/efa-start.html#efa-start-impi