Hello,

We have bought the Intel Fortran compiler and the Intel MPI library to gain performance on a specific MPI run. We have 6 Opteron servers (2 processors of 8 cores each) with a QLogic switch and adapters.

We try to launch the job with the following command:

[bash]mpirun -genv I_MPI_FABRICS shm:tmi -n 48 ./prg.exe[/bash]

But we get this error message:

[bash]
hpc-node3:10.1.ips_proto_connect: Couldn't connect to hpc-node2 (LID=0x0002:14.1). Time elapsed 00:00:30. Still trying...
hpc-node3:15.0.ips_proto_connect: Couldn't connect to hpc-node2 (LID=0x0002:14.1). Time elapsed 00:00:30. Still trying...
hpc-node3:14.0.ips_proto_connect: Couldn't connect to hpc-node2 (LID=0x0002:14.1). Time elapsed 00:00:30. Still trying...
hpc-node3:11.0.ips_proto_connect: Couldn't connect to hpc-node2 (LID=0x0002:14.1). Time elapsed 00:00:30. Still trying...
hpc-node3:12.1.ips_proto_connect: Couldn't connect to hpc-node2 (LID=0x0002:14.1). Time elapsed 00:00:30. Still trying...
hpc-node1:10.0.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:15.1). Time elapsed 00:00:30. Still trying...
hpc-node2:15.1.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:13.0). Time elapsed 00:00:30. Still trying...
hpc-node1:11.1.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:13.1). Time elapsed 00:00:30. Still trying...
hpc-node1:10.0.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:15.1). Time elapsed 00:01:00. Still trying...
hpc-node2:15.1.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:13.0). Time elapsed 00:01:00. Still trying...
hpc-node1:11.1.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:13.1). Time elapsed 00:01:00. Still trying...
hpc-node1:10.0.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:15.1). Time elapsed 00:01:30. Still trying...
hpc-node2:15.1.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:13.0). Time elapsed 00:01:30. Still trying...
hpc-node1:11.1.ips_proto_connect: Couldn't connect to hpc-node3 (LID=0x0003:13.1). Time elapsed 00:01:30. Still trying...
MPID_nem_tmi_vc_connect: tmi_connect returns 11
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(527)....: Initialization failed
MPID_Init(171)...........: channel initialization failed
MPIDI_CH3_Init(86).......:
MPIDI_CH3_VC_Init(200)...:
MPID_nem_vc_init(1521)...:
MPID_nem_tmi_vc_init(602):
(unknown)(): Other MPI error
MPID_nem_tmi_vc_connect: tmi_connect returns 11
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(527)....: Initialization failed
MPID_Init(171)...........: channel initialization failed
MPIDI_CH3_Init(86).......:
MPIDI_CH3_VC_Init(200)...:
MPID_nem_vc_init(1521)...:
MPID_nem_tmi_vc_init(602):
(unknown)(): Other MPI error
MPID_nem_tmi_vc_connect: tmi_connect returns 11
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(527)....: Initialization failed
MPID_Init(171)...........: channel initialization failed
MPIDI_CH3_Init(86).......:
MPIDI_CH3_VC_Init(200)...:
MPID_nem_vc_init(1521)...:
MPID_nem_tmi_vc_init(602):
(unknown)(): Other MPI error
rank 31 in job 1 hpc-node1_55838 caused collective abort of all ranks
  exit status of rank 31: killed by signal 9
rank 14 in job 1 hpc-node1_55838 caused collective abort of all ranks
  exit status of rank 14: return code 1
rank 1 in job 1 hpc-node1_55838 caused collective abort of all ranks
  exit status of rank 1: return code 1
[/bash]

If I launch the job over 2 nodes, sometimes the job starts and finishes normally. If I increase the number of nodes (from 3 to 6), it crashes every time. The crash example above was generated by a three-node run. The nodes concerned by the "couldn't connect" problem are not always the same ones: we cannot isolate a specific node that could be the source of the problem.

The file /etc/tmi.conf contains the following:

[bash]
# TMI provider configuration
#
# format of each line:
# <name> <version> <path/to/library> <string-arguments>
#
# Notice: the string arguments must have at least one character inside
#
mx 1.0 libtmip_mx.so " "    # comments ok
psm 1.0 /opt/intel/impi/4.0.1.007/intel64/lib/libtmip_psm.so " "
[/bash]

The job runs well when compiled with gfortran and executed by Open MPI in psm mode.

I don't know what to do. Does somebody have any ideas?

Regards.
It may be an issue in the PSM library. I would suggest starting with a smaller-scale run (say, np=8) and using I_MPI_FABRICS=tmi only to see if the issue still exists. If not, gradually scale up.
The TMI library may have some limitations on the number of connections.
To see which provider has been chosen, add I_MPI_DEBUG=5 to the command line.
To change the provider you can use the I_MPI_TMI_PROVIDER environment variable (set it to mx or psm).
Setting TMI_DEBUG to a non-zero value enables the output of various debugging messages.
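For example, a reduced-scale debug run over TMI only could look like this (a sketch: the rank count of 8 and the psm provider value are just the suggestions above, not a verified configuration for your cluster):
[bash]
# smaller-scale run over the tmi fabric only, with extra debug output
mpirun -genv I_MPI_FABRICS tmi \
       -genv I_MPI_TMI_PROVIDER psm \
       -genv I_MPI_DEBUG 5 \
       -genv TMI_DEBUG 1 \
       -n 8 ./prg.exe
[/bash]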
I_MPI_FABRICS set to shm:tcp will use eth0 by default. To use InfiniBand you need to have the DAPL library and a properly configured /etc/dat.conf, and to use I_MPI_FABRICS=shm:dapl.
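A DAPL run could then be launched roughly like this (a sketch; it assumes /etc/dat.conf already contains a working provider entry for your adapters):
[bash]
# run over shared memory within a node and DAPL between nodes
mpirun -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DEBUG 5 -n 48 ./prg.exe
[/bash]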
Please let us know if you think there is an issue in the TMI library and how we can reproduce it.
Regards!
Dmitry
It seems that the PSM library doesn't scale well with the psm_ep_connect() call. You can increase the timeout value by setting the environment variable TMI_PSM_TIMEOUT. The default value is 120 (seconds).
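For instance, to double the timeout (a sketch; 240 is just an arbitrary larger value, not a recommended setting):
[bash]
# raise the PSM connection timeout from the default 120 s to 240 s
mpirun -genv I_MPI_FABRICS shm:tmi -genv TMI_PSM_TIMEOUT 240 -n 48 ./prg.exe
[/bash]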
To use IPoIB you need to configure it properly. The IB interface needs to have an IP address. You can check whether the ib interface has an IP address with the 'ifconfig -a' command. It should look like:
inet addr:192.168.2.20
To use it in Intel MPI Library you need to add '-genv I_MPI_FABRICS shm:tcp' and '-genv I_MPI_NETMASK=ib'.
Setting I_MPI_DEBUG=2 may help you to check what fabric has been chosen at run-time.
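Put together, an IPoIB run might look like this (a sketch; the rank count is taken from the original command and the ib netmask value is the one mentioned above):
[bash]
# run over TCP, but bind to the IPoIB interface instead of eth0
mpirun -genv I_MPI_FABRICS shm:tcp -genv I_MPI_NETMASK ib -genv I_MPI_DEBUG 2 -n 48 ./prg.exe
[/bash]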
BTW: you may also use the SDP (Sockets Direct Protocol) library (if you have one) to improve performance. You just need to add '-genv LD_PRELOAD libsdp.so' to the mpiexec or mpirun command line.
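For example, adding the preload to the IPoIB command above (a sketch; it assumes libsdp.so is installed and resolvable on every node):
[bash]
# preload the SDP library so socket traffic goes over IB instead of plain TCP
mpirun -genv LD_PRELOAD libsdp.so -genv I_MPI_FABRICS shm:tcp -genv I_MPI_NETMASK ib -n 48 ./prg.exe
[/bash]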
Regards!
Dmitry
Hi, we are getting "cannot load default tmi provider".
/etc/tmi.conf has following
# TMI provider configuration
#
# format of each line:
# <name> <version> <path/to/library> <string-arguments>
#
# Notice: the string arguments must have at least one character inside
#
mx 1.0 libtmip_mx.so " " # comments ok
psm 1.0 libtmip_psm.so " "
We are using QLogic cards and Intel MPI 4.1.0. How can we troubleshoot further?
best rgds
amit
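A first check worth making (a sketch, not a confirmed fix: the install prefix below is an assumed default for Intel MPI 4.1.0 and should be adjusted to the actual installation) is whether the provider libraries listed in /etc/tmi.conf can be resolved by the dynamic loader on every node:
[bash]
# verify that the PSM provider library named in /etc/tmi.conf can be found
# and that its own dependencies (the QLogic PSM runtime) resolve
# (the install prefix is an assumed default; adjust to your setup)
ls -l /opt/intel/impi/4.1.0/intel64/lib/libtmip_psm.so
ldd /opt/intel/impi/4.1.0/intel64/lib/libtmip_psm.so
echo $LD_LIBRARY_PATH    # should include the Intel MPI lib directory
[/bash]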
