Hi guys,
I have a SLURM cluster set up with Intel MPI and Ansys CFX.
Here are my settings for the jobs:
export I_MPI_DEBUG=5
export PSM_SHAREDCONTEXTS=1
export PSM_RANKS_PER_CONTEXT=4
export TMI_CONFIG=/etc/tmi.conf
export IPATH_NO_CPUAFFINITY=1
export I_MPI_DEVICE=rddsm
export I_MPI_FALLBACK_DEVICE=disable
export I_MPI_PLATFORM=bdw
export SLURM_CPU_BIND=none
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm
export I_MPI_FALLBACK=1
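(For context, these exports sit in the SLURM batch script; a rough sketch is below. The module name and the solver launch line are placeholders rather than my actual script; the node and task counts match the 16-rank run shown further down.)
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
module load intelmpi/5.0.3
# ... the exports listed above go here ...
# <CFX solver launch command>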
I also have the Intel MPI 5.0.3 module loaded, under CentOS 7.
The simulation starts, but the traffic does not go through the ib0 interfaces.
This is the output from the debug:
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm and tmi data transfer modes
[8] MPI startup(): shm and tmi data transfer modes
[2] MPI startup(): shm and tmi data transfer modes
[10] MPI startup(): shm and tmi data transfer modes
[4] MPI startup(): shm and tmi data transfer modes
[12] MPI startup(): shm and tmi data transfer modes
[1] MPI startup(): shm and tmi data transfer modes
[9] MPI startup(): shm and tmi data transfer modes
[3] MPI startup(): shm and tmi data transfer modes
[15] MPI startup(): shm and tmi data transfer modes
[6] MPI startup(): shm and tmi data transfer modes
[14] MPI startup(): shm and tmi data transfer modes
[5] MPI startup(): shm and tmi data transfer modes
[11] MPI startup(): shm and tmi data transfer modes
[7] MPI startup(): shm and tmi data transfer modes
[13] MPI startup(): shm and tmi data transfer modes
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 12614 qingclinf-01.hpc.cluster {0,1,2,20,21}
[0] MPI startup(): 1 12615 qingclinf-01.hpc.cluster {3,4,22,23,24}
[0] MPI startup(): 2 12616 qingclinf-01.hpc.cluster {5,6,7,25,26}
[0] MPI startup(): 3 12617 qingclinf-01.hpc.cluster {8,9,27,28,29}
[0] MPI startup(): 4 12618 qingclinf-01.hpc.cluster {10,11,12,30,31}
[0] MPI startup(): 5 12619 qingclinf-01.hpc.cluster {13,14,32,33,34}
[0] MPI startup(): 6 12620 qingclinf-01.hpc.cluster {15,16,17,35,36}
[0] MPI startup(): 7 12621 qingclinf-01.hpc.cluster {18,19,37,38,39}
[0] MPI startup(): 8 12441 qingclinf-02.hpc.cluster {0,1,2,20,21}
[0] MPI startup(): 9 12442 qingclinf-02.hpc.cluster {3,4,22,23,24}
[0] MPI startup(): 10 12443 qingclinf-02.hpc.cluster {5,6,7,25,26}
[0] MPI startup(): 11 12444 qingclinf-02.hpc.cluster {8,9,27,28,29}
[0] MPI startup(): 12 12445 qingclinf-02.hpc.cluster {10,11,12,30,31}
[0] MPI startup(): 13 12446 qingclinf-02.hpc.cluster {13,14,32,33,34}
[0] MPI startup(): 14 12447 qingclinf-02.hpc.cluster {15,16,17,35,36}
[0] MPI startup(): 15 12448 qingclinf-02.hpc.cluster {18,19,37,38,39}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tmi
[0] MPI startup(): I_MPI_FALLBACK=1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,21,21,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=qib0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=8:0 0,1 3,2 5,3 8,4 10,5 13,6 15,7 18
[0] MPI startup(): I_MPI_PLATFORM=auto
[0] MPI startup(): I_MPI_TMI_PROVIDER=psm
But there is no traffic over InfiniBand; this is the ifconfig output for ib0:
inet 10.0.2.1 netmask 255.255.255.0 broadcast 10.0.2.255
inet6 fe80::211:7500:6e:de10 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 121 bytes 23835 (23.2 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 118 bytes 22643 (22.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
I did chmod 666 on /dev/ipath and /dev/infiniband* on the compute nodes.
The /etc/tmi.conf file has the provider library entry.
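(For reference, a TMI configuration file usually holds one line per provider in the form <provider> <version> <library> <options>; the version and library name below are only illustrative and depend on the installation.)
psm 1.2 libtmip_psm.so " "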
Why do the simulations run fine yet not use InfiniBand? I can ping and ssh over InfiniBand, but MPI will not use it.
Thanks in advance.
Greetings,
I have now configured DAPL.
The simulation starts and I get the following messages, but there is still no traffic over InfiniBand:
[12] MPI startup(): DAPL provider ofa-v2-qib0-1s
[6] MPI startup(): shm and dapl data transfer modes
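(The DAPL selection amounts to something like the lines below; the provider name is the one from the startup message and should match an entry in the DAT configuration file, typically /etc/dat.conf.)
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-qib0-1s
grep ofa-v2-qib0 /etc/dat.conf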
Try removing these: I_MPI_DEVICE, I_MPI_FALLBACK_DEVICE
And set I_MPI_FALLBACK to 0.
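In the job script that would look roughly like:
unset I_MPI_DEVICE
unset I_MPI_FALLBACK_DEVICE
export I_MPI_FALLBACK=0
With fallback disabled, startup should abort with an error instead of silently switching to another fabric if TMI/PSM cannot be used.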
I have tried this, and also dapl, tmi, and so on. It is strange: the OFED stack is installed, ibstatus looks OK, and the nodes can ping each other, but the applications will not communicate. Are there any settings to make in /etc/infiniband/openibd.conf?
The I_MPI_DEVICE setting conflicts with the I_MPI_FABRICS setting. (I_MPI_DEVICE is deprecated.)
What error message do you get when MPI is not allowed to fall back? (If it still ran, something else is going on; maybe Ansys is overriding your settings.)
Try running the Intel MPI Benchmarks -- IMB-MPI1 is in the same directory as mpirun.
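For example (hostnames taken from your debug output; adjust as needed):
export I_MPI_FABRICS=shm:tmi
export I_MPI_FALLBACK=0
export I_MPI_DEBUG=5
mpirun -n 2 -ppn 1 -hosts qingclinf-01.hpc.cluster,qingclinf-02.hpc.cluster IMB-MPI1 PingPong
If PingPong shows latencies of a few microseconds and bandwidth well beyond what gigabit Ethernet can deliver, the messages are going over InfiniBand.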
Hey guys,
I think the HPC cluster was actually working from the start.
When I use perfquery (the native InfiniBand port counter tool) I can see that packets are going through, but I cannot see them with ifconfig.
I think the TCP/IP stack (IPoIB) is meant for applications that want to use sockets over InfiniBand; the solver instead uses InfiniBand natively, so the traffic goes through the InfiniBand stack rather than showing up on the ib0 interface.
I do not get an error. The simulation runs and I get this:
[11] MPI startup(): shm and tmi data transfer modes
So although I see a lot of data over Ethernet, I think those are only sockets with Ethernet IPs used for exchanging information, and the calculations are going between the MPI ranks natively through the InfiniBand interfaces.
How could I confirm / benchmark this?
perfquery
# Port counters: Lid 1 port 1 (CapMask: 0x200)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................0
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................252512975
PortRcvData:.....................244666221
PortXmitPkts:....................1301352
PortRcvPkts:.....................1308427
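One check I can think of: snapshot the port counters before and after a run and compare the deltas, something like:
perfquery > counters_before.txt
# run the simulation or a benchmark
perfquery > counters_after.txt
diff counters_before.txt counters_after.txt
If PortXmitData / PortRcvData grow by roughly the amount of data the solver exchanges, the traffic is indeed on InfiniBand.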
If it runs with I_MPI_FABRICS=shm:tmi and I_MPI_FALLBACK=0, then the messages are going over InfiniBand.
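A quick way to double-check from the output alone:
export I_MPI_FALLBACK=0
export I_MPI_DEBUG=5
With fallback disabled, startup aborts if TMI/PSM cannot be initialized, so the "shm and tmi data transfer modes" lines you already see would then confirm that TMI, and therefore InfiniBand, is in use.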
