I am running MPSS 3.5 and OFED+ 7.3.1.0.12 and have two nodes, each with Phi cards and QLogic HBAs. I believe that if I set I_MPI_FABRICS to either tcp or tmi, everything works, but I've heard that dapl is faster and I'm having trouble getting it to work everywhere. It works when the MPI tasks are either only on the hosts or only on the cards of a single host. When there are tasks on both a host and a card, it appears to have trouble connecting to the IP address that is added during the ofed-mic service startup (in 192.0.2.0/24). I noticed that I am unable to ping that address on a card from the host, or vice versa. On the other hand, I was able to run dtest from the dapl-utils package, and once I asked for the correct providers it passed the test and appeared to use those IPs for the connection.
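For reference, the checks I ran looked roughly like this. The addresses, hostnames, and provider names below are illustrative placeholders, not my exact configuration (the providers should come from whatever /etc/dat.conf lists on each side):

```shell
# From the host: try to reach the address ofed-mic added on the card.
ping -c 3 192.0.2.100

# From the card (via ssh): the same check in the other direction.
ssh mic0 ping -c 3 192.0.2.1

# dtest from dapl-utils: server side on the card, client on the host.
# Provider names are taken from /etc/dat.conf on each end; the ones
# below are examples only.
ssh mic0 dtest -P ofa-v2-scif0 &
dtest -P ofa-v2-mcm -h 192.0.2.100
```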
Can anybody tell me if I should be able to ping the 192.0.2.x address between a card and the host?
Can anybody tell me which dapl providers should be used on each end of the connection between a host and a card with a QLogic HBA and Intel's OFED+?
What else should I be doing to debug this?
Thanks,
Mike Robbert
Hi Michael,
'tmi' (or 'shm:tmi') is the recommended fabric for Intel True Scale (formerly QLogic) InfiniBand adapters. The 'dapl' fabric may be unstable and perform suboptimally on such configurations.
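A minimal launch sketch for that setup, assuming Intel MPI with ranks on both the host and the coprocessor (the hostnames, rank count, and binary are illustrative):

```shell
# Select shared memory within a node and TMI between nodes/cards.
export I_MPI_FABRICS=shm:tmi
# PSM is the TMI provider used by True Scale hardware.
export I_MPI_TMI_PROVIDER=psm
# Allow ranks to be placed on the Xeon Phi coprocessor.
export I_MPI_MIC=enable
# Debug output prints which fabric/provider was actually selected.
export I_MPI_DEBUG=5

mpirun -n 8 -hosts node1,node1-mic0 ./a.out
```

The I_MPI_DEBUG output is the quickest way to confirm the run really went over tmi rather than silently falling back to tcp.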
That is good to know. Are there any published documents with expected latency and bandwidth numbers? In other words, how do we know if this is configured properly? Right now the numbers I'm seeing are nowhere near those published in this Colfax paper ( http://research.colfaxinternational.com/post/2014/03/11/InfiniBand-for-MIC.aspx). I know they were using Mellanox, but shouldn't we expect Intel's own fabric to work at least as well with their Phi cards? Our latency at larger message sizes goes through the roof, and the bandwidth is considerably lower as well.
If we want good communication performance, do we need to swap the cards out for Mellanox, or are we missing some tuning?
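One way to get comparable numbers is to run the same point-to-point benchmarks the Colfax paper used (OSU micro-benchmarks) over each fabric in turn. A sketch, assuming the benchmarks are already built for host and card and the hostnames are placeholders:

```shell
# Compare host<->card latency and bandwidth across fabrics.
export I_MPI_MIC=enable

for fabric in shm:tmi dapl tcp; do
    echo "== fabric: $fabric =="
    I_MPI_FABRICS=$fabric mpirun -n 2 -hosts node1,node1-mic0 ./osu_latency
    I_MPI_FABRICS=$fabric mpirun -n 2 -hosts node1,node1-mic0 ./osu_bw
done
```

Running all three fabrics back to back at least shows whether tmi is behaving sanely relative to tcp, independent of how it compares to the Mellanox results in the paper.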